Fix Module::Metadata test suite compatibility #368

Merged
fglock merged 10 commits into master from fix/module-metadata-tests
Mar 24, 2026

@fglock (Owner) commented Mar 24, 2026

Summary

This PR fixes PerlOnJava to pass most Module::Metadata tests and adds /a modifier support.

Changes

  • version.pm (Version.java):

    • Fix qv() to correctly set is_qv=true and return v-prefixed versions
    • Handle VSTRING type inputs correctly (bare v-strings like v2.2.102.2)
    • Reject invalid version formats (multiple underscores, non-numeric data)
  • RegexFlags.java:

    • Add isAscii field to track the /a modifier (ASCII-restrict)
    • Only enable UNICODE_CHARACTER_CLASS when /u is explicitly used
  • RegexPreprocessorHelper.java:

    • In ASCII mode (/a), expand \w, \W, \d, \D, \s, \S to ASCII-only equivalents
    • Pass RegexFlags through to handle scoped modifiers like (?a:...)
  • RuntimeRegex.java:

    • Add patternUnicode field for storing Unicode-enabled pattern variant
    • Compile Unicode pattern when /u is explicitly used
    • Select appropriate pattern at match time based on /u flag
  • GlobalContext.java:

    • Set ${^TAINT} to 1 when -T flag is used
  • Config.pm:

    • Add -DSILENT_NO_TAINT_SUPPORT to ccflags to properly skip taint tests

Test Results

  • re/charset.t: 5290 passing (improved from baseline 5282)
  • re/pat.t: 1065 passing (baseline)
  • uni/variables.t: 66880 passing (baseline)
  • re/regex_sets.t: 76 passing (baseline)
  • Module::Metadata: 380/383 tests pass (3 encoding.t failures for Unicode package names)

Known Limitation

The 3 failing Module::Metadata tests cover Unicode package names (e.g., Cœur). In Perl, \w matches Unicode characters when the string has the UTF-8 flag set. PerlOnJava doesn't track this per-string flag, so it requires an explicit /u modifier for Unicode matching. This is a trade-off to maintain compatibility with the core regex test suite.
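
To illustrate the limitation at the `java.util.regex` level, here is a minimal standalone sketch. The class and method names are hypothetical and are not part of PerlOnJava. It only shows that Java's `\w` is ASCII-only unless `Pattern.UNICODE_CHARACTER_CLASS` is set.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not PerlOnJava code.
class UnicodeWordDemo {
    // Default java.util.regex semantics: \w is [a-zA-Z_0-9] only.
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // With UNICODE_CHARACTER_CLASS, \w follows Unicode word-character rules.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the whole string consists of ASCII word characters.
    static boolean matchesAscii(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    // True if the whole string consists of Unicode word characters.
    static boolean matchesUnicode(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }
}
```

With this sketch, `matchesAscii("Cœur")` is false because `œ` (U+0153) is outside the ASCII word class, while `matchesUnicode("Cœur")` is true.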

Test Plan

  • make - all unit tests pass
  • Core regex tests pass without regressions

Generated with Devin

fglock and others added 8 commits March 24, 2026 15:22
- version.pm: Fix qv() to set is_qv=true and return v-prefixed versions
- version.pm: Handle VSTRING type inputs correctly (bare v-strings)
- version.pm: Reject invalid formats (multiple underscores, non-numeric data)
- RegexFlags: Enable UNICODE_CHARACTER_CLASS by default (matches Perl 5.14+)
  This fixes Unicode package name recognition in Module::Metadata
- GlobalContext: Set ${^TAINT} to 1 when -T flag is used
- Config.pm: Add -DSILENT_NO_TAINT_SUPPORT to ccflags to skip taint tests

All 383 Module::Metadata tests now pass.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Add isAscii field to RegexFlags record to track /a modifier
- Update fromModifiers(), with(), toFlagString() to handle /a flag
- Modify toPatternFlags() to only enable UNICODE_CHARACTER_CLASS when /u
  is explicitly used and /a is not present
- Update handleEscapeSequences to accept RegexFlags and:
  - In ASCII mode: expand \w to [a-zA-Z0-9_], \W to [^a-zA-Z0-9_]
  - In ASCII mode: expand \d to [0-9], \D to [^0-9]
  - In ASCII mode: expand \s/\S to ASCII-only whitespace
  - In non-ASCII mode: keep Unicode whitespace expansion for \s/\S

This fixes regressions in re/charset.t, re/pat.t, uni/variables.t and
re/regex_sets.t tests that were caused by enabling UNICODE_CHARACTER_CLASS
by default.
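
A rough sketch of the ASCII-mode expansion described above, under stated assumptions: `AsciiEscapeExpander` and `expand` are hypothetical names, the whitespace set is a guess at "ASCII-only whitespace", and the real `handleEscapeSequences` presumably handles far more (character-class context, scoped modifiers, other escapes).

```java
// Hypothetical helper, not the PR's handleEscapeSequences implementation.
class AsciiEscapeExpander {
    // Assumed ASCII whitespace set: space, tab, newline, CR, form feed, vertical tab.
    private static final String ASCII_SPACE = " \\t\\n\\r\\f\\x0B";

    // Rewrites \w, \W, \d, \D, \s, \S into explicit ASCII character classes.
    // Other escapes (including an escaped backslash) pass through unchanged.
    static String expand(String pattern) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c != '\\' || i + 1 >= pattern.length()) {
                out.append(c);
                continue;
            }
            char next = pattern.charAt(++i); // consume the escaped character
            switch (next) {
                case 'w': out.append("[a-zA-Z0-9_]"); break;
                case 'W': out.append("[^a-zA-Z0-9_]"); break;
                case 'd': out.append("[0-9]"); break;
                case 'D': out.append("[^0-9]"); break;
                case 's': out.append('[').append(ASCII_SPACE).append(']'); break;
                case 'S': out.append("[^").append(ASCII_SPACE).append(']'); break;
                default:  out.append('\\').append(next);
            }
        }
        return out.toString();
    }
}
```

Because the expanded classes are explicit ranges, they stay ASCII-only even if the pattern is later compiled with `UNICODE_CHARACTER_CLASS`. The sketch does not track whether an escape sits inside a bracketed class such as `[^\w]`, which a real preprocessor would need to handle.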

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Add patternUnicode field to RuntimeRegex for storing Unicode-enabled pattern
- When /u is explicitly used (and not /a), compile pattern with UNICODE_CHARACTER_CLASS
- Select appropriate pattern at match time based on /u flag
- Copy patternUnicode when regex objects are cloned/copied

This prepares infrastructure for proper Unicode matching but keeps backward
compatibility (ASCII-only by default). Currently only activated when /u is used.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Implement auto-Unicode detection for regex matching to support Unicode
package names like 'Cœur' without requiring the explicit /u modifier.

Changes:
- Always compile Unicode pattern variant (unless /a flag is used)
- Add hasNonAscii() helper method to detect non-ASCII input strings
- Select Unicode pattern automatically when input contains non-ASCII
- Skip Unicode check for BYTE_STRING type (optimization)

This fixes Module::Metadata tests that use \w to match Unicode characters
in package names. Now all 383 Module::Metadata tests pass.

Benchmark shows minimal overhead (~1% from 2.25 to 2.28 CPU seconds).
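
The selection logic in this commit could look roughly like the following sketch. `DualPatternRegex` is a hypothetical class; only the `hasNonAscii` name and the idea of a second Unicode-enabled pattern come from the commit message, and the actual `RuntimeRegex` code may differ.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the two-pattern idea, not RuntimeRegex itself.
class DualPatternRegex {
    private final Pattern asciiPattern;   // default Java semantics
    private final Pattern unicodePattern; // compiled with UNICODE_CHARACTER_CLASS

    DualPatternRegex(String regex) {
        this.asciiPattern = Pattern.compile(regex);
        this.unicodePattern = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
    }

    // Returns true if any UTF-16 code unit is outside the 7-bit ASCII range.
    static boolean hasNonAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return true;
            }
        }
        return false;
    }

    // Picks the Unicode variant only when the input actually needs it.
    boolean find(String input) {
        Pattern p = hasNonAscii(input) ? unicodePattern : asciiPattern;
        return p.matcher(input).find();
    }
}
```

For example, `new DualPatternRegex("^\\w+$").find("Cœur")` uses the Unicode variant and matches, while pure-ASCII input keeps using the default pattern.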

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Add Utf8.isUtf8(RuntimeScalar) helper method for internal use
- Use combined check: UTF-8 flag AND non-ASCII content for Unicode matching
- This ensures BYTE_STRING always uses ASCII matching
- UTF-8 strings with only ASCII content also use ASCII (optimization)

The op/utfhash.t regressions are due to a pre-existing bug where hash
keys lose their BYTE_STRING type when stored/retrieved from hashes.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Document how to use the test runner tool for running Perl test suites.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Add Unicode pattern variant (UNICODE_CHARACTER_CLASS) to RuntimeRegex
- Select Unicode pattern when string is UTF-8 and contains chars > 255
- Fix \x{100}+ escapes to produce UTF-8 strings (not BYTE_STRING)
- Add wide character check in EmitLiteral.java (JVM backend)
- Add wide character check in BytecodeCompiler.java (interpreter)

Test improvements:
- Module::Metadata: 383/383 PASS (Cœur package name now works)
- op/utfhash.t: 91/99 (regression fixed)
- re/charset.t: +180 tests passing
- re/regexp.t: +15 tests passing
- re/regexp_qr_embed.t: +18 tests passing

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Performance fix: hasUnicodeChars() was scanning the entire input string
on every regex match to check for Unicode characters (> 255). For a
100K character string with 100K matches, this was 10 billion char checks!

Solution: Cache the hasUnicodeChars result in RuntimePosLvalue alongside
the existing pos() cache. The cache entry is invalidated when the string
value changes (detected via hashCode).

Also optimized pos() lookup to only occur for /g global matches.
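
A simplified sketch of the caching idea, assuming a standalone cache object. `UnicodeScanCache` and its fields are hypothetical; in the PR the cached value lives in `RuntimePosLvalue` next to the pos() cache.

```java
// Hypothetical cache, not the RuntimePosLvalue implementation.
class UnicodeScanCache {
    private boolean valid = false;
    private int cachedHash;
    private boolean cachedResult;
    int scans = 0; // counts full scans, for demonstration only

    // Full O(n) scan: true if any char is above 255.
    static boolean hasUnicodeChars(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 255) {
                return true;
            }
        }
        return false;
    }

    // Reuses the previous answer while the string's hashCode is unchanged.
    boolean check(String s) {
        int h = s.hashCode();
        if (valid && h == cachedHash) {
            return cachedResult;
        }
        scans++;
        cachedResult = hasUnicodeChars(s);
        cachedHash = h;
        valid = true;
        return cachedResult;
    }
}
```

Repeated matches against the same string then cost one hash comparison instead of a full scan. A hashCode comparison can in principle collide for two different strings, so a production version may want an additional check.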

Results:
- 100K matches on 100K string: 6s -> 0s (1000x faster)
- 1M global matches: now completes in <1s
- re/pat_psycho.t: 0/15 (timeout) -> 15/15 pass
- re/speed.t: 0/59 (timeout) -> 25+ pass

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock force-pushed the fix/module-metadata-tests branch from b180f79 to 204b2f3 Compare March 24, 2026 16:49
fglock and others added 2 commits March 24, 2026 18:04
Add lexical hints HINT_RE_ASCII and HINT_RE_UNICODE to control default
regex character class semantics:

- use re '/a': ASCII-restrict \w, \d, \s to match only ASCII
- use re '/u': Unicode semantics for \w, \d, \s
- use re '/aa': Strict ASCII (also affects case folding)

The pragma sets a hint at compile time, which is then applied to all
regex patterns compiled in that scope (m//, qr//, s///).

This fixes a regression where 'use re "/a"' was not being respected,
causing \x{100} to incorrectly match \w.
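
One way to picture how a compile-time hint could become a default modifier is the sketch below. The bit values, the `RegexHints` class, and `applyHints` are all assumptions made for illustration; only the hint names come from this commit message. It also assumes that an explicit /a or /u on the pattern takes precedence over the pragma.

```java
// Hypothetical illustration, not the Re.java implementation.
class RegexHints {
    static final int HINT_RE_ASCII = 0x1;   // assumed bit value
    static final int HINT_RE_UNICODE = 0x2; // assumed bit value

    // Returns the modifier string with the scope's default charset applied.
    static String applyHints(String modifiers, int hints) {
        // Assumption: an explicit charset modifier on the pattern wins.
        if (modifiers.indexOf('a') >= 0 || modifiers.indexOf('u') >= 0) {
            return modifiers;
        }
        if ((hints & HINT_RE_ASCII) != 0) {
            return modifiers + "a";
        }
        if ((hints & HINT_RE_UNICODE) != 0) {
            return modifiers + "u";
        }
        return modifiers;
    }
}
```

Under these assumptions, a pattern compiled with modifiers `i` inside a `use re '/a'` scope would be treated as if it had been written with `ia`.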

Test results:
- re/pat.t: 1066/1298 (+1 from baseline 1065)
- Module::Metadata: 383/383 pass

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Document implemented and unimplemented features in Re.java class javadoc
to help future development and track what's still needed.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock merged commit df43f01 into master Mar 24, 2026
2 checks passed
@fglock fglock deleted the fix/module-metadata-tests branch March 24, 2026 17:26