Skip to content

Issue #19548: Fix IllegalTokenText in google_checks.xml to detect 2-d…#19549

Open
vivek-0509 wants to merge 1 commit intocheckstyle:masterfrom
vivek-0509:fix-google-checks-2digit-octal-escape-fn
Open

Issue #19548: Fix IllegalTokenText in google_checks.xml to detect 2-d…#19549
vivek-0509 wants to merge 1 commit intocheckstyle:masterfrom
vivek-0509:fix-google-checks-2digit-octal-escape-fn

Conversation

@vivek-0509
Copy link
Copy Markdown
Contributor

@vivek-0509 vivek-0509 commented Mar 31, 2026

Issue #19548:

The regex only detected 3-digit octal forms (e.g. \012) but missed equivalent 2-digit forms (e.g. \12) for characters with special escape sequences per Google Java Style Guide §2.3.2, and JLS §3.10.7

Fix: changed 0(10|11|...) to 0?(10|11|...) making the leading zero optional. Added integration test coverage for all 8 two-digit octal escapes.

Diff Regression config: https://gist.githubusercontent.com/vivek-0509/26f7dc0977fa0ecaf66e127eb4245194/raw/171fc84ff7037cad652da44dbd135f994a03e163/base_config.xml

Diff Regression patch config: https://gist.githubusercontent.com/vivek-0509/4076379193a16479693fc871ec0f1cd8/raw/e0cc1597cde728943ee64faec406b5310e512628/patch_config.xml

@vivek-0509
Copy link
Copy Markdown
Contributor Author

Github, generate report

@github-actions
Copy link
Copy Markdown
Contributor

@vivek-0509
Copy link
Copy Markdown
Contributor Author

@romani ping

1 similar comment
@vivek-0509
Copy link
Copy Markdown
Contributor Author

@romani ping

Copy link
Copy Markdown
Contributor

@stoyanK7 stoyanK7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue and PR look solid 👍

One small comment:

String r5 = "\\15"; // violation 'Consider using special escape sequence'
String r6 = "\\40"; // violation 'Consider using special escape sequence'
String r7 = "\\42"; // violation 'Consider using special escape sequence'
String r8 = "\\47"; // violation 'Consider using special escape sequence'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a separate issue, but should sequences starting with \\ actually be flagged?

For example, "\\47" is just a literal backslash \ followed by 47, not the octal escape \47, since the backslash itself is escaped.

So I’d expect cases like "\\47" (and "\\11") not to be a violation.

Please let me know if I’m misunderstanding how this escaping thing works.

@romani
Copy link
Copy Markdown
Member

romani commented Apr 12, 2026

This is false positive
https://checkstyle-diff-reports.s3.us-east-2.amazonaws.com/b2b6a47_2026084240/reports/diff/guava/index.html#A1
It is windows so path. Please add it to our inputs.

please add to inputs:
https://checkstyle-diff-reports.s3.us-east-2.amazonaws.com/b2b6a47_2026084240/reports/diff/lombok-ast/index.html#A2

it might be different issue , but in way we do violation I do not understand what is wrong here
https://checkstyle-diff-reports.s3.us-east-2.amazonaws.com/b2b6a47_2026084240/reports/diff/openjdk25/index.html#A1 , please add it inputs and make copy of such string with fixed values.

add to our inputs this https://checkstyle-diff-reports.s3.us-east-2.amazonaws.com/b2b6a47_2026084240/reports/diff/openjdk25/index.html#A126 , I am not sure what is expected form of code here, please add fixed copy of this line, to clearly point on part that need fix.

@vivek-0509
Copy link
Copy Markdown
Contributor Author

vivek-0509 commented Apr 12, 2026

@stoyanK7 @romani
i think this is a fundamental limitation of how IllegalTokenText works. The check uses find() to match the regex against raw token text as a substring, without any understanding of escape sequence boundaries. It simply scans the text for any substring that matches the pattern, regardless of whether that substring is the actual complete escape sequence in the source code.

In the openjdk case, \100 is a single 3-digit octal escape, the lexer consumes all three digits greedily. But find() sees the characters \, 1, 0, 0 and matches \10 as a substring, treating it as a 2-digit octal for backspace. It has no concept that \100 is one indivisible escape sequence. In the Guava case, \\11 in source is an escaped backslash followed by 11 as part of a Windows path date, the lexer consumes \\ as a pair, leaving 11 as literal characters. But find() matches \11 starting at the second backslash, not knowing the two backslashes were already paired by the lexer.

The \100 type of false positive can be fixed with a negative lookahead in the regex, if the 2-digit match is followed by another octal digit, it is part of a 3-digit octal and should be skipped, this would correctly allows \108 to still be flagged, since 8 is not an octal digit and the lexer would parse it as \10 (a genuine 2-digit octal) followed by a literal 8. However, the \\11 case cannot be cleanly fixed through regex alone. A lookbehind (?<!\\) would help simple cases like \\11, but regex fundamentally cannot count backslash pairs to determine whether a backslash starts a new escape sequence or is itself escaped.

I can add the lookahead to fix the \100-type false positives, and will add the patterns from the diff reports to the test inputs, and we can open a separate issue for the \\ escape boundary limitation since it affects both 2-digit and 3-digit patterns. What is your take on this

@stoyanK7
Copy link
Copy Markdown
Contributor

Thanks for the detailed response @vivek-0509

The \100 type of false positive can be fixed with a negative lookahead in the regex, if the 2-digit match is followed by another octal digit, it is part of a 3-digit octal and should be skipped, this would correctly allows \108 to still be flagged, since 8 is not an octal digit and the lexer would parse it as \10 (a genuine 2-digit octal) followed by a literal 8.

Is it too much of a change to fix this scenario in this PR? False positives can be problem for users so we should avoid introducing them. I'm in favor of fixing that here.

However, the \11 case cannot be cleanly fixed through regex alone. A lookbehind (?<!\) would help simple cases like \11, but regex fundamentally cannot count backslash pairs to determine whether a backslash starts a new escape sequence or is itself escaped.

Even if it wouldn't eliminate all false positives, it should eliminate most, right? I'm in favor of adding a lookbehind. But in a separate issue/PR because this false positive has been present in the check before this PR. We are not introducing it.

@vivek-0509
Copy link
Copy Markdown
Contributor Author

vivek-0509 commented Apr 16, 2026

@stoyanK7

Sounds good, I agree with both points.
I'll update this PR to add the negative lookahead to prevent false positives like \100. I'll also add the real world patterns from the diff reports to the test inputs as roman requested earlier. Yes it will eliminate most of the cases, I'll open a separate issue for the lookbehind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants