Fix crash on invalid UTF-8 bytes in NatSpec comments. Also fix missing Token::Illegal handling in switch, giving a misleading error to users. #16520
Conversation
clonker
left a comment
some minor adjustments. you can use the single-argument version of validateUTF8 if you're not using the invalidPos anyways. And I think you can get away without using m_skippedComments[...].error as intermediary if you set the error directly when it is encountered and then return Token::Illegal in scanSlash.
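For context, the two flavors of validation being discussed can be sketched in Python (hypothetical helper names; the real `solidity::util::validateUTF8` is C++ and lives in `libsolutil/UTF8.h`):

```python
def validate_utf8(data: bytes) -> bool:
    # Single-result check, analogous to the single-argument overload
    # suggested above.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def validate_utf8_with_pos(data: bytes):
    # Variant that also reports the offset of the first invalid byte,
    # analogous to the overload taking an invalidPos out-parameter.
    try:
        data.decode("utf-8")
        return True, None
    except UnicodeDecodeError as exc:
        return False, exc.start
```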
msooseth
left a comment
Thanks, very good points! I think the last one is better left as-is though?
I have also fixed the python script for the error checking, please double-check that one... don't want to accidentally disable all tests or something :S
| if basename(f) == "invalid_utf8_sequence.sol": | ||
| continue # ignore the test with broken utf-8 encoding | ||
| if "invalid_utf8" in basename(f): | ||
| continue # ignore tests with invalid utf-8 encoding |
Please don't generalize this hack :)
The issue is in the splitting script - it should just handle invalid UTF-8 properly and not generate UnicodeDecode error ever (see #9710 (comment)). It's very low priority, but if it's getting in the way, we should just do a proper fix instead of making the hack more elaborate :)
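One way the splitting script could avoid `UnicodeDecodeError` entirely is to read the file as bytes and decode with a lenient error handler; a sketch of the idea (not the actual script):

```python
import tempfile
from pathlib import Path

def read_source_leniently(path: str) -> str:
    # Decode with "surrogateescape" so invalid bytes survive as lone
    # surrogates instead of raising, and can be re-encoded losslessly.
    return Path(path).read_bytes().decode("utf-8", errors="surrogateescape")

# Round-trip demo on a file containing an invalid UTF-8 byte.
with tempfile.NamedTemporaryFile(suffix=".sol", delete=False) as f:
    f.write(b"// comment \xff\ncontract C {}\n")
    name = f.name

text = read_source_leniently(name)  # no UnicodeDecodeError raised
assert text.encode("utf-8", errors="surrogateescape") == b"// comment \xff\ncontract C {}\n"
```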
I fixed this I hope. Please check :)
clonker
left a comment
aside from two missing newlines (sue me) it looks good to me now, thanks!
| //� | ||
| } } | ||
| // ---- | ||
| // ParserError 8936: (59-62): Invalid UTF-8 sequence in comment. |
It's a bit funny to get error 8936 here and in other places error 1465, once without and once with an Illegal token: prefix. I understand the different numbers are because one is in Parser the other in AsmParser. Still, a bit inconsistent with the prefix, too. :) Nothing to be done in this PR though, I'd think.
We should add the prefix then. If we have it on the Yul/assembly side, I see no reason not to have it in the Solidity version as well. This is pretty much the same error message, just in two different parsers.
clonker
left a comment
looks good to me now, thanks!
This pull request is stale because it has been open for 14 days with no activity.

Can someone please review this? @cameel or @nikola-matic maybe?

I'll try to get to it tomorrow.
cameel
left a comment
I think there are still cases that will produce invalid UTF-8 and crash the compiler on JSON output. Check my comments below.
Also there are a few more cases in Yul where the error message is subpar, though I'm on the fence whether it's worth addressing those. It's whack-a-mole. But I pointed them out in case you want to try.
Some tests are also missing and the changelog may not be entirely accurate (it seems to me that there was only one crash case here, not two).
These tests would probably fit better in syntaxTests/comments/.
And the whole comments/ should probably just be a subdir in syntaxTests/parsing/, but that's something to fix another time so just FYI.
| { /*�*/ | ||
| } | ||
| // ---- | ||
| // ParserError 1465: (2-6): Illegal token: Invalid UTF-8 sequence in comment. |
Note that the PR fixes the message only at the beginning of statements, which means that we still get a subpar message for this:
/* <ILLEGAL UTF-8 HERE> */
{}

Error (4294): Expected keyword "object".
This may be worth fixing since this is basically what you did on Solidity level already.
The error for illegal UTF-8 in debug annotations is also of this kind:
/// @use-src 0:"<ILLEGAL UTF-8 HERE>"

Error (9804): Error parsing arguments to @use-src. Expected: <number> ":" "<filename>", ...
There are also many other cases that won't print the specialized message for UTF-8 errors. For example:
{
let /* <ILLEGAL UTF-8 HERE> */ x := 1
}

Error (2314): Expected identifier but got 'ILLEGAL'
These are probably not worth fixing because I think you'd have to special-case a lot of individual token comparisons and the message at least references an illegal token, so it's clearer.
Can we just please fix ONE thing? I understand that a million other things could be fixed, but I am trying to fix a single thing and it's now 50+ files, 250 LoC, and we are still not done. I really, really just want this damned thing not to crash. And instead, we are nitpicking on the exact wording of the changelog, and it's an extra EOL in the test case that seems to cause the most issues. I just want to fuzz the compiler so we can find actual, SERIOUS bugs.
| error["formattedMessage"] = (_formattedMessage.length() > 0) ? _formattedMessage : _message; | ||
| std::string const& fmtMsg = (_formattedMessage.length() > 0) ? _formattedMessage : _message; | ||
| error["formattedMessage"] = util::validateUTF8(fmtMsg) ? fmtMsg : _message; |
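The fallback logic in the diff above, modeled in Python on raw bytes (a Python `str` cannot hold invalid UTF-8, so this is only an illustration of the C++ behavior):

```python
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def pick_formatted_message(formatted: bytes, plain: bytes) -> bytes:
    # Prefer the formatted message when present, but fall back to the
    # plain message if the formatted one contains invalid UTF-8.
    candidate = formatted if formatted else plain
    return candidate if is_valid_utf8(candidate) else plain
```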
I think you're missing a test that covers this change. The two command-line tests you added did not have UTF-8 in the formatted message, but in the outputs. Those outputs no longer show up due to the fact that we error out earlier in the parser, not due to this change.
To actually trigger this, you need something where invalid UTF-8 is included in the code snippet that goes into the formatted message. For example:
contract {} /* <INVALID UTF-8 HERE> */

Would be good to have a command-line test that has two files. One with and one without invalid UTF-8 so that in the output we can see the difference in formatted message between them.
I addressed this in 183cef3 which adds standard_invalid_utf8_in_formatted_message
| @@ -72,7 +73,8 @@ Json formatError( | |||
| error["component"] = _component; | |||
| error["severity"] = Error::formatErrorSeverityLowercase(Error::errorSeverity(_type)); | |||
| error["message"] = _message; | |||
The unformatted message might sometimes include invalid UTF-8 too. Try this:
import "\xff";

Error: Source "�" not found: File not found. Searched the following locations: "".
I think we will actually need some escaping/stripping here. Perhaps we should convert any invalid unicode we find to escape sequences? Or just all Unicode? Or maybe replace it with placeholder chars?
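The "convert to escape sequences" option could look like this in Python, where the codec's `backslashreplace` handler does exactly that (a sketch of the idea under discussion, not the compiler's code):

```python
def sanitize_message(raw: bytes) -> str:
    # Each invalid byte becomes a visible \xNN escape, so nothing is
    # silently dropped or reinterpreted.
    return raw.decode("utf-8", errors="backslashreplace")
```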
sourceLocation and secondarySourceLocation have the same problem as they can include file paths. Though I'm not sure if we can even reach this place without crashing the compiler first - these paths would have to be under sources in the input as well so this may crash the JSON parser already. They'll also be under contracts if things compile without errors.
We should check that - can you add a test case that has invalid UTF-8 under sources in the input?
Also, if such an input crashes the compiler, we need to add validateUTF8() on any JSON input we receive. Or at least catch nlohmann::json::type_error it triggers.
I think re-interpreting what the user wrote, e.g. by stripping characters from an import could be EXTREMELY dangerous. I'd much rather fail spectacularly, than silently import some other code and potentially cause some serious unintended behaviour :S
| * Parser: Fix a crash caused by invalid UTF-8 sequences in comments (including after inline assembly blocks). Such sequences now produce a proper error message instead. | ||
| * Parser: Fix a crash caused by unhandled ``Token::Illegal`` at the top level of Solidity source and inside Yul ``parseStatement``. Invalid tokens now produce a fatal parser error with a descriptive message. | ||
| * Standard JSON: Fix invalid UTF-8 bytes in ``formattedMessage`` when a source location excerpt contains invalid UTF-8. The field now falls back to the plain message in that case. |
First, these entries are aimed at end users so whenever possible they should refer to language concepts and observable compiler behavior rather than implementation details like Token::Illegal or parseStatement(). Users will have no idea what that is.
Second, were illegal tokens actually crashing Solidity parser and parseStatement()? I checked a few of your test cases and none of them does that for me on 0.8.34. At least not on the CLI, but if it happens only in Standard JSON then it's still a part of the same bug with comments.
This is how I see the situation right now, please correct me if you think this is not accurate:
| * Parser: Fix a crash caused by invalid UTF-8 sequences in comments (including after inline assembly blocks). Such sequences now produce a proper error message instead. | |
| * Parser: Fix a crash caused by unhandled ``Token::Illegal`` at the top level of Solidity source and inside Yul ``parseStatement``. Invalid tokens now produce a fatal parser error with a descriptive message. | |
| * Standard JSON: Fix invalid UTF-8 bytes in ``formattedMessage`` when a source location excerpt contains invalid UTF-8. The field now falls back to the plain message in that case. | |
| * Parser: Treat invalid UTF-8 sequences in comments and Natspec as errors. Such sequences used to be ignored and would cause a crash if they ended up inside JSON output. | |
| * Parser: Fix inaccurate or vague error messages in errors triggered by invalid UTF-8 in comments. | |
| * Standard JSON: Fix invalid UTF-8 bytes in the `formattedMessage` field of errors when a source location excerpt contains invalid UTF-8. The field now falls back to the plain message in that case. |
Perhaps we should explicitly mention that it affected both Yul and Solidity too?
Also, the last one may have to be updated later since I think not only formattedMessage is affected.
You also don't have any tests covering the non-comment case. Or are the tests with strings meant to cover that? I expected to see at least one case similar to what you have with comments just with the invalid UTF-8 outside of a comment.
I applied your changelog.
I have now fixed this:
You also don't have any tests covering the non-comment case. Or are the tests with strings meant to cover that? I expected to see at least one case similar to what you have with comments just with the invalid UTF-8 outside of a comment
it shows up as Invalid token. I added two tests like this, one for Yul, one for Solidity.
Notice that the system crashed before: it failed an assertion and crashed. That's not the case anymore, so the system is already better. Arguably the error should not be Illegal token, but that's at least a valid rationale for stopping parsing. A crash is not a rationale, that's just a "??? best I crash!", which is not the same as Illegal token, which is actually correct, even if not precise.
Co-authored-by: Nikola Matić <nikola.matic@ethereum.org>
Co-authored-by: Kamil Śliwak <kamil.sliwak@codepoets.it>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixes #16519
Problem
The compiler crashes with an unhandled `nlohmann::json::type_error` when a source file contains invalid UTF-8 bytes in a NatSpec comment (e.g. `/// \xF7`). The scanner copies raw source bytes into the NatSpec comment literal without UTF-8 validation. These bytes end up in the metadata JSON via `natspecUser`/`natspecDev`, and `jsonPrint`'s `dump(ensure_ascii=true)` then throws `type_error::316` on the invalid byte.

A secondary issue: even after adding the scanner error, the compiler reported a misleading "Expected pragma, import directive..." message instead of the real error, because `Token::Illegal` fell into the `default:` branch of the top-level parser switch, which ignores `currentError()`.

Fix
Scanner (`liblangutil/Scanner.cpp`): After scanning each NatSpec comment literal, validate it using the existing `solidity::util::validateUTF8` from `libsolutil/UTF8.h` (which `liblangutil` already links against). If invalid UTF-8 is found, report the new `ScannerError::InvalidUTF8InComment` error.

- `scanSingleLineDocComment`: validate after `literal.complete()`; set `m_skippedComments[NextNext].error` if invalid. `scanSlash` checks and propagates this as `Token::Illegal`.
- `scanMultiLineDocComment`: validate after `literal.complete()`; return `setError(ScannerError::InvalidUTF8InComment)` if invalid.

Parser (`libsolidity/parsing/Parser.cpp`): Add an explicit `case Token::Illegal:` in the top-level source-unit switch that calls `fatalParserError` with `to_string(m_scanner->currentError())`, matching the pattern already used elsewhere in the parser.
Output
Output now:
Disclaimers
Written by Claude Code, reviewed by myself.