Skip to content

Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling in switch, giving misleading error to users#16520

Open
msooseth wants to merge 31 commits intodevelopfrom
fix-utf8
Open

Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling in switch, giving misleading error to users#16520
msooseth wants to merge 31 commits intodevelopfrom
fix-utf8

Conversation

@msooseth
Copy link
Copy Markdown
Contributor

@msooseth msooseth commented Mar 12, 2026

Fixes #16519

Problem

The compiler crashes with an unhandled nlohmann::json::type_error when a source file
contains invalid UTF-8 bytes in a NatSpec comment (e.g. /// \xF7).

The scanner copies raw source bytes into the NatSpec comment literal without UTF-8
validation. These bytes end up in the metadata JSON via natspecUser/natspecDev,
and jsonPrint's dump(ensure_ascii=true) then throws type_error::316 on the
invalid byte.

A secondary issue: even after adding the scanner error, the compiler reported a
misleading "Expected pragma, import directive..." message instead of the real error,
because Token::Illegal fell into the default: branch of the top-level parser switch
which ignores currentError().

Fix

Scanner (liblangutil/Scanner.cpp): After scanning each NatSpec comment literal,
validate it using the existing solidity::util::validateUTF8 from libsolutil/UTF8.h
(which liblangutil already links against). If invalid UTF-8 is found, report the new
ScannerError::InvalidUTF8InComment error.

  • scanSingleLineDocComment: validate after literal.complete(); set
    m_skippedComments[NextNext].error if invalid. scanSlash checks and propagates
    this as Token::Illegal.
  • scanMultiLineDocComment: validate after literal.complete(); return
    setError(ScannerError::InvalidUTF8InComment) if invalid.

Parser (libsolidity/parsing/Parser.cpp): Add an explicit case Token::Illegal:
in the top-level source-unit switch that calls fatalParserError with
to_string(m_scanner->currentError()), matching the pattern already used elsewhere
in the parser.

Output

Output now:

./solc/solc --optimize --ir minimized-from-ab4ee1498fff0c71961d81fc308f1dee0be16f33

Warning: This is a pre-release compiler version, please do not use it in production.

Error: Invalid UTF-8 sequence in NatSpec comment.
 --> minimized-from-ab4ee1498fff0c71961d81fc308f1dee0be16f33:1:1:
  |
1 | ///
  | ^ (Relevant source part starts here and spans across multiple lines).

Disclaimers

Wirrten by Claude Code, reviewed by myself.

@msooseth msooseth requested a review from clonker March 12, 2026 13:59
@msooseth msooseth changed the title Fix crash on invalid UTF-8 bytes in NatSpec comments Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling for correct error message to users Mar 12, 2026
@msooseth msooseth changed the title Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling for correct error message to users Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling in switch, giving misleading error to users Mar 12, 2026
Copy link
Copy Markdown
Member

@clonker clonker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor adjustments. you can use the single-argument version of validateUTF8 if you're not using the invalidPos anyways. And I think you can get away without using m_skippedComments[...].error as intermediary if you set the error directly when it is encountered and then return Token::Illegal in scanSlash.

Comment thread liblangutil/Scanner.cpp Outdated
Comment thread liblangutil/Scanner.cpp Outdated
Comment thread liblangutil/Scanner.cpp
Comment thread liblangutil/Scanner.cpp
Copy link
Copy Markdown
Contributor Author

@msooseth msooseth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, very good points! I think the last one is better left as-is though?

I have also fixed the python script for the error checking, please double-check that one... don't want to accidentally disable all tests or something :S

Comment thread liblangutil/Scanner.cpp
Comment thread liblangutil/Scanner.cpp Outdated
Comment thread libsolidity/parsing/Parser.cpp
Comment thread scripts/isolate_tests.py Outdated
Comment on lines +162 to +163
if basename(f) == "invalid_utf8_sequence.sol":
continue # ignore the test with broken utf-8 encoding
if "invalid_utf8" in basename(f):
continue # ignore tests with invalid utf-8 encoding
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't generalize this hack :)

The issue is in the splitting script - it should just handle invalid UTF-8 properly and not generate UnicodeDecode error ever (see #9710 (comment)). It's very low priority, but if it's getting in the way, we should just do a proper fix instead of making the hack more elaborate :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this I hope. Please check :)

Comment thread test/libsolidity/syntaxTests/comments/natspec_invalid_utf8_multiline.sol Outdated
Comment thread test/libsolidity/syntaxTests/comments/natspec_invalid_utf8_multiline.sol Outdated
Comment thread liblangutil/Scanner.cpp Outdated
Comment thread scripts/isolate_tests.py
Comment thread liblangutil/Scanner.cpp Outdated
Copy link
Copy Markdown
Member

@clonker clonker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside from two missing newlines (sue me) it looks good to me now, thanks!

Comment thread test/cmdlineTests/standard_invalid_utf8_in_natspec_multiline_comment/args Outdated
//�
} }
// ----
// ParserError 8936: (59-62): Invalid UTF-8 sequence in comment.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit funny to get error 8936 here and in other places error 1465, once without and once with an Illegal token: prefix. I understand the different numbers are because one is in Parser the other in AsmParser. Still, a bit inconsistent with the prefix, too. :) Nothing to be done in this PR though, I'd think.

Copy link
Copy Markdown
Collaborator

@cameel cameel Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add the prefix then. If we have it on the Yul/assembly side, I see no reason not to have the in the Solidity version as well. This is pretty much the same error message, just in two different parsers.

Comment thread test/cmdlineTests/standard_invalid_utf8_in_comment_with_ir_requested/args Outdated
clonker
clonker previously approved these changes Mar 23, 2026
Copy link
Copy Markdown
Member

@clonker clonker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me now, thanks!

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 7, 2026

This pull request is stale because it has been open for 14 days with no activity.
It will be closed in 7 days unless the stale label is removed.

@github-actions github-actions bot added the stale The issue/PR was marked as stale because it has been open for too long. label Apr 7, 2026
@msooseth msooseth removed the stale The issue/PR was marked as stale because it has been open for too long. label Apr 8, 2026
@msooseth
Copy link
Copy Markdown
Contributor Author

msooseth commented Apr 8, 2026

Can someone please review this? @cameel or @nikola-matic maybe?

Comment thread Changelog.md Outdated
@cameel
Copy link
Copy Markdown
Collaborator

cameel commented Apr 8, 2026

I'll try to get to it tomorrow.

Copy link
Copy Markdown
Collaborator

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are still cases that will produce invalid UTF-8 and crash the compiler on JSON output. Check my comments below.

Also there are a few more cases in Yul where the error message is subpar, though I'm on the fence whether it's worth addressing those. It's whack-a-mole. But I pointed them out in case you want to try.

Some tests are also missing and the changelog may not be entirely accurate (it seems to me that there was only one crash case here, not two).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests would probably fit better in syntaxTests/comments/.

And the whole comments/ should probably just be a subdir in syntaxTests/parsing/, but that's something to fix another time so just FYI.

Comment thread libsolidity/interface/StandardCompiler.cpp Outdated
Comment on lines +1 to +4
{ /*�*/
}
// ----
// ParserError 1465: (2-6): Illegal token: Invalid UTF-8 sequence in comment.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the PR fixes the message only at the beginning of statements, which means that we still get a subpar message for this:

/* <ILLEGAL UTF-8 HERE> */
{}
Error (4294): Expected keyword "object".

This may be worth fixing since this is basically what you did on Solidity level already.

The error for illegal UTF-8 in debug annotations is also of this kind:

/// @use-src 0:"<ILLEGAL UTF-8 HERE>"
Error (9804): Error parsing arguments to @use-src. Expected: <number> ":" "<filename>", ...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also many other cases that won't print the specialized message for UTF-8 errors. For example:

{
    let /* <ILLEGAL UTF-8 HERE> */ x := 1
}
Error (2314): Expected identifier but got 'ILLEGAL'

These are probably not worth fixing because I think you'd have to special-case a lot of individual token comparisons and the message at least references an illegal token, so it's clearer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just please fix ONE thing? I understand that a million other things could be fixed, but I am trying to fix a single thing and it's now 50+ files, 250 LoC, and we are still not being done. I really really just want this damned thing not to crash. Really. And instead, we are nitpicking on the exact wording of the changelog. I just want to fuzz the compiler so we can find actual, SERIOUS bugs, but instead, it's and extra EOL in the test case that seems to cause the most issues. I really just want to fuzz the compiler so we can find serious bugs.

Comment on lines +75 to +77
error["formattedMessage"] = (_formattedMessage.length() > 0) ? _formattedMessage : _message;
std::string const& fmtMsg = (_formattedMessage.length() > 0) ? _formattedMessage : _message;
error["formattedMessage"] = util::validateUTF8(fmtMsg) ? fmtMsg : _message;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're missing a test that covers this change. The two command-line tests you added did not have UTF-8 in the formatted message, but in the outputs. Those outputs no longer show up due to the fact that we error out earlier in the parser, not due to this change.

To actually trigger this, you need something where invalid UTF-8 is included in the code snippet that goes into the formatted message. For example:

contract {} /* <INVALID UTF-8 HERE> */

Would be good to have a command-line test that has two files. One with and one without invalid UTF-8 so that in the output we can see the difference in formatted message between them.

Copy link
Copy Markdown
Contributor Author

@msooseth msooseth Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed this in 183cef3 which adds standard_invalid_utf8_in_formatted_message

@@ -72,7 +73,8 @@ Json formatError(
error["component"] = _component;
error["severity"] = Error::formatErrorSeverityLowercase(Error::errorSeverity(_type));
error["message"] = _message;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unformatted message might sometimes include invalid UTF-8 too. Try this:

import "\xff";
Error: Source "�" not found: File not found. Searched the following locations: "".

I think we will actually need some escaping/stripping here. Perhaps we should convert any invalid unicode we find to escape sequences? Or just all Unicode? Or maybe replace it with placeholder chars?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sourceLocation and secondarySourceLocation have the same problem as they can include file paths. Though I'm not sure if we can even reach this place without crashing the compiler first - these paths would have to be under sources in the input as well so this may crash the JSON parser already. They'll also be under contracts if things compile without errors.

We should check that - can you add a test case that has invalid UTF-8 under sources in the input?

Also, if such an input crashes the compiler, we need to add validateUTF8() on any JSON input we receive. Or at least catch nlohmann::json::type_error it triggers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think re-interpreting what the user wrote, e.g. by stripping characters from an import could be EXTREMELY dangerous. I'd much rather fail spectacularly, than silently import some other code and potentially cause some serious unintended behaviour :S

Comment thread Changelog.md Outdated
Comment on lines +18 to +20
* Parser: Fix a crash caused by invalid UTF-8 sequences in comments (including after inline assembly blocks). Such sequences now produce a proper error message instead.
* Parser: Fix a crash caused by unhandled ``Token::Illegal`` at the top level of Solidity source and inside Yul ``parseStatement``. Invalid tokens now produce a fatal parser error with a descriptive message.
* Standard JSON: Fix invalid UTF-8 bytes in ``formattedMessage`` when a source location excerpt contains invalid UTF-8. The field now falls back to the plain message in that case.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, these entries are aimed at end users so whenever possible they should refer to language concepts and observable compiler behavior rather than implementation details like Token::Illegal or parseStatement(). Users will have no idea what that is.

Second, were illegal tokens actually crashing Solidity parser and parseStatement()? I checked a few of your test cases and none of them does that for me on 0.8.34. At least not on the CLI, but if it happens only in Standard JSON then it's still a part of the same bug with comments.

This is how I see the situation right now, please correct me if you think this is not accurate:

Suggested change
* Parser: Fix a crash caused by invalid UTF-8 sequences in comments (including after inline assembly blocks). Such sequences now produce a proper error message instead.
* Parser: Fix a crash caused by unhandled ``Token::Illegal`` at the top level of Solidity source and inside Yul ``parseStatement``. Invalid tokens now produce a fatal parser error with a descriptive message.
* Standard JSON: Fix invalid UTF-8 bytes in ``formattedMessage`` when a source location excerpt contains invalid UTF-8. The field now falls back to the plain message in that case.
* Parser: Treat invalid UTF-8 sequences in comments and Natspec as errors. Such sequences used to be ignored and would cause a crash if they ended up inside JSON output.
* Parser: Fix inaccurate or vague error messages in errors triggered by invalid UTF-8 in comments.
* Standard JSON: Fix invalid UTF-8 bytes in the `formattedMessage` field of errors when a source location excerpt contains invalid UTF-8. The field now falls back to the plain message in that case.

Perhaps we should explicitly mention that it affected both Yul and Solidity too?

Also, the last one may have to be updated later since I think not only formattedMessage is affected.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also don't have any tests covering the non-comment case. Or are the tests with strings meant to cover that? I expected to see at least one case similar to what you have with comments just with the invalid UTF-8 outside of a comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied your changelog.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now fixed this:

You also don't have any tests covering the non-comment case. Or are the tests with strings meant to cover that? I expected to see at least one case similar to what you have with comments just with the invalid UTF-8 outside of a comment

it shows up as Invalid token. I added 2 tests like this, one for yul, one for sol.

Notice that the system crashed before. It assert failed, crashed. That's not the case anymore. So the system is currently better, it does not crash. Arguably, it should not be Illegal token, but that's at least a valid rationale. A crash is not a rationale for stopping parsing, that's just a "??? best I crash!" which is not the same as Illegal token, which is actually correct, even if not precise.

Co-authored-by: Nikola Matić <nikola.matic@ethereum.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crash on invalid UTF-8 bytes in Natspec

4 participants