Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing `Token::Illegal` handling in switch, giving misleading error to users by msooseth · Pull Request #16520 · argotorg/solidity

msooseth · 2026-03-12T13:58:01Z

Problem

The compiler crashes with an unhandled nlohmann::json::type_error when a source file
contains invalid UTF-8 bytes in a NatSpec comment (e.g. /// \xF7).

The scanner copies raw source bytes into the NatSpec comment literal without UTF-8
validation. These bytes end up in the metadata JSON via natspecUser/natspecDev,
and jsonPrint's dump(ensure_ascii=true) then throws type_error::316 on the
invalid byte.

A secondary issue: even after adding the scanner error, the compiler reported a
misleading "Expected pragma, import directive..." message instead of the real error,
because Token::Illegal fell into the default: branch of the top-level parser switch
which ignores currentError().

Fix

Scanner (liblangutil/Scanner.cpp): After scanning each NatSpec comment literal,
validate it using the existing solidity::util::validateUTF8 from libsolutil/UTF8.h
(which liblangutil already links against). If invalid UTF-8 is found, report the new
ScannerError::InvalidUTF8InComment error.

scanSingleLineDocComment: validate after literal.complete(); set
m_skippedComments[NextNext].error if invalid. scanSlash checks and propagates
this as Token::Illegal.
scanMultiLineDocComment: validate after literal.complete(); return
setError(ScannerError::InvalidUTF8InComment) if invalid.

Parser (libsolidity/parsing/Parser.cpp): Add an explicit case Token::Illegal:
in the top-level source-unit switch that calls fatalParserError with
to_string(m_scanner->currentError()), matching the pattern already used elsewhere
in the parser.

Output

Output now:

./solc/solc --optimize --ir minimized-from-ab4ee1498fff0c71961d81fc308f1dee0be16f33

Warning: This is a pre-release compiler version, please do not use it in production.

Error: Invalid UTF-8 sequence in NatSpec comment.
 --> minimized-from-ab4ee1498fff0c71961d81fc308f1dee0be16f33:1:1:
  |
1 | ///
  | ^ (Relevant source part starts here and spans across multiple lines).

Disclaimers

Wirrten by Claude Code, reviewed by myself.

clonker

some minor adjustments. you can use the single-argument version of validateUTF8 if you're not using the invalidPos anyways. And I think you can get away without using m_skippedComments[...].error as intermediary if you set the error directly when it is encountered and then return Token::Illegal in scanSlash.

msooseth

Thanks, very good points! I think the last one is better left as-is though?

I have also fixed the python script for the error checking, please double-check that one... don't want to accidentally disable all tests or something :S

cameel · 2026-03-12T19:59:52Z

-                if basename(f) == "invalid_utf8_sequence.sol":
-                    continue  # ignore the test with broken utf-8 encoding
+                if "invalid_utf8" in basename(f):
+                    continue  # ignore tests with invalid utf-8 encoding


Please don't generalize this hack :)

The issue is in the splitting script - it should just handle invalid UTF-8 properly and not generate UnicodeDecode error ever (see #9710 (comment)). It's very low priority, but if it's getting in the way, we should just do a proper fix instead of making the hack more elaborate :)

I fixed this I hope. Please check :)

Add

Fixing

clonker

aside from two missing newlines (sue me) it looks good to me now, thanks!

clonker · 2026-03-23T13:36:16Z

+//�
+} }
+// ----
+// ParserError 8936: (59-62): Invalid UTF-8 sequence in comment.


It's a bit funny to get error 8936 here and in other places error 1465, once without and once with an Illegal token: prefix. I understand the different numbers are because one is in Parser the other in AsmParser. Still, a bit inconsistent with the prefix, too. :) Nothing to be done in this PR though, I'd think.

We should add the prefix then. If we have it on the Yul/assembly side, I see no reason not to have the in the Solidity version as well. This is pretty much the same error message, just in two different parsers.

clonker

looks good to me now, thanks!

github-actions · 2026-04-07T12:21:36Z

This pull request is stale because it has been open for 14 days with no activity.
It will be closed in 7 days unless the stale label is removed.

msooseth · 2026-04-08T14:48:23Z

Can someone please review this? @cameel or @nikola-matic maybe?

cameel · 2026-04-08T20:35:26Z

I'll try to get to it tomorrow.

cameel

I think there are still cases that will produce invalid UTF-8 and crash the compiler on JSON output. Check my comments below.

Also there are a few more cases in Yul where the error message is subpar, though I'm on the fence whether it's worth addressing those. It's whack-a-mole. But I pointed them out in case you want to try.

Some tests are also missing and the changelog may not be entirely accurate (it seems to me that there was only one crash case here, not two).

cameel · 2026-04-09T13:10:36Z

These tests would probably fit better in syntaxTests/comments/.

And the whole comments/ should probably just be a subdir in syntaxTests/parsing/, but that's something to fix another time so just FYI.

cameel · 2026-04-09T18:32:58Z

+{ /*�*/
+}
+// ----
+// ParserError 1465: (2-6): Illegal token: Invalid UTF-8 sequence in comment.


Note that the PR fixes the message only at the beginning of statements, which means that we still get a subpar message for this:

/* <ILLEGAL UTF-8 HERE> */ {}

Error (4294): Expected keyword "object".

This may be worth fixing since this is basically what you did on Solidity level already.

The error for illegal UTF-8 in debug annotations is also of this kind:

/// @use-src 0:"<ILLEGAL UTF-8 HERE>"

Error (9804): Error parsing arguments to @use-src. Expected: <number> ":" "<filename>", ...

There are also many other cases that won't print the specialized message for UTF-8 errors. For example:

{ let /* <ILLEGAL UTF-8 HERE> */ x := 1 }

Error (2314): Expected identifier but got 'ILLEGAL'

These are probably not worth fixing because I think you'd have to special-case a lot of individual token comparisons and the message at least references an illegal token, so it's clearer.

Can we just please fix ONE thing? I understand that a million other things could be fixed, but I am trying to fix a single thing and it's now 50+ files, 250 LoC, and we are still not being done. I really really just want this damned thing not to crash. Really. And instead, we are nitpicking on the exact wording of the changelog. I just want to fuzz the compiler so we can find actual, SERIOUS bugs, but instead, it's and extra EOL in the test case that seems to cause the most issues. I really just want to fuzz the compiler so we can find serious bugs.

cameel · 2026-04-09T18:45:35Z

-	error["formattedMessage"] = (_formattedMessage.length() > 0) ? _formattedMessage : _message;
+	std::string const& fmtMsg = (_formattedMessage.length() > 0) ? _formattedMessage : _message;
+	error["formattedMessage"] = util::validateUTF8(fmtMsg) ? fmtMsg : _message;


I think you're missing a test that covers this change. The two command-line tests you added did not have UTF-8 in the formatted message, but in the outputs. Those outputs no longer show up due to the fact that we error out earlier in the parser, not due to this change.

To actually trigger this, you need something where invalid UTF-8 is included in the code snippet that goes into the formatted message. For example:

contract {} /* <INVALID UTF-8 HERE> */

Would be good to have a command-line test that has two files. One with and one without invalid UTF-8 so that in the output we can see the difference in formatted message between them.

I addressed this in 183cef3 which adds standard_invalid_utf8_in_formatted_message

cameel · 2026-04-09T18:59:22Z

@@ -72,7 +73,8 @@ Json formatError(
 	error["component"] = _component;
 	error["severity"] = Error::formatErrorSeverityLowercase(Error::errorSeverity(_type));
 	error["message"] = _message;


The unformatted message might sometimes include invalid UTF-8 too. Try this:

import "\xff";

Error: Source "�" not found: File not found. Searched the following locations: "".

I think we will actually need some escaping/stripping here. Perhaps we should convert any invalid unicode we find to escape sequences? Or just all Unicode? Or maybe replace it with placeholder chars?

sourceLocation and secondarySourceLocation have the same problem as they can include file paths. Though I'm not sure if we can even reach this place without crashing the compiler first - these paths would have to be under sources in the input as well so this may crash the JSON parser already. They'll also be under contracts if things compile without errors.

We should check that - can you add a test case that has invalid UTF-8 under sources in the input?

Also, if such an input crashes the compiler, we need to add validateUTF8() on any JSON input we receive. Or at least catch nlohmann::json::type_error it triggers.

I think re-interpreting what the user wrote, e.g. by stripping characters from an import could be EXTREMELY dangerous. I'd much rather fail spectacularly, than silently import some other code and potentially cause some serious unintended behaviour :S

cameel · 2026-04-09T19:36:07Z

+* Parser: Fix a crash caused by invalid UTF-8 sequences in comments (including after inline assembly blocks). Such sequences now produce a proper error message instead.
+* Parser: Fix a crash caused by unhandled ``Token::Illegal`` at the top level of Solidity source and inside Yul ``parseStatement``. Invalid tokens now produce a fatal parser error with a descriptive message.
+* Standard JSON: Fix invalid UTF-8 bytes in ``formattedMessage`` when a source location excerpt contains invalid UTF-8. The field now falls back to the plain message in that case.


First, these entries are aimed at end users so whenever possible they should refer to language concepts and observable compiler behavior rather than implementation details like Token::Illegal or parseStatement(). Users will have no idea what that is.

Second, were illegal tokens actually crashing Solidity parser and parseStatement()? I checked a few of your test cases and none of them does that for me on 0.8.34. At least not on the CLI, but if it happens only in Standard JSON then it's still a part of the same bug with comments.

This is how I see the situation right now, please correct me if you think this is not accurate:

Suggested change

* Parser: Fix a crash caused by invalid UTF-8 sequences in comments (including after inline assembly blocks). Such sequences now produce a proper error message instead.

* Parser: Fix a crash caused by unhandled ``Token::Illegal`` at the top level of Solidity source and inside Yul ``parseStatement``. Invalid tokens now produce a fatal parser error with a descriptive message.

* Standard JSON: Fix invalid UTF-8 bytes in ``formattedMessage`` when a source location excerpt contains invalid UTF-8. The field now falls back to the plain message in that case.

* Parser: Treat invalid UTF-8 sequences in comments and Natspec as errors. Such sequences used to be ignored and would cause a crash if they ended up inside JSON output.

* Parser: Fix inaccurate or vague error messages in errors triggered by invalid UTF-8 in comments.

* Standard JSON: Fix invalid UTF-8 bytes in the `formattedMessage` field of errors when a source location excerpt contains invalid UTF-8. The field now falls back to the plain message in that case.

Perhaps we should explicitly mention that it affected both Yul and Solidity too?

Also, the last one may have to be updated later since I think not only formattedMessage is affected.

You also don't have any tests covering the non-comment case. Or are the tests with strings meant to cover that? I expected to see at least one case similar to what you have with comments just with the invalid UTF-8 outside of a comment.

I applied your changelog.

I have now fixed this:

You also don't have any tests covering the non-comment case. Or are the tests with strings meant to cover that? I expected to see at least one case similar to what you have with comments just with the invalid UTF-8 outside of a comment

it shows up as Invalid token. I added 2 tests like this, one for yul, one for sol.

Notice that the system crashed before. It assert failed, crashed. That's not the case anymore. So the system is currently better, it does not crash. Arguably, it should not be Illegal token, but that's at least a valid rationale. A crash is not a rationale for stopping parsing, that's just a "??? best I crash!" which is not the same as Illegal token, which is actually correct, even if not precise.

Co-authored-by: Nikola Matić <nikola.matic@ethereum.org>

Co-authored-by: Kamil Śliwak <kamil.sliwak@codepoets.it>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… wasn't pretty

msooseth requested a review from clonker March 12, 2026 13:59

msooseth changed the title ~~Fix crash on invalid UTF-8 bytes in NatSpec comments~~ Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling for correct error message to users Mar 12, 2026

clonker reviewed Mar 12, 2026

View reviewed changes

Comment thread liblangutil/Scanner.cpp Outdated

Comment thread liblangutil/Scanner.cpp Outdated

Comment thread liblangutil/Scanner.cpp

Comment thread liblangutil/Scanner.cpp

msooseth commented Mar 12, 2026

View reviewed changes

Comment thread liblangutil/Scanner.cpp

cameel reviewed Mar 12, 2026

View reviewed changes

msooseth force-pushed the fix-utf8 branch from 4fd651d to 478278a Compare March 16, 2026 13:34

msooseth commented Mar 16, 2026

View reviewed changes

Comment thread scripts/isolate_tests.py

clonker requested changes Mar 18, 2026

View reviewed changes

Comment thread liblangutil/Scanner.cpp Outdated

msooseth added 20 commits March 23, 2026 11:47

Fixing utf8 parsing

ca0fe0e

Adding tests

127d147

Fixing utf8 handling here, so we can test wrong utf8

d0c46be

Update

95223c7

Fix

e76683b

Update

30f7220

Update

80d3f8f

Let's add seatbelt

be7c6d5

Update

2937e98

Update

847d79a

No utf8 issues

e03fff5

Fix this

eaa06cc

Add

c9a5db9

Add

Add tests

add0b1d

Update

bd0f2ca

Update changelog

dfb3ece

Update test

6ffffda

Fixing

d94ee22

Fixing

Adding more tests

3799114

Better changelog

f672342

msooseth force-pushed the fix-utf8 branch from 5d28d33 to f672342 Compare March 23, 2026 10:50

clonker reviewed Mar 23, 2026

View reviewed changes

msooseth added 3 commits March 23, 2026 16:40

Add unicode tests, too

deddf67

Add newline at the end of file

1a4cbd5

Fix missing newline

67bce2a

clonker previously approved these changes Mar 23, 2026

View reviewed changes

github-actions bot added the stale The issue/PR was marked as stale because it has been open for too long. label Apr 7, 2026

msooseth removed the stale The issue/PR was marked as stale because it has been open for too long. label Apr 8, 2026

nikola-matic reviewed Apr 8, 2026

View reviewed changes

Comment thread Changelog.md Outdated

cameel requested changes Apr 9, 2026

View reviewed changes

Update Changelog.md

6446ccf

Co-authored-by: Nikola Matić <nikola.matic@ethereum.org>

msooseth dismissed clonker’s stale review via 6446ccf April 20, 2026 09:00

msooseth and others added 7 commits April 20, 2026 11:08

Update changelog as requested

f286a45

Adding tests outside of comment utf-8

c289ea6

Adding test for utf8 in formatted message

183cef3

Merge branch 'develop' into fix-utf8

85fa82e

Update libsolidity/interface/StandardCompiler.cpp

4607cbb

Co-authored-by: Kamil Śliwak <kamil.sliwak@codepoets.it>

Move UTF-8 comment tests from inlineAssembly/ to comments/

5d8d706

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Fixing the fix that was meant to be a fix for the fix becuase the fix…

b8f676d

… wasn't pretty

Conversation

msooseth commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Output

Disclaimers

Uh oh!

clonker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

msooseth left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clonker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cameel Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

clonker left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 7, 2026

Uh oh!

msooseth commented Apr 8, 2026

Uh oh!

Uh oh!

cameel commented Apr 8, 2026

Uh oh!

cameel left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

msooseth Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

msooseth commented Mar 12, 2026 •

edited

Loading

cameel Apr 9, 2026 •

edited

Loading

msooseth Apr 20, 2026 •

edited

Loading