Skip to content

fix(scripts): encode residual JSON control chars as \uXXXX instead of stripping#1872

Open
pierluigilenoci wants to merge 3 commits intogithub:mainfrom
pierluigilenoci:fix/json-escape-uxxxx-encoding
Open

fix(scripts): encode residual JSON control chars as \uXXXX instead of stripping#1872
pierluigilenoci wants to merge 3 commits intogithub:mainfrom
pierluigilenoci:fix/json-escape-uxxxx-encoding

Conversation

@pierluigilenoci
Copy link
Contributor

Summary

  • json_escape() in common.sh was silently deleting control characters (U+0000–U+001F) not individually handled (\n, \t, \r, \b, \f) via tr -d, causing data loss
  • Per RFC 8259, these characters must be encoded as \uXXXX sequences
  • Replaced the tr -d strip with a char-by-char loop that emits proper \uXXXX escapes, preserving data integrity
  • Multi-byte UTF-8 sequences are unaffected (first-byte values >= 0xC0 are never control characters)

Example

Before: "hello\x01world""helloworld" (SOH silently deleted)
After: "hello\x01world""hello\u0001world" (SOH properly escaped)

Test plan

  • Verify json_escape correctly encodes control characters like SOH (\x01), STX (\x02), etc. as \uXXXX
  • Verify previously handled characters (\n, \t, \r, \b, \f) still use their short escape forms
  • Verify multi-byte UTF-8 strings (e.g. emoji, CJK) pass through unchanged
  • Run check-prerequisites.sh --json and create-new-feature.sh --json to verify JSON output is valid

…pping

json_escape() was silently deleting control characters (U+0000-U+001F)
that were not individually handled (\n, \t, \r, \b, \f). Per RFC 8259,
these must be encoded as \uXXXX sequences to preserve data integrity.

Replace the tr -d strip with a char-by-char loop that emits proper
\uXXXX escapes for any remaining control characters.
@pierluigilenoci pierluigilenoci requested a review from mnriem as a code owner March 16, 2026 23:37
@pierluigilenoci
Copy link
Contributor Author

@mnriem @dhilipkumars, please take a look.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Bash json_escape() fallback (used when jq is unavailable) to avoid silent data loss by encoding previously-unhandled JSON control characters (U+0000–U+001F) as \uXXXX sequences instead of stripping them.

Changes:

  • Replaces tr -d control-character stripping with a per-character scan that emits \u%04x escapes for remaining control bytes.
  • Keeps existing short escapes (\n, \t, \r, \b, \f) and quote/backslash escaping behavior unchanged.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Collaborator

@mnriem mnriem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address Copilot feedback. If not applicable please explain why

- Set LC_ALL=C for the entire loop (not just printf) so that ${#s} and
  ${s:$i:1} operate on bytes deterministically across locales
- Fix comment: U+0000 (NUL) cannot exist in bash strings, range is
  U+0001-U+001F; adjust code guard accordingly (code >= 1)
- Emit directly to stdout instead of accumulating in a variable,
  avoiding quadratic string concatenation on longer inputs
@pierluigilenoci
Copy link
Contributor Author

@mnriem All three Copilot findings have been addressed in 1b7ce3b:

  1. Locale consistencylocal LC_ALL=C set for the entire loop so ${#s}, ${s:$i:1}, and printf all operate in byte-wise mode consistently
  2. NUL clarification — comment corrected to U+0001–U+001F (bash strings cannot contain NUL); code guard adjusted from >= 0 to >= 1
  3. Linear output — loop now emits directly to stdout via printf instead of quadratic out+= accumulation

Ready for re-review.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the Bash JSON escaping helper used by various scripts/bash/* utilities to preserve JSON validity by escaping (rather than stripping) remaining ASCII control characters when jq isn’t available.

Changes:

  • Replaced control-character stripping with \uXXXX escaping for remaining U+0001–U+001F bytes in json_escape().
  • Added byte-wise iteration under LC_ALL=C to keep UTF-8 bytes passing through unchanged.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mnriem mnriem self-requested a review March 17, 2026 19:35
Copy link
Collaborator

@mnriem mnriem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address Copilot feedback. You are almost there!

Replace code=$(printf ...) with printf -v code to assign the character
code without spawning a subshell on every byte, reducing overhead for
longer inputs.
@pierluigilenoci
Copy link
Contributor Author

@mnriem Fourth Copilot finding addressed in 0b1f005 — replaced code=$(printf ...) subshell with printf -v code for zero-fork overhead in the per-byte loop. All 4 review threads are now resolved. Ready for re-review.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the bash JSON escaping fallback used by various scripts to produce valid JSON when jq isn’t available, by escaping (rather than stripping) remaining control characters.

Changes:

  • Replace removal of unescaped U+0001–U+001F control characters with \uXXXX escaping.
  • Add a byte-wise iteration step (under LC_ALL=C) to preserve UTF-8 bytes while escaping control bytes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mnriem mnriem self-requested a review March 18, 2026 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants