Summary
preprocess_snippet() in src/matcher.rs is a char-by-char state machine that handles $name, $name:Kind, $$$name, $($name)sep*, and $BODY/$STMT/$EXPR special tokens. As we add more expressiveness (multi-item repetition groups, alternatives, conditions), this approach will become fragile and hard to maintain.
Replace the ad-hoc preprocessor with a proper tokenizer + parser for the patchwork pattern language.
Current state
preprocess_snippet() walks char_indices() with a peekable() iterator and branches on $:
- $( -> repetition syntax $($name)sep*
- $$$ -> multi-child placeholder
- $name -> plain placeholder, checks for :Kind suffix
- Special token check ($BODY, $STMT, $EXPR) interleaved with placeholder logic
This works but has issues:
- Ownership/borrow friction (the name move bug fixed in earlier commits)
- Error messages are nonexistent (syntax errors become tree-sitter parse failures)
- Adding new syntax requires threading more state through the state machine
- No source location tracking for error reporting
Proposed design
Tokenizer
Produce a token stream from the raw pattern string:
| Input | Token |
|---|---|
| $name | Placeholder("name") |
| $name:Kind | TypedPlaceholder("name", "Kind") |
| $$$name | MultiPlaceholder("name") |
| $$$ | MultiPlaceholder(None) |
| $BODY | Special("BODY") |
| $STMT | Special("STMT") |
| $EXPR | Special("EXPR") |
| $( | RepOpen |
| ) | RepClose |
| * / + / ? (after RepClose+sep) | Quant(Star), Quant(Plus), or Quant(Optional) |
| code text | Literal("...") |
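As a rough illustration of the table above, here is a minimal tokenizer sketch. It is not the proposed implementation: it covers only the placeholder, special, and literal rows, and it leaves out RepOpen, RepClose, and Quant, so `$(` is reported as an error here. The names `Spanned`, `TokenizeError`, `read_ident`, and the identifier rule (ASCII alphanumerics plus `_`) are assumptions made for the example.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Placeholder(String),
    TypedPlaceholder(String, String),
    MultiPlaceholder(Option<String>),
    Special(String),
    Literal(String),
}

/// A token plus the byte offset where it starts in the pattern (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Spanned {
    pub token: Token,
    pub offset: usize,
}

/// Tokenizer error carrying a byte offset (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct TokenizeError {
    pub message: String,
    pub offset: usize,
}

impl fmt::Display for TokenizeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "error: {} at position {}", self.message, self.offset)
    }
}

/// Assumed identifier rule: ASCII alphanumerics and underscore.
fn is_ident(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Reads an identifier starting at byte `start`; returns it and its end offset.
fn read_ident(src: &str, start: usize) -> (&str, usize) {
    let tail = &src[start..];
    let len = tail.find(|c: char| !is_ident(c)).unwrap_or(tail.len());
    (&tail[..len], start + len)
}

pub fn tokenize(src: &str) -> Result<Vec<Spanned>, TokenizeError> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < src.len() {
        let rest = &src[pos..];
        if !rest.starts_with('$') {
            // Literal run: everything up to the next '$' or the end of input.
            let len = rest.find('$').unwrap_or(rest.len());
            out.push(Spanned { token: Token::Literal(rest[..len].to_string()), offset: pos });
            pos += len;
            continue;
        }
        let start = pos;
        let token = if rest.starts_with("$$$") {
            // Multi-child placeholder, with or without a name.
            let (name, end) = read_ident(src, pos + 3);
            pos = end;
            Token::MultiPlaceholder(if name.is_empty() { None } else { Some(name.to_string()) })
        } else {
            let (name, end) = read_ident(src, pos + 1);
            if name.is_empty() {
                // Includes `$(`, which this sketch does not handle.
                return Err(TokenizeError { message: "unexpected $".to_string(), offset: start });
            }
            pos = end;
            if matches!(name, "BODY" | "STMT" | "EXPR") {
                Token::Special(name.to_string())
            } else if src[pos..].starts_with(':') {
                let (kind, kind_end) = read_ident(src, pos + 1);
                if kind.is_empty() {
                    // A bare ':' is left in place and becomes literal text.
                    Token::Placeholder(name.to_string())
                } else {
                    pos = kind_end;
                    Token::TypedPlaceholder(name.to_string(), kind.to_string())
                }
            } else {
                Token::Placeholder(name.to_string())
            }
        };
        out.push(Spanned { token, offset: start });
    }
    Ok(out)
}
```

With this sketch, `tokenize("let x = $ + 1")` fails with an error whose `Display` output is `error: unexpected $ at position 8`, since the stray `$` sits at byte offset 8.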
Parser
Consume the token stream and emit the processed pattern string with metadata, building:
ProcessedSnippet {
    pattern_text: String, // sentinel-replaced pattern
    placeholders: HashMap<String, String>, // name -> sentinel
    repetitions: HashMap<String, RepetitionInfo>, // sentinel -> RepetitionInfo
    specials: HashMap<String, SpecialKind>, // sentinel -> SpecialKind
    type_constraints: HashMap<String, String>, // sentinel -> kind name
}
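To make the parser step more concrete, the following sketch shows one way a token stream could be folded into pattern text plus metadata. It is an illustration under several assumptions, not the actual design: the sentinel format `__pw_N__`, the helpers `fresh_sentinel` and `sentinel_for`, the `SpecialKind` variants, and the choice to reuse one sentinel per placeholder name are all hypothetical. Repetitions and multi-child placeholders are omitted, so the `repetitions` field is left out of this reduced struct.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Placeholder(String),
    TypedPlaceholder(String, String),
    Special(String),
    Literal(String),
}

/// Hypothetical variants; the real SpecialKind may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SpecialKind {
    Body,
    Stmt,
    Expr,
}

/// Reduced version of ProcessedSnippet (no `repetitions` in this sketch).
#[derive(Debug, Default)]
pub struct ProcessedSnippet {
    pub pattern_text: String,
    pub placeholders: HashMap<String, String>,     // name -> sentinel
    pub specials: HashMap<String, SpecialKind>,    // sentinel -> SpecialKind
    pub type_constraints: HashMap<String, String>, // sentinel -> kind name
}

/// Assumed sentinel format: an identifier-like string unlikely to appear in real code.
fn fresh_sentinel(next_id: &mut usize) -> String {
    let sentinel = format!("__pw_{}__", *next_id);
    *next_id += 1;
    sentinel
}

/// Returns the sentinel for `name`, allocating one on first use.
fn sentinel_for(map: &mut HashMap<String, String>, name: &str, next_id: &mut usize) -> String {
    if let Some(existing) = map.get(name) {
        return existing.clone();
    }
    let sentinel = fresh_sentinel(next_id);
    map.insert(name.to_string(), sentinel.clone());
    sentinel
}

pub fn parse(tokens: &[Token]) -> Result<ProcessedSnippet, String> {
    let mut out = ProcessedSnippet::default();
    let mut next_id = 0usize;
    for token in tokens {
        match token {
            // Literal code is copied through unchanged.
            Token::Literal(text) => out.pattern_text.push_str(text),
            Token::Placeholder(name) => {
                let s = sentinel_for(&mut out.placeholders, name, &mut next_id);
                out.pattern_text.push_str(&s);
            }
            Token::TypedPlaceholder(name, kind) => {
                let s = sentinel_for(&mut out.placeholders, name, &mut next_id);
                out.type_constraints.insert(s.clone(), kind.clone());
                out.pattern_text.push_str(&s);
            }
            Token::Special(name) => {
                let kind = match name.as_str() {
                    "BODY" => SpecialKind::Body,
                    "STMT" => SpecialKind::Stmt,
                    "EXPR" => SpecialKind::Expr,
                    other => return Err(format!("unknown special token ${}", other)),
                };
                // Each special occurrence gets its own sentinel.
                let s = fresh_sentinel(&mut next_id);
                out.specials.insert(s.clone(), kind);
                out.pattern_text.push_str(&s);
            }
        }
    }
    Ok(out)
}
```

Because `parse` takes a plain token slice, it can be exercised in a unit test with hand-built tokens, without going through the tokenizer or tree-sitter.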
Migration
- Keep the existing ProcessedSnippet struct interface unchanged
- Replace preprocess_snippet() body with tokenize() then parse() pipeline
- All existing callers (find_snippet_matches, tests) remain untouched
- Tokenizer and parser are independently unit-testable
Benefits
- Better errors - tokenizer tracks byte offsets, so we can report error: unexpected $ at position 12 instead of relying on tree-sitter parse failures
- Extensible - adding regex constraints $name ~ /pattern/ or alternatives ($A | $B) is a new token type and a parser case
- Testable - tokenizer and parser can be unit-tested independently instead of only through find_snippet_matches
- Multi-item repetitions - the parser naturally handles $($a + $b),* by allowing multiple placeholders in Repetition.inner
- No more ownership fights - the tokenizer owns the tokens, the parser owns the output
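The multi-item repetition point can be pictured with a small tree type. This is only a sketch: the issue names `Repetition.inner` but does not define the node type, so `Node`, its variants, the `separator` and `quant` fields, and the `placeholder_names` helper are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Quant {
    Star,
    Plus,
    Optional,
}

/// Hypothetical parse-tree node; `inner` may hold any mix of literals and placeholders.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Literal(String),
    Placeholder(String),
    Repetition {
        inner: Vec<Node>,
        separator: String,
        quant: Quant,
    },
}

/// Collects placeholder names in order of appearance, descending into repetition groups.
pub fn placeholder_names(nodes: &[Node]) -> Vec<&str> {
    let mut names = Vec::new();
    for node in nodes {
        match node {
            Node::Placeholder(name) => names.push(name.as_str()),
            Node::Repetition { inner, .. } => names.extend(placeholder_names(inner)),
            Node::Literal(_) => {}
        }
    }
    names
}

/// Builds the tree that `$($a + $b),*` might parse to under these assumptions.
pub fn example_tree() -> Vec<Node> {
    vec![Node::Repetition {
        inner: vec![
            Node::Placeholder("a".to_string()),
            Node::Literal(" + ".to_string()),
            Node::Placeholder("b".to_string()),
        ],
        separator: ",".to_string(),
        quant: Quant::Star,
    }]
}
```

Since `inner` is a list rather than a single name, a group with two placeholders needs no special casing: `placeholder_names(&example_tree())` yields `a` and `b`.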
Out of scope for this issue
- Pattern validation against language grammar (tree-sitter handles this)
- Combinatorial patterns (multiple patterns combined with and/or)
- Conditions/comparisons on captured values