Formalize patchwork pattern language into a proper tokenizer + parser #1

Description

@ThatXliner

Summary

preprocess_snippet() in src/matcher.rs is a char-by-char state machine that handles $name, $name:Kind, $$$name, $($name)sep*, and $BODY/$STMT/$EXPR special tokens. As we add more expressiveness (multi-item repetition groups, alternatives, conditions), this approach will become fragile and hard to maintain.

Replace the ad-hoc preprocessor with a proper tokenizer + parser for the patchwork pattern language.

Current state

preprocess_snippet() walks char_indices() with a peekable() iterator and branches on $:

  1. $( -> repetition syntax $($name)sep*
  2. $$$ -> multi-child placeholder
  3. $name -> plain placeholder, checks for :Kind suffix
  4. Special token check ($BODY, $STMT, $EXPR) interleaved with placeholder logic

This works but has issues:

  • Ownership/borrow friction (the name move bug fixed in earlier commits)
  • There are no pattern-level error messages (syntax errors surface as tree-sitter parse failures)
  • Adding new syntax requires threading more state through the state machine
  • No source location tracking for error reporting

Proposed design

Tokenizer

Produce a token stream from the raw pattern string:

Input                             Token
$name                             Placeholder("name")
$name:Kind                        TypedPlaceholder("name", "Kind")
$$$name                           MultiPlaceholder("name")
$$$                               MultiPlaceholder(None)
$BODY                             Special("BODY")
$STMT                             Special("STMT")
$EXPR                             Special("EXPR")
$(                                RepOpen
)                                 RepClose
* / + / ? (after RepClose+sep)    Quant(Star|Plus|Optional)
code text                         Literal("...")
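
To make the token list concrete, here is a rough sketch of what tokenize() could look like. It is not the actual implementation: TokenError, ident_at, the (byte offset, Token) pair layout, and the choice to emit the repetition separator as a plain Literal are all assumptions made for illustration. A `)` only becomes RepClose when it balances an open `$(`; otherwise it stays code text.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq)]
pub enum Quant { Star, Plus, Optional }

/// Token kinds from the issue's list; payload types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Placeholder(String),
    TypedPlaceholder(String, String),
    MultiPlaceholder(Option<String>),
    Special(String),
    RepOpen,
    RepClose,
    Quant(Quant),
    Literal(String),
}

/// Hypothetical error type: `offset` is a byte offset into the pattern.
#[derive(Debug, PartialEq)]
pub struct TokenError { pub offset: usize, pub message: &'static str }

impl fmt::Display for TokenError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "error: {} at position {}", self.message, self.offset)
    }
}

/// Longest run of identifier characters starting at byte offset `at` (may be empty).
fn ident_at(src: &str, at: usize) -> &str {
    let rest = &src[at..];
    let end = rest.find(|c: char| !(c.is_ascii_alphanumeric() || c == '_')).unwrap_or(rest.len());
    &rest[..end]
}

pub fn tokenize(src: &str) -> Result<Vec<(usize, Token)>, TokenError> {
    let (mut out, mut lit, mut lit_start) = (Vec::new(), String::new(), 0);
    let mut groups: Vec<u32> = Vec::new(); // one entry per open `$(`, counting nested plain parens
    let mut i = 0;
    while i < src.len() {
        let c = src[i..].chars().next().unwrap();
        let closes_group = c == ')' && groups.last() == Some(&0);
        if c != '$' && !closes_group { // ordinary code text: extend the pending literal
            if let Some(top) = groups.last_mut() {
                if c == '(' { *top += 1 } else if c == ')' { *top -= 1 }
            }
            if lit.is_empty() { lit_start = i; }
            lit.push(c);
            i += c.len_utf8();
            continue;
        }
        if !lit.is_empty() { out.push((lit_start, Token::Literal(std::mem::take(&mut lit)))); }
        let start = i;
        if closes_group {
            groups.pop();
            out.push((start, Token::RepClose));
            i += 1;
            // Optional one-character separator (kept as a Literal), then a quantifier.
            let mut next = src[i..].chars().next();
            if let Some(sep) = next.filter(|ch| !"*+?".contains(*ch)) {
                out.push((i, Token::Literal(sep.to_string())));
                i += sep.len_utf8();
                next = src[i..].chars().next();
            }
            let quant = match next {
                Some('*') => Quant::Star,
                Some('+') => Quant::Plus,
                Some('?') => Quant::Optional,
                _ => return Err(TokenError { offset: i, message: "expected quantifier" }),
            };
            out.push((i, Token::Quant(quant)));
            i += 1;
        } else if src[i..].starts_with("$(") {
            groups.push(0);
            out.push((start, Token::RepOpen));
            i += 2;
        } else if src[i..].starts_with("$$$") {
            let name = ident_at(src, i + 3);
            i += 3 + name.len();
            out.push((start, Token::MultiPlaceholder(Some(name.to_string()).filter(|n| !n.is_empty()))));
        } else {
            let name = ident_at(src, i + 1);
            if name.is_empty() { return Err(TokenError { offset: start, message: "unexpected $" }); }
            i += 1 + name.len();
            let kind = if src[i..].starts_with(':') { ident_at(src, i + 1) } else { "" };
            let tok = if matches!(name, "BODY" | "STMT" | "EXPR") {
                Token::Special(name.to_string())
            } else if !kind.is_empty() {
                i += 1 + kind.len();
                Token::TypedPlaceholder(name.to_string(), kind.to_string())
            } else {
                Token::Placeholder(name.to_string())
            };
            out.push((start, tok));
        }
    }
    if !lit.is_empty() { out.push((lit_start, Token::Literal(lit))); }
    if groups.is_empty() { Ok(out) } else { Err(TokenError { offset: src.len(), message: "unclosed $( group" }) }
}
```

With this sketch, tokenize("let total = $;") returns an error whose Display output is "error: unexpected $ at position 12", and tokenize("$($a + $b),*") yields RepOpen, Placeholder("a"), Literal(" + "), Placeholder("b"), RepClose, Literal(","), Quant(Star).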

Parser

Consume the token stream and emit the processed pattern string with metadata, building:

ProcessedSnippet {
    pattern_text: String,        // sentinel-replaced pattern
    placeholders: HashMap,       // name -> sentinel
    repetitions: HashMap,        // sentinel -> RepetitionInfo
    specials: HashMap,           // sentinel -> SpecialKind
    type_constraints: HashMap,   // sentinel -> kind name
}
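
As a sketch of the sentinel-replacement step only, the parser could look roughly like the following. Several things here are assumptions rather than what matcher.rs does today: the `String` key and value types for the maps (the struct above does not show them), the `__PW_n__` sentinel format, the SpecialKind variant names, and the rule that a repeated placeholder name reuses its sentinel. Repetitions are left out to keep the example short.

```rust
use std::collections::HashMap;

/// Subset of the proposed tokens, enough to show sentinel replacement.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Placeholder(String),
    TypedPlaceholder(String, String),
    Special(String),
    Literal(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SpecialKind { Body, Stmt, Expr }

/// Assumed shape: map type parameters are a guess; `repetitions` is omitted.
#[derive(Debug, Default)]
pub struct ProcessedSnippet {
    pub pattern_text: String,
    pub placeholders: HashMap<String, String>,     // name -> sentinel
    pub specials: HashMap<String, SpecialKind>,    // sentinel -> SpecialKind
    pub type_constraints: HashMap<String, String>, // sentinel -> kind name
}

/// Hypothetical sentinel format; the real one is whatever matcher.rs uses.
fn fresh_sentinel(next_id: &mut usize) -> String {
    let s = format!("__PW_{}__", *next_id);
    *next_id += 1;
    s
}

pub fn parse(tokens: &[Token]) -> Result<ProcessedSnippet, String> {
    let mut out = ProcessedSnippet::default();
    let mut next_id = 0;
    for tok in tokens {
        let sentinel = match tok {
            Token::Literal(text) => {
                // Code text is copied through unchanged.
                out.pattern_text.push_str(text);
                continue;
            }
            Token::Placeholder(name) | Token::TypedPlaceholder(name, _) => {
                // Reuse the sentinel when a name repeats (assumed semantics).
                let s = match out.placeholders.get(name) {
                    Some(existing) => existing.clone(),
                    None => fresh_sentinel(&mut next_id),
                };
                out.placeholders.insert(name.clone(), s.clone());
                if let Token::TypedPlaceholder(_, kind) = tok {
                    out.type_constraints.insert(s.clone(), kind.clone());
                }
                s
            }
            Token::Special(word) => {
                let kind = match word.as_str() {
                    "BODY" => SpecialKind::Body,
                    "STMT" => SpecialKind::Stmt,
                    "EXPR" => SpecialKind::Expr,
                    other => return Err(format!("unknown special token ${}", other)),
                };
                let s = fresh_sentinel(&mut next_id);
                out.specials.insert(s.clone(), kind);
                s
            }
        };
        out.pattern_text.push_str(&sentinel);
    }
    Ok(out)
}
```

Because parse() takes a plain token slice, it can be tested with hand-built tokens and no tokenizer in the loop.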

Migration

  • Keep the existing ProcessedSnippet struct interface unchanged
  • Replace the body of preprocess_snippet() with a tokenize() then parse() pipeline
  • All existing callers (find_snippet_matches, tests) remain untouched
  • Tokenizer and parser are independently unit-testable

Benefits

  1. Better errors - the tokenizer tracks byte offsets, so we can report "error: unexpected $ at position 12" instead of relying on tree-sitter parse failures
  2. Extensible - adding regex constraints ($name ~ /pattern/) or alternatives ($A | $B) is a new token type and a parser case
  3. Testable - tokenizer and parser can be unit-tested independently instead of only through find_snippet_matches
  4. Multi-item repetitions - the parser naturally handles $($a + $b),* by allowing multiple placeholders in Repetition.inner
  5. No more ownership fights - the tokenizer owns the tokens, the parser owns the output
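
To illustrate benefit 4, here is one possible shape for the repetition part of the parser. The Node enum, parse_seq, and the convention that the separator arrives as a Literal token between RepClose and Quant are hypothetical; only the idea of a Repetition with an `inner` list comes from this issue. The recursive call is what lets a group hold several placeholders and literals (and nested groups) at once.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Quant { Star, Plus, Optional }

/// Subset of the proposed tokens needed for repetition groups.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Placeholder(String),
    Literal(String),
    RepOpen,
    RepClose,
    Quant(Quant),
}

/// Hypothetical parse tree; `Repetition.inner` may hold any number of nodes.
#[derive(Debug, PartialEq)]
pub enum Node {
    Placeholder(String),
    Literal(String),
    Repetition { inner: Vec<Node>, sep: Option<String>, quant: Quant },
}

fn parse_seq(tokens: &[Token], pos: &mut usize, in_group: bool) -> Result<Vec<Node>, String> {
    let mut nodes = Vec::new();
    while let Some(tok) = tokens.get(*pos) {
        *pos += 1;
        match tok {
            Token::Placeholder(name) => nodes.push(Node::Placeholder(name.clone())),
            Token::Literal(text) => nodes.push(Node::Literal(text.clone())),
            Token::RepOpen => {
                // The recursive call consumes everything up to and including RepClose.
                let inner = parse_seq(tokens, pos, true)?;
                // An optional Literal separator may sit between `)` and the quantifier.
                let sep = match (tokens.get(*pos), tokens.get(*pos + 1)) {
                    (Some(Token::Literal(s)), Some(Token::Quant(_))) => {
                        *pos += 1;
                        Some(s.clone())
                    }
                    _ => None,
                };
                let quant = match tokens.get(*pos) {
                    Some(Token::Quant(q)) => *q,
                    _ => return Err("expected a quantifier after `)`".to_string()),
                };
                *pos += 1;
                nodes.push(Node::Repetition { inner, sep, quant });
            }
            Token::RepClose if in_group => return Ok(nodes),
            other => return Err(format!("unexpected token {:?}", other)),
        }
    }
    if in_group { Err("unclosed `$(` group".to_string()) } else { Ok(nodes) }
}

pub fn parse(tokens: &[Token]) -> Result<Vec<Node>, String> {
    parse_seq(tokens, &mut 0, false)
}
```

Given the tokens for $($a + $b),* this produces a single Repetition node whose inner list is Placeholder("a"), Literal(" + "), Placeholder("b"), with sep "," and quant Star.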

Out of scope for this issue

  • Pattern validation against language grammar (tree-sitter handles this)
  • Combinatorial patterns (multiple patterns combined with and/or)
  • Conditions/comparisons on captured values

Metadata

Labels: enhancement (New feature or request)
