Skip to content

tree-sitter generate is ~5min for the TS grammar (9788 states vs official ~2000) #46

@johnsoncodehk

Description

@johnsoncodehk

The generated TypeScript tree-sitter parser has STATE_COUNT 9788 (parser.c ~16 MB); tree-sitter generate takes ~5 min single-threaded. The official tree-sitter-typescript is ~2000 states. This only bites at commit/CI time (when a grammar change regenerates gate:treesitter); the dev loop is unaffected (npm run check was parallelized to ~19s and does not run generate).

Root cause (measured)

~116 LR_CONFLICT_CLOSURE tuples (src/gen-treesitter.ts), incl. 34 self-conflicts (['expr'], ['type'], ['type_member']). The dominant driver is the genuine structural ambiguity of the unified grammar: a TS type A.B<C>[] is also a valid expression, so tree-sitter GLR explores both type and expr (and binding/prop/member_name) at every { / [ / computed position.

Ruled out (already optimal / not the cause)

  • Operator precedence — gen-treesitter already emits prec.left/prec.right.
  • Keyword lexer — already uses word: extraction.
  • nested-new self-conflict — removing it dropped only ~250 states.

Empirically-failed emission tweaks

  • inline never-captured rules (inline: [param, prop] + strip inlined symbols from conflicts): tree-sitter errored "Add a conflict for these rules: expr, member_name, binding_property" (stripping surfaced an undeclared conflict) and took 48 min to fail. param is referenced 82x, so inlining also duplicates its body.
  • type-Pratt (wrap the type rule’s left-recursive arms in prec.left, since Type is not a pratt rule and renders prec-0): STATE_COUNT 9788 → 9868 (worse), and the type/type_member conflicts remain — the ['type'] self-conflict is true type-vs-expr ambiguity, not left-recursion associativity, so precedence cannot resolve it.

The path that actually reaches ~2000 states

An external scanner that disambiguates type-vs-expr context (how the official grammar avoids these conflicts) — a major, multi-session refactor. The 5–48 min generate cycle (failures balloon to ~48 min) makes iteration expensive, so this needs a dedicated effort, not incremental tweaks.

Lower-effort follow-ups worth trying first

  • Prune the append-only LR_CONFLICT_CLOSURE (collect-conflicts.ts only adds, never removes) — a stale tuple still forces GLR splits.
  • supertypes: for member_name (and possibly type/expr).
  • Stagger the LED precedence ladder + a prec wrapper on the C-cast nud.
    (Each is a 5-min generate to verify; gate:treesitter ≥95.9% is the only regression guard since these are emission-only.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions