Skip to content

Emit backend gap audit: engine-parity automation, static-analysis single-source, arena reclamation, lexer dead-path #45

@johnsoncodehk

Description

@johnsoncodehk

Reviewed src/emit-parser.ts (4223 lines) and src/emit-lexer.ts (687 lines) across four axes (matcher/perf, incremental·recovery·arena, lexer, engine-parity debt), excluding already-shipped optimizations (SoA token stream, columnar arena, multi-word bitmask dispatch, SECOND-token refinement, node surgery, run-adoption). Each item carries file:line + confidence, and distinguishes a real gap from a necessary tradeoff.

The review twice hit "looks like a gap, actually already fixed" (the 32-arm dispatch cliff is solved with multi-word masks emit-parser.ts:1197; the incremental "rebuild a whole slice on one keystroke" is already mitigated by node surgery emit-parser.ts:3228), so only code-backed real gaps are kept below.

Highest leverage (suggested priority)

  1. A2 — wire the engine-parity gates into CI (cheapest, removes the largest risk). emit-parser-verify.ts and emit-reject-messages.ts are not in check.ts / package.json / .github/ (grep-confirmed zero matches); they run manually and depend on /tmp/ts-repo. Editing gen-parser.ts semantics has no mechanism — automated or human-enforced — that makes emit-parser.ts follow; it relies on a developer remembering to hand-sync both files and then manually running an un-wired gate against a corpus they must have cloned.
  2. A1 — static-analysis single-source. analyze() (emit-parser.ts:57-621) is a near-verbatim rewrite of gen-parser.ts:39-759 — ~560 lines of duplicated pure functions, and it hides a genuinely divergent algorithm (see A3's isLeftRecursive). Extractable into src/grammar-analysis.ts shared by both engines (precedent already exists: grammar-utils.ts shares collectLiterals/isKeywordLiteral).
  3. C1 — arena monotonic growth. edit() only appends, never reclaims; no free-list / compaction; only a full parse() resets the arena (which edit() never triggers). LSP-style long-lived sessions grow unbounded.
  4. B1 — DFA emitter dead path. emitTokenScannerBody (token-dfa.ts:319) has zero external callers (confirmed), its "~1.3–1.6×" claim has no gate measurement, and the only measured DFA path (the interpreter) is net-negative on all 12 TS tokens.

A. Engine-parity debt (emit-parser ↔ gen-parser)

gen-parser.ts is the createParser runtime interpreter (correctness oracle); emit-parser.ts compiles the same grammar into a standalone parser whose CST must be byte-identical. The debt splits into two layers of different shape.

# Title Location Type Confidence
A1 ~560 lines of static analysis duplicated (binding-power, ledPrec, FIRST/SECOND, nullability, mixfix, access-tail…) — pure CstGrammar→data functions, no technical reason to write twice emit-parser.ts:57-621gen-parser.ts:39-759 real debt (single-source) high
A2 the two parity gates are not automated: emit-parser-verify / emit-reject-messages are absent from check.ts/CI, run by hand, depend on /tmp/ts-repo; a gen-parser change has no mechanism propagating to emit-parser test/check.ts:17-52, .github/workflows/ci.yml:46 real debt (wire into CI) high
A3 isLeftRecursive is a different algorithm: gen-parser uses a left-corner transitive closure and rejects indirect/hidden LR at build time (gen-parser.ts:309-346); emit-parser does only the syntactic items[0]===self test (emit-parser.ts:121-127). They coincide on the current grammars, but A → opt(x) A … (recursion hidden behind a nullable prefix) would be routed through parseLeftRec by one engine and treated as an ordinary rule by the other → real CST divergence emit-parser.ts:121gen-parser.ts:309-346 latent divergence (gate can't catch) high
A4 both-sides-identically-wrong risk: the gate proves emit≡interp, never interp≡correct. The largest surface is two hand-copied ~180-line SECOND-set fixpoints (each carrying a "MUST stay algorithm-identical" self-warning); a mis-derivation hand-copied into both produces the same wrong CST and the gate reports agree emit-parser.ts:342-520gen-parser.ts:594-734 real risk medium

Single-source feasibility (assessed):

  • Layer A (static analysis) is eliminable, low-risk, high-value — they are pure functions of CstGrammar; move them to a shared module. The only friction: emit's FIRST is the richer qualKeys variant, so unifying means lifting the interpreter to it too (a sound refinement, guarded by the existing conformance suite).
  • Layer B (control loops) should NOT be unified — the interpreter runs object CSTs + a Map memo + zero incremental state; the emitter runs an SoA arena + a generation-stamped typed-array memo + adoption/recovery/EOF-relative spans. Dragging that into the interpreter destroys its value as a simple, independently-written oracle (an oracle sharing the suspect machinery can't catch bugs in it). The fix is verification, not unification: make the gate CI-blocking and keep the two hand-written loops, proving equivalence every CI run.

B. emit-lexer

# Title Location Type Confidence
B1 DFA emitter is a wired-up-and-ready dead path: emitTokenScannerBody has zero external callers, emit-lexer always emits new RegExp; the "~1.3–1.6×" claim is unmeasured, and the only measured DFA path (interpreter) is net-negative on all 12 TS tokens (Ident 0.30×…) token-dfa.ts:319, emit-lexer.ts:104 real gap (deliver+measure or delete+drop the unsupported claim) high
B2 windowed re-lex is an emit-only orphan: the whole resync/findRestart/reconstructParens machinery has no counterpart in gen-lexer.ts, so there is no emit↔runtime differential oracle — only an emit-self-consistency guard emit-lexer.ts:251-674 real gap (missing independent oracle) medium
B3 resync retract+diag-truncate logic duplicated verbatim: the identical one-liner appears inside the loop and after it; the two can drift independently (extremely hard to catch by corpus — needs a resync exactly at the EOF boundary) emit-lexer.ts:362 and :613 real gap (de-dup) high
B4 non-ASCII whitespace regex fallback duplicated + per-position slow path: every non-whitespace cc>127 start char first fires an LX_WS.exec; the non-ASCII codepoints of \s are a fixed small set that can be baked into a charCode test emit-lexer.ts:372-376 and :379-383 real gap (bakeable) medium
B5 char-loop fast path and lexKwT re-scan the same bytes: the identifier hot path scans [pos+1,p) twice (once in the fast loop, once char-by-char in lexKwT); a length sentinel could short-circuit lexKwT when the run exceeds maxKw emit-lexer.ts:529-531 vs :179-181 real gap (needs a benchmark to confirm the fusion form) medium

C. Incremental / recovery / arena

# Title Location Type Confidence
C1 arena monotonic growth: edit() only appends (nodeN/kidN are reset only in full parse() at 3695-3696; editCore never resets), no free-list/compaction; self-noted at 3742-3743 "long edit sessions grow until then" emit-parser.ts:3695-3696 (reset), 3399-3429 (leak) real gap (design-known) high
C2 deletion-shaped splice could be in-place but takes end-allocation and leaks: the in-place branch only fires on f===removed (kid count unchanged); when kid count shrinks (n2k<nD, e.g. deleting a statement) it fits the original range yet takes the else end-allocation emit-parser.ts:3391 vs :3398 real gap (relax to n2k<=nD in-place) medium
C3 SURG_ELEM table is too narrow: building it requires a pure seq/group around exactly one */+ with no alt/opt/?/multiple reps; any list rule with an optional element or alternative (trailing comma, member list with modifiers, union body) gets no surgery, so edits inside it always take a full adoption re-parse emit-parser.ts:2062-2083 (build), 3311-3320 (use) real gap (relaxable: opts around the rep region survivable via the rowKC watermark argument) medium
C4 recovery second pass re-runs the entry rule from pos=0 every attempt + discards the adoptPath cache each attempt, up to 33 passes; adoption mitigates row rebuild but not the entry walk or cache rebuild emit-parser.ts:4069-4086, 4075/4095 real gap (keep cache across attempts + resume from the bar-free prefix) medium
C5 strict full-pass success path clears docPar and rebuilds docDiags with a full sort, instead of the surgery path's in-place shift (shiftDiags) emit-parser.ts:4046-4048 / 3110 minor gap medium
C6 surgery hit-rate in recovery state is markedly lower than in strict state: an edit near any recovery bar exits surgery (position-pure commutativity gate); "type-with-red-squiggles" sessions are affected emit-parser.ts:3272/3275/3342/3383 necessary tradeoff (note) high

D. matcher / perf

# Title Location Type Confidence
D1 inline alt has no predictive dispatch: rule-level longest-match has altMaskDispatch bitmask + SECOND refinement (emit-parser.ts:1190), but an alt inside a seq is still a naive shared-save try-each-arm; FIRST-disjoint inline alts could switch-dispatch, and 2-arm ones could "pick the arm from the first token, no save/restore" (grammar scale: typescript.ts 78 alt(, javascript.ts 43) emit-parser.ts:895 real gap medium

Correction: I briefly suspected "packrat memo is a cold-parse burden" — retracted after reading parseRuleEntry (emit-parser.ts:2462-2552). The memo is already a per-rule typed-array (a lookup is two array loads, a store allocates nothing — not a Map), opened only for pratt/left-rec rules, and memoExt doubles as the infrastructure for incremental invalidation (intersected with an edit's damage window) and recovery cross-attempt reuse (barFreeWin). It is both cheap and selectively opened, and it is an incremental asset — not a gap.


Verified non-gaps (do not redo)

  • 32-arm dispatch cliff: already solved with multi-word masks (emit-parser.ts:1197, deliberately no arm-count ceiling).
  • incremental "edit one item in a big list rebuilds a slice": already mitigated by node surgery (emit-parser.ts:3228), full adoption only as the fallback.
  • packrat memo: see the correction under D — typed-array, zero-alloc store, selectively opened, incremental/recovery infrastructure.
  • most trySurgery return -1 (F1/F2/F3/F5/F7: cross-kid descent stop, swallowing a non-element row, zero-width non-advance, >-splice detach, out-of-range pure insert, whole-file token count not closing) = necessary correctness gates.
  • collectErrRows is genuinely O(error paths) (walks only the rowRM spine, emit-parser.ts:3072).
  • markup/indent/newline not emitted: intentional scope cut, explicit return null falling back to createLexer (emit-lexer.ts:33-34), and emit-lexer-verify.ts errors if a grammar that should emit falls back.
  • template open / first-char filter / punct chain / lexKwT perfect-hash: emit already bakes these, some better than the runtime.

Verification notes

  • Personally grep-confirmed: the two parity gates are absent from check.ts/package.json/.github/; the DFA emitter has zero external callers; arena nodeN=0/kidN=0 appear only in declarations and full parse().
  • One of the four review axes (matcher/perf) hit a 503 server error mid-run; that axis was finished by hand reading emit-parser.ts:895 (inline alt) and 2452-2560 (memo), so the D section is leaner — a full per-case perf sweep of matchInto/matchFn was not done. Re-run if an exhaustive pass is wanted.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions