Emit backend gap audit: engine-parity automation, static-analysis single-source, arena reclamation, lexer dead-path

Reviewed `src/emit-parser.ts` (4223 lines) and `src/emit-lexer.ts` (687 lines) across four axes (matcher/perf, incremental·recovery·arena, lexer, engine-parity debt), excluding already-shipped optimizations (SoA token stream, columnar arena, multi-word bitmask dispatch, SECOND-token refinement, node surgery, run-adoption). Each item carries `file:line` + confidence, and distinguishes a **real gap** from a **necessary tradeoff**.

> The review twice hit "looks like a gap, actually already fixed" (the 32-arm dispatch cliff is solved with multi-word masks `emit-parser.ts:1197`; the incremental "rebuild a whole slice on one keystroke" is already mitigated by node surgery `emit-parser.ts:3228`), so only code-backed real gaps are kept below.

## Highest leverage (suggested priority)

1. **A2 — wire the engine-parity gates into CI** (cheapest, removes the largest risk). `emit-parser-verify.ts` and `emit-reject-messages.ts` are **not** in `check.ts` / `package.json` / `.github/` (grep-confirmed zero matches); they run manually and depend on `/tmp/ts-repo`. Editing `gen-parser.ts` semantics has **no mechanism** — automated or human-enforced — that makes `emit-parser.ts` follow; it relies on a developer remembering to hand-sync both files and then manually running an un-wired gate against a corpus they must have cloned.
2. **A1 — static-analysis single-source**. `analyze()` (`emit-parser.ts:57-621`) is a near-verbatim rewrite of `gen-parser.ts:39-759` — ~560 lines of duplicated pure functions, and it hides a genuinely divergent algorithm (see A3's `isLeftRecursive`). Extractable into `src/grammar-analysis.ts` shared by both engines (precedent already exists: `grammar-utils.ts` shares `collectLiterals`/`isKeywordLiteral`).
3. **C1 — arena monotonic growth**. `edit()` only appends, never reclaims; no free-list / compaction; only a full `parse()` resets the arena (which `edit()` never triggers). LSP-style long-lived sessions grow unbounded.
4. **B1 — DFA emitter dead path**. `emitTokenScannerBody` (`token-dfa.ts:319`) has zero external callers (confirmed), its "~1.3–1.6×" claim has no gate measurement, and the only measured DFA path (the interpreter) is net-negative on all 12 TS tokens.

---

## A. Engine-parity debt (emit-parser ↔ gen-parser)

`gen-parser.ts` is the `createParser` runtime interpreter (correctness oracle); `emit-parser.ts` compiles the same grammar into a standalone parser whose CST must be byte-identical. The debt splits into two layers of different shape.

| # | Title | Location | Type | Confidence |
|---|-------|----------|------|------------|
| A1 | **~560 lines of static analysis duplicated** (binding-power, ledPrec, FIRST/SECOND, nullability, mixfix, access-tail…) — pure `CstGrammar→data` functions, no technical reason to write twice | `emit-parser.ts:57-621` ↔ `gen-parser.ts:39-759` | real debt (single-source) | high |
| A2 | **the two parity gates are not automated**: `emit-parser-verify` / `emit-reject-messages` are absent from `check.ts`/CI, run by hand, depend on `/tmp/ts-repo`; a `gen-parser` change has no mechanism propagating to `emit-parser` | `test/check.ts:17-52`, `.github/workflows/ci.yml:46` | real debt (wire into CI) | high |
| A3 | **`isLeftRecursive` is a different algorithm**: gen-parser uses a left-corner transitive closure and rejects indirect/hidden LR at build time (`gen-parser.ts:309-346`); emit-parser does only the syntactic `items[0]===self` test (`emit-parser.ts:121-127`). They coincide on the current grammars, but `A → opt(x) A …` (recursion hidden behind a nullable prefix) would be routed through `parseLeftRec` by one engine and treated as an ordinary rule by the other → real CST divergence | `emit-parser.ts:121` ↔ `gen-parser.ts:309-346` | latent divergence (gate can't catch) | high |
| A4 | **both-sides-identically-wrong risk**: the gate proves `emit≡interp`, never `interp≡correct`. The largest surface is two hand-copied ~180-line SECOND-set fixpoints (each carrying a "MUST stay algorithm-identical" self-warning); a mis-derivation hand-copied into both produces the same wrong CST and the gate reports `agree` | `emit-parser.ts:342-520` ↔ `gen-parser.ts:594-734` | real risk | medium |

**Single-source feasibility (assessed):**
- **Layer A (static analysis) is eliminable, low-risk, high-value** — they are pure functions of `CstGrammar`; move them to a shared module. The only friction: emit's FIRST is the richer qualKeys variant, so unifying means lifting the interpreter to it too (a sound refinement, guarded by the existing conformance suite).
- **Layer B (control loops) should NOT be unified** — the interpreter runs object CSTs + a Map memo + zero incremental state; the emitter runs an SoA arena + a generation-stamped typed-array memo + adoption/recovery/EOF-relative spans. Dragging that into the interpreter destroys its value as a simple, independently-written oracle (an oracle sharing the suspect machinery can't catch bugs in it). **The fix is verification, not unification: make the gate CI-blocking** and keep the two hand-written loops, proving equivalence every CI run.

---

## B. emit-lexer

| # | Title | Location | Type | Confidence |
|---|-------|----------|------|------------|
| B1 | **DFA emitter is a wired-up-and-ready dead path**: `emitTokenScannerBody` has zero external callers, emit-lexer always emits `new RegExp`; the "~1.3–1.6×" claim is unmeasured, and the only measured DFA path (interpreter) is net-negative on all 12 TS tokens (Ident 0.30×…) | `token-dfa.ts:319`, `emit-lexer.ts:104` | real gap (deliver+measure or delete+drop the unsupported claim) | high |
| B2 | **windowed re-lex is an emit-only orphan**: the whole resync/findRestart/reconstructParens machinery has no counterpart in `gen-lexer.ts`, so there is no emit↔runtime differential oracle — only an emit-self-consistency guard | `emit-lexer.ts:251-674` | real gap (missing independent oracle) | medium |
| B3 | **resync retract+diag-truncate logic duplicated verbatim**: the identical one-liner appears inside the loop and after it; the two can drift independently (extremely hard to catch by corpus — needs a resync exactly at the EOF boundary) | `emit-lexer.ts:362` and `:613` | real gap (de-dup) | high |
| B4 | **non-ASCII whitespace regex fallback duplicated + per-position slow path**: every non-whitespace `cc>127` start char first fires an `LX_WS.exec`; the non-ASCII codepoints of `\s` are a fixed small set that can be baked into a charCode test | `emit-lexer.ts:372-376` and `:379-383` | real gap (bakeable) | medium |
| B5 | **char-loop fast path and `lexKwT` re-scan the same bytes**: the identifier hot path scans `[pos+1,p)` twice (once in the fast loop, once char-by-char in lexKwT); a length sentinel could short-circuit lexKwT when the run exceeds maxKw | `emit-lexer.ts:529-531` vs `:179-181` | real gap (needs a benchmark to confirm the fusion form) | medium |

---

## C. Incremental / recovery / arena

| # | Title | Location | Type | Confidence |
|---|-------|----------|------|------------|
| C1 | **arena monotonic growth**: `edit()` only appends (`nodeN`/`kidN` are reset only in full `parse()` at `3695-3696`; `editCore` never resets), no free-list/compaction; self-noted at `3742-3743` "long edit sessions grow until then" | `emit-parser.ts:3695-3696` (reset), `3399-3429` (leak) | real gap (design-known) | high |
| C2 | **deletion-shaped splice could be in-place but takes end-allocation and leaks**: the in-place branch only fires on `f===removed` (kid count unchanged); when kid count *shrinks* (`n2k<nD`, e.g. deleting a statement) it fits the original range yet takes the `else` end-allocation | `emit-parser.ts:3391` vs `:3398` | real gap (relax to `n2k<=nD` in-place) | medium |
| C3 | **`SURG_ELEM` table is too narrow**: building it requires a pure seq/group around exactly one `*`/`+` with no `alt`/`opt`/`?`/multiple reps; any list rule with an optional element or alternative (trailing comma, member list with modifiers, union body) gets no surgery, so edits inside it **always** take a full adoption re-parse | `emit-parser.ts:2062-2083` (build), `3311-3320` (use) | real gap (relaxable: opts around the rep region survivable via the rowKC watermark argument) | medium |
| C4 | **recovery second pass re-runs the entry rule from `pos=0` every attempt** + discards the `adoptPath` cache each attempt, up to 33 passes; adoption mitigates row rebuild but not the entry walk or cache rebuild | `emit-parser.ts:4069-4086`, `4075`/`4095` | real gap (keep cache across attempts + resume from the bar-free prefix) | medium |
| C5 | **strict full-pass success path clears `docPar` and rebuilds docDiags with a full `sort`**, instead of the surgery path's in-place shift (`shiftDiags`) | `emit-parser.ts:4046-4048` / `3110` | minor gap | medium |
| C6 | **surgery hit-rate in recovery state is markedly lower than in strict state**: an edit near any recovery bar exits surgery (position-pure commutativity gate); "type-with-red-squiggles" sessions are affected | `emit-parser.ts:3272`/`3275`/`3342`/`3383` | necessary tradeoff (note) | high |

---

## D. matcher / perf

| # | Title | Location | Type | Confidence |
|---|-------|----------|------|------------|
| D1 | **inline alt has no predictive dispatch**: rule-level longest-match has `altMaskDispatch` bitmask + SECOND refinement (`emit-parser.ts:1190`), but an `alt` inside a seq is still a naive shared-save try-each-arm; FIRST-disjoint inline alts could switch-dispatch, and 2-arm ones could "pick the arm from the first token, no save/restore" (grammar scale: typescript.ts 78 `alt(`, javascript.ts 43) | `emit-parser.ts:895` | real gap | medium |

> **Correction**: I briefly suspected "packrat memo is a cold-parse burden" — **retracted** after reading `parseRuleEntry` (`emit-parser.ts:2462-2552`). The memo is already a per-rule typed-array (a lookup is two array loads, a store allocates nothing — not a Map), opened only for pratt/left-rec rules, and `memoExt` doubles as the infrastructure for **incremental invalidation** (intersected with an edit's damage window) and **recovery cross-attempt reuse** (`barFreeWin`). It is both cheap and selectively opened, and it is an incremental asset — not a gap.

---

## Verified non-gaps (do not redo)

- **32-arm dispatch cliff**: already solved with multi-word masks (`emit-parser.ts:1197`, deliberately no arm-count ceiling).
- **incremental "edit one item in a big list rebuilds a slice"**: already mitigated by node surgery (`emit-parser.ts:3228`), full adoption only as the fallback.
- **packrat memo**: see the correction under D — typed-array, zero-alloc store, selectively opened, incremental/recovery infrastructure.
- **most `trySurgery` `return -1`** (F1/F2/F3/F5/F7: cross-kid descent stop, swallowing a non-element row, zero-width non-advance, `>`-splice detach, out-of-range pure insert, whole-file token count not closing) = necessary correctness gates.
- **`collectErrRows`** is genuinely O(error paths) (walks only the rowRM spine, `emit-parser.ts:3072`).
- **markup/indent/newline not emitted**: intentional scope cut, explicit `return null` falling back to `createLexer` (`emit-lexer.ts:33-34`), and `emit-lexer-verify.ts` errors if a grammar that should emit falls back.
- **template open / first-char filter / punct chain / lexKwT perfect-hash**: emit already bakes these, some **better** than the runtime.

---

## Verification notes

- Personally grep-confirmed: the two parity gates are absent from `check.ts`/`package.json`/`.github/`; the DFA emitter has zero external callers; arena `nodeN=0`/`kidN=0` appear only in declarations and full `parse()`.
- One of the four review axes (matcher/perf) hit a 503 server error mid-run; that axis was finished by hand reading `emit-parser.ts:895` (inline alt) and `2452-2560` (memo), so the D section is leaner — a full per-case perf sweep of `matchInto`/`matchFn` was not done. Re-run if an exhaustive pass is wanted.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Emit backend gap audit: engine-parity automation, static-analysis single-source, arena reclamation, lexer dead-path #45

Highest leverage (suggested priority)

A. Engine-parity debt (emit-parser ↔ gen-parser)

B. emit-lexer

C. Incremental / recovery / arena

D. matcher / perf

Verified non-gaps (do not redo)

Verification notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

#	Title	Location	Type	Confidence
A1	~560 lines of static analysis duplicated (binding-power, ledPrec, FIRST/SECOND, nullability, mixfix, access-tail…) — pure `CstGrammar→data` functions, no technical reason to write twice	`emit-parser.ts:57-621` ↔ `gen-parser.ts:39-759`	real debt (single-source)	high
A2	the two parity gates are not automated: `emit-parser-verify` / `emit-reject-messages` are absent from `check.ts`/CI, run by hand, depend on `/tmp/ts-repo`; a `gen-parser` change has no mechanism propagating to `emit-parser`	`test/check.ts:17-52`, `.github/workflows/ci.yml:46`	real debt (wire into CI)	high
A3	`isLeftRecursive` is a different algorithm: gen-parser uses a left-corner transitive closure and rejects indirect/hidden LR at build time (`gen-parser.ts:309-346`); emit-parser does only the syntactic `items[0]===self` test (`emit-parser.ts:121-127`). They coincide on the current grammars, but `A → opt(x) A …` (recursion hidden behind a nullable prefix) would be routed through `parseLeftRec` by one engine and treated as an ordinary rule by the other → real CST divergence	`emit-parser.ts:121` ↔ `gen-parser.ts:309-346`	latent divergence (gate can't catch)	high
A4	both-sides-identically-wrong risk: the gate proves `emit≡interp`, never `interp≡correct`. The largest surface is two hand-copied ~180-line SECOND-set fixpoints (each carrying a "MUST stay algorithm-identical" self-warning); a mis-derivation hand-copied into both produces the same wrong CST and the gate reports `agree`	`emit-parser.ts:342-520` ↔ `gen-parser.ts:594-734`	real risk	medium

#	Title	Location	Type	Confidence
B1	DFA emitter is a wired-up-and-ready dead path: `emitTokenScannerBody` has zero external callers, emit-lexer always emits `new RegExp`; the "~1.3–1.6×" claim is unmeasured, and the only measured DFA path (interpreter) is net-negative on all 12 TS tokens (Ident 0.30×…)	`token-dfa.ts:319`, `emit-lexer.ts:104`	real gap (deliver+measure or delete+drop the unsupported claim)	high
B2	windowed re-lex is an emit-only orphan: the whole resync/findRestart/reconstructParens machinery has no counterpart in `gen-lexer.ts`, so there is no emit↔runtime differential oracle — only an emit-self-consistency guard	`emit-lexer.ts:251-674`	real gap (missing independent oracle)	medium
B3	resync retract+diag-truncate logic duplicated verbatim: the identical one-liner appears inside the loop and after it; the two can drift independently (extremely hard to catch by corpus — needs a resync exactly at the EOF boundary)	`emit-lexer.ts:362` and `:613`	real gap (de-dup)	high
B4	non-ASCII whitespace regex fallback duplicated + per-position slow path: every non-whitespace `cc>127` start char first fires an `LX_WS.exec`; the non-ASCII codepoints of `\s` are a fixed small set that can be baked into a charCode test	`emit-lexer.ts:372-376` and `:379-383`	real gap (bakeable)	medium
B5	char-loop fast path and `lexKwT` re-scan the same bytes: the identifier hot path scans `[pos+1,p)` twice (once in the fast loop, once char-by-char in lexKwT); a length sentinel could short-circuit lexKwT when the run exceeds maxKw	`emit-lexer.ts:529-531` vs `:179-181`	real gap (needs a benchmark to confirm the fusion form)	medium

#	Title	Location	Type	Confidence
C1	arena monotonic growth: `edit()` only appends (`nodeN`/`kidN` are reset only in full `parse()` at `3695-3696`; `editCore` never resets), no free-list/compaction; self-noted at `3742-3743` "long edit sessions grow until then"	`emit-parser.ts:3695-3696` (reset), `3399-3429` (leak)	real gap (design-known)	high
C2	deletion-shaped splice could be in-place but takes end-allocation and leaks: the in-place branch only fires on `f===removed` (kid count unchanged); when kid count shrinks (`n2k<nD`, e.g. deleting a statement) it fits the original range yet takes the `else` end-allocation	`emit-parser.ts:3391` vs `:3398`	real gap (relax to `n2k<=nD` in-place)	medium
C3	`SURG_ELEM` table is too narrow: building it requires a pure seq/group around exactly one ``/`+` with no `alt`/`opt`/`?`/multiple reps; any list rule with an optional element or alternative (trailing comma, member list with modifiers, union body) gets no surgery, so edits inside it always* take a full adoption re-parse	`emit-parser.ts:2062-2083` (build), `3311-3320` (use)	real gap (relaxable: opts around the rep region survivable via the rowKC watermark argument)	medium
C4	recovery second pass re-runs the entry rule from `pos=0` every attempt + discards the `adoptPath` cache each attempt, up to 33 passes; adoption mitigates row rebuild but not the entry walk or cache rebuild	`emit-parser.ts:4069-4086`, `4075`/`4095`	real gap (keep cache across attempts + resume from the bar-free prefix)	medium
C5	strict full-pass success path clears `docPar` and rebuilds docDiags with a full `sort`, instead of the surgery path's in-place shift (`shiftDiags`)	`emit-parser.ts:4046-4048` / `3110`	minor gap	medium
C6	surgery hit-rate in recovery state is markedly lower than in strict state: an edit near any recovery bar exits surgery (position-pure commutativity gate); "type-with-red-squiggles" sessions are affected	`emit-parser.ts:3272`/`3275`/`3342`/`3383`	necessary tradeoff (note)	high

Emit backend gap audit: engine-parity automation, static-analysis single-source, arena reclamation, lexer dead-path #45

Description

Highest leverage (suggested priority)

A. Engine-parity debt (emit-parser ↔ gen-parser)

B. emit-lexer

C. Incremental / recovery / arena

D. matcher / perf

Verified non-gaps (do not redo)

Verification notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions