Total parse/edit: never crash on input, errors as data, equivalence gated (closes #39)#40
Open
johnsoncodehk wants to merge 50 commits into
Open
Total parse/edit: never crash on input, errors as data, equivalence gated (closes #39)#40johnsoncodehk wants to merge 50 commits into
johnsoncodehk wants to merge 50 commits into
Conversation
…lence class parse/edit on the handle API never crash on input: the STRICT pass runs first (valid path byte-identical, full PEG arm exploration - gated by test/recovery.ts section 1 and the untouched parity suite), and only a strict reject re-parses under the recovery machinery: - Repetition recovery at spine-shaped loops (ref / alt-of-refs elements; deep-FIRST hooks measured 273-error cascades from arm probing and were reverted): a failing element absorbs tokens into an $error row up to the element FIRST set / the enclosing seq's follower literal / EOF. - BAR DISCIPLINE keeps recovery equivalence-safe and arm-blind: fires only where parsing is STUCK AT a strict-proven fail point (pos <= bar <= maxPos <= bar+2, stateless so losing arms cannot consume bars); failures past the bars abort the attempt and mint the next bar (32-attempt cap degrades to deterministic free-fire). The runParse safety net obeys the same discipline. - The lexer recovers under the same flag (error tokens + structured diagnostics; window truncation keeps the LEX_RETRY regrow path). - Diagnostics are DERIVED, not collected: $error rows found by descending the structurally-propagated rowRM spine (per-pass candidate lists double-counted under stateless re-adoption); lexer diagnostics live as structured entries formatted at settle time (stored message strings would embed stale offsets), maintained by the window splice and shifted by surgery. - Recovered streams break two strict-era invariants, both fixed: windowed relexing must anchor BELOW the earliest lexer diagnostic before the damage (a dangling quote pairs with a later edit - backward coupling; forward coupling is already guarded by resync equality), and rows built during a recovering pass may under-record their probe watermark when any arm fired recovery (recFires stamping refuses them to strict adoption; relocate-path surgery also normalizes copied prefix rels - an end-relative value below the remapped rowNF boundary would drift on every later length update). - '>' splits disable adoption for the rest of the parse (the frozen damage mapping is invalid after a mid-parse token-index shift). Gates: incremental-verify reworked to total semantics (every step compares tree+errors against a fresh recovering handle, 128 steps 0 mismatch), multi-doc reworked (60 interleaved steps incl. broken text, contract 9/9), 31/31 suite, strict parity 0 mismatches. KNOWN RESIDUAL (test/recovery.ts, not yet registered): typing-through- invalid session diverges at 1 of 20 keystrokes - a strict pass-1 edit ADOPTING over a post-recovery tree drops one Pratt wrap layer vs a fresh strict parse (single-keystroke repro in the gate; suspected adoption interplay with LED chains on recovering-built substrate).
…valence gated
The residual typing-session divergence traced to a watermark contract
violation that PREDATES recovery and was latent in strict incremental
parsing: a Pratt rule's winning row is finishNode'd BEFORE its failed
LED extension arms run (the NUD/shorter candidate survives the longest
match), so rowExt under-records the rule's true probe extent. The memo
watermark (maxPos at parseRuleEntry exit) was always correct - but the
memo dies with its generation, and ADOPTION reads the row. An edit
landing inside a failed arm's reads then kept a stale row alive ('const
x = f' adopted with ext=4 while typing ')' at token 20 turns the failed
call arm into a successful one). Strict sessions never caught it
because the texts that exercise it (unclosed calls) REJECT, and the
reject was the firewall; total parsing keeps such trees alive.
Fix: write the rule-level watermark back to the row at memo-store time
(rowExt[result] = max(rowExt, maxPos - start)). This subsumes the
recFires mode stamp (removed - rowRM is purely structural again for the
diagnostics walk), restoring broad strict adoption over recovered
substrates: broken-state keystrokes on 9MB dropped from ~1.6s to the
~0.3s bar-iteration cost (valid-state keystrokes stay at 0.05ms).
test/recovery.ts now fully green and REGISTERED (32/32): valid corpus
byte-identical to strict with empty errors, invalid corpus total and
deterministic, the char-by-char typing session 20/20 keystrokes
equivalent to fresh parses (tree AND errors). The interpreter gains
parseTotal/edit parity (no recovery machinery: degrades to a zero-width
$error root with the strict diagnostic).
incremental-verify 128 steps 0 mismatch, multi-doc 60 steps contract
9/9, strict parity 0 mismatches, lexer streams byte-identical, batch in
band (11.2x), agnostic 9/9.
The seeded mutation lists never inserted a bare ';' — splitting an existing expression's structure (f(a;, b) / (a +; b) / obj.m(;1).n) was covered only by the general machinery, not exercised. Both gates' INSERT pools gain ';' and the glue list gains three explicit break-then-compare pairs; verified break ≡ fresh and restore ≡ original byte-identically (tree and errors) before pinning. Observation for the conformance backlog: several of these broken shapes parse with ZERO errors - the strict grammar itself accepts them (over-accept surface, identical on both engines), not a recovery artifact.
…onsistency The incremental/recovery gates were TypeScript-only while every grammar shares the emitted runtime - the non-TS incremental behavior (markup lexer modes, the fallback-lexer path, other token algebras) was ungated. test/incremental-grammars.ts closes that: generative inputs (grammar-gen) per grammar x seeded char-level edit sessions, each step checking (1) edited tree + errors byte-identical to a fresh handle parse, (2) tree self-consistency - every span inside its ancestors (the engine-internal invariant an external compare misses when both sides share a corruption; the aggressiveChecks idea), and (3) totality. It immediately found three real holes, all fixed: - totalNet pushed its diagnostic into the VIEW layer, which the next settle rebuild wiped on exactly one side (now a kind-4 source entry formatted at settle - verbatim engine message). - the fallback-lexer full-relex path never cleared persisted docLex, so a totality-net diagnostic outlived the edit that fixed the text. - the window resync retracts the duplicated token push (tokN--) but left the lexer diagnostic emitted FOR that token: the persisted entry survives via the suffix shift AND the window's copy stayed - the same character double-reported. Retraction now pops the window's own entries at/after the retracted token (lexDiagBase floor). 672/672 steps across typescript/javascript/typescriptreact/ javascriptreact/yaml/html/vue (489 exercising recovery). 33/33 suite, lexer streams byte-identical, parser parity 0 mismatches, batch in band.
…erved
Required token matchers in recovering mode now synthesize a zero-width
\$missing leaf (expected identity in rowStart, LIT_NAMES/K_NAMES inverse for
the message) instead of failing, so 'const x = f(1, 2;' keeps its Call shape
and reports "expected ')'", and 'function g() { return 1;' closes the body
with "expected '}'". Synthesis is budget-free and position-pure: it fires
iff a recovery bar lies in [pos, pos+2] (missAt), never under probing
(not()/optional/separator probes) and never in free-fire.
Zero-width success is a synthesis-only artifact (a strict zero-width element
would never terminate its loop), so every loop discards it: plain reps break
on pos===before alone (restoring scn), hooked reps discard + recoverSkip,
leftRec continuations and Pratt LEDs refuse zero-width wraps. A rule can
still re-enter ITSELF at the same position through a synthesized leading
token — an unbounded recursion no grammar shape rules out — so recovering
runs keep a (rule, pos) in-progress set and fail the re-entry (PEG cycle
semantics; recRunning, zero strict-path cost). That sentinel also dissolved
the bar +1 ladders the recursion crashes were minting: broken-doc recovery
drops ~9x in the incremental gate (10.7s -> 1.2s).
Equivalence (edit == fresh) exposed that the bar protocol's input was not
adoption-invariant; three structural fixes:
- frameMax: a frame-local advance watermark (reset to the rule's start at
entry, folded into the parent on exit) replaces the global maxPos in
rowExt/memo watermarks, making recorded probe reaches EXACT instead of
contaminated by earlier-sibling probes. Bars (= strict-fail maxPos) now
reconstruct identically under adoption; the hot advance pays one extra
compare only at frontier breaches (frameMax <= maxPos nests the updates).
This also closes the recorded "exact per-frame extents" backlog item and
lands the bar on the true farthest probe (no more phantom synthesis from
inflated memo-jump watermarks).
- Recovery runs are adoption-free (edit-side attempt loop AND the
lex-recovered first run): a row recorded under a recovering frame carries
that run's bar-dependent reach, so replaying it makes the next bar a
function of the OLD bar history instead of (text, bars). Attempt 0 (empty
bars, behaviorally strict) re-derives the true strict frontier; every
attempt is byte-equal to the fresh side's. The barIn adoption-refusal
window from the first synthesis attempt is dead under this rule and
removed; adoptSeek's recovering rowRM bypass likewise.
- trySurgery refuses recovery-made trees (rowRM reaches the root
structurally): a strict splice into kept \$error/\$missing siblings was a
fake strict success that froze the OLD text's recovery shape, shifted.
Gates: incremental-grammars 672/672 across 7 grammars; recovery.ts gains a
synthesis-quality section (exact diagnostics + \$missing presence) and 4
session-found invalid shapes; incremental-verify gains the 5 protocol-pin
GLUE pairs; multi-doc 60/60 + contract 9/9; check suite 33/33; corpus
parity 401/401 sample, lexer parity 5695; perf-bench PASS (worst 803ms vs
802ms baseline; 9MB valid keystroke unregressed). verify-rejects: a tsc
Debug.assert crash on 'await using' shapes is counted as ORACLE-CRASH and
skipped (a crashed oracle has no verdict) instead of killing the gate.
Required RULE references failing inside the bar window now mint a zero-width
\$missing row carrying the rule identity (RULE_MISS_BASE + rid in rowStart),
reported as "expected Expr": 'const a = ;' / 'a + ;' / '-;' / 'x ? y : ;' /
'a, ;' / 'f(1, ;' all produce a single tsc-grade diagnostic at the right
offset. Hooks: parseRuleEntry's fail exit (memoized like any result) plus the
three Pratt rhs sites that bypass rule entries (operator LED, prefix NUD,
chain-rhs LED).
Synthesis placement follows COMMITMENT semantics, replacing the flat
probing counter for optionals: an optional group or repetition element may
fail freely while uncommitted (probeBase = its start; 'the optional thing is
absent' / 'the list ends' need no diagnostic), but once it consumes a real
token past that base, missing pieces synthesize — 'const a = ;' commits at
'=' and mints the Expr; rep(seq(',', Expr)) cannot mint a phantom ',' to
keep a list alive, yet after a real ',' the element synthesizes. not() and
separator probes stay absolutely suppressed (pure lookahead). FIRST-token
call-site guards open under recovering (one global read on the strict
guard-fail path): at a bar the next token is exactly what cannot start the
rule, and the hook lives inside parseRuleEntry — 'a, ;' must reach it.
Two latent bugs fixed in passing, both found by the new shapes:
- The frameMax conversion in the previous commit was double-applied at the
12 token-advance sites by a patch-script composition hole (edit #3's
pattern matched text edit #2 had just inserted; the anchor counts were
asserted on the pre-edit source), leaving the nested inner test
unreachable — token consumes never raised the global maxPos, so bars were
minted from a watermark that only memo jumps could move. Equivalence
gates stayed green because both engines ran the same wrong protocol;
the synthesis quality work surfaced it as losing-arm wins. Advances now
pair frameMax/maxPos correctly.
- The memo-jump coordinate refresh read toff(start) unguarded; for a
zero-width row minted AT EOF, start == tokN reads past the token columns
(stale slots from a longer previous document under handle reuse) — the
recovery gate's in-bounds check caught an "expected Expr" at offset 8 in
a 5-char document. The refresh now uses the same EOF guard as offset().
recovery.ts synthesis pins 3 -> 9 (the six nonterminal shapes above, exact
diagnostics + \$missing presence). All gates green: incremental-grammars
672/672, incremental-verify 136 steps, multi-doc 60 + 9/9, recovery
valid/invalid/typing/synthesis, suite 33/33, perf-bench PASS, 9MB fresh
438ms / valid keystroke warm ~0.6-5ms / breaking 649ms / while-broken
438ms / fixing 368ms (broken-state costs are the recorded follow-up).
Typing in a broken 9MB document drops from ~440ms to ~3-7ms per keystroke (avg 3.2ms over a 10-keystroke burst; incremental gate 9.9x vs fresh on its mixed valid/broken sessions). Recovery runs now ADOPT rows from the previous tree again — soundly this time, by making every recovery decision a pure function of the row's window: - recoverArmed takes (from, reach): a hook arms iff THE FAILING ELEMENT is stuck at a bar — its own frame-local probe reach (staged frameMax around hooked-loop elements) sits on the bar. The old form read the GLOBAL maxPos, so a frontier parked on a far bar could arm an unrelated loop whose own probes never approached it — a decision no window can reproduce. The runParse nets pass (pos, maxPos): top-level semantics unchanged. - barsWindowEq: a row adopts in a recovering run iff the bars inside its window [start, reach+2] are IDENTICAL (shifted) to the bars the build run saw there — with position-pure decisions, window text + window bars determine the frame's behavior completely, including losing-arm fires and synthesis. lastBars rides the document register set; strict trees carry [], free-fire trees null (free-fire is not bar-pure - never adopted while recovering). rowRM rows are adoptable under the predicate (the error region itself is what stays stable across far edits), and runExtend re-checks per member. The blanket adoption-off in the bar iteration and the lex-recovered first run is removed; attempt 0 (no bars) adopts exactly where the build run was also bar-free. The changed fire pattern exposed a latent message-derivation bug present in committed code: collectErrRows decoded a \$error row's first kid as a token leaf unconditionally, but the runParse leftover net builds a WRAPPER \$error whose kids are nodes ([partial-root, tail-error]) - (~nodeId)>>>2 indexed a garbage column, docText read text from an unrelated offset, and the two text layers (contiguous string vs pieces) resolved the garbage differently, which is how the gate caught it (equal trees, different messages). Wrapper-shaped \$error rows now fall through to the generic descent so the tail derives its message from its real first token. All equivalence gates green (incremental-grammars 672/672, incremental-verify 136 steps, multi-doc, recovery incl. synthesis pins 9/9), suite 33/33, perf-bench PASS, strict corpus parity intact. 9MB: fresh ~508ms, breaking keystroke ~409ms (the absorbed error region re-parses; recorded follow-up with fix-transition ~395ms), keystrokes while broken 3-13ms.
Recovery attempts within one sequence parse the same token stream under a monotonically growing bar list, so a memo entry from an earlier attempt is provably valid in a later one when its probe window [start, mx+2] contains no bars: no bars means no synthesis and no skip arming, and the opened dispatch guards only add non-consuming probes - the frame behaved strictly, a pure function of the window text. The one exception is the recRunning cycle refusal, which can fire without synthesis (open guards let a ref chain cycle at one position) and depends on which frames are on the stack. recRunning now maps each frame to an entry serial; a refusal leaning on a frame entered before the current one taints the current frame's memo entry (stamped -memoGenCur: reusable only in its own generation, and propagating the taint to whoever reuses it). This is the diagnosed hole that sank the first survival attempt. Survival is edit-side only: the fresh-parse attempt loop calls parseCore, which resets the arena cursor per attempt, so an earlier attempt's rows are clobbered there. A mid-parse '>'-splice disables survival for the rest of the sequence (pre-split positions can't be revalidated). Also removes recFires (dead since the rowExt write-back subsumed the recFires stamp). 9MB transitions: breaking 335ms -> 157ms, fixing 230ms -> 146ms (both now lexer-bound); while-broken typing 3.4ms unchanged. All equivalence gates green: incremental-grammars 672/672, incremental-verify 136, multi-doc 60, recovery pins 9/9, check 33/33, emit-parser corpus parity 401/401.
…liff
The window relex resynced only on exact stack-depth equality, so an edit
that changes paren balance shifts the entire suffix's absolute depth column
and the window regrows to EOF - a 9MB document paid ~130ms of relexing on
every break/fix transition for a one-token depth shift.
The resync now has two sufficient conditions, both proven from observable
state (template stacks empty on both sides; candidate token carries no
cross-token lexer flag a successor reads):
- FAST (O(1)): equal depth and neither lex dipped below it since the
divergence point (damage start) - every open entry is then common to
both lexes, the stacks are content-equal, and every future pop behaves
identically. Trajectory minimums are folded incrementally (old side
seeded from the damage-interior tokens, new side tracked per push).
- SHIFTED: the old suffix never pops an entry open at the candidate
(lazy suffix-min over the old depth records, pop-on-empty = -1): no open
entry's head-ness is ever read again, stack contents are irrelevant, and
the depths may differ by an arbitrary shift. The splice then re-bases the
adopted tkPd column by the shift, restoring true absolute depths ('('
head bits are local facts of their own neighbors and stay valid).
This also closes four latent unsoundness classes in the old equality path:
a resync candidate that is a postfix-ambiguous op, control keyword, '(' or
')' lets the adopted successor read state derived from tokens the window
re-lexed differently; and template-depth equality cannot prove the mutable
interp brace counters equal (resync inside templates now waits for depth
0). Each slides the resync at most a few tokens.
9MB transitions: breaking 157ms -> 5.8ms, fixing 146ms -> 2.9ms; valid
keystroke 1.8ms -> 1.1ms; while-broken typing 3.4ms -> ~2ms. Gates: lexer
parity 5695 diff=0, incremental-grammars 672/672, incremental-verify 136,
multi-doc 60, recovery pins 9/9, check 33/33, corpus parity 401/401,
perf-bench worst 472ms.
trySurgery refused any tree containing recovery rows (rowRM root). It now accepts them when the edit provably commutes with every recovery decision: decisions are position-pure functions of (window text, window bars), so a splice is sound when no bar window touches the damage or the re-parsed span's probe reach - kept rows replay identically at shifted positions, and a fresh recovering parse behaves strictly across the span, exactly like the strict re-parse the surgery runs (a fire inside the span would need a bar at/below the probe reach + 2; prefix attempts use prefixes of the same bar list, so one check against the final list covers every attempt). The spliced tree keeps its bar list with suffix bars shifted by the token delta; bars adjacent to the damage (unmappable) and free-fire trees (lastBars null, not window-pure) refuse. The multi-doc gate immediately caught a latent length bug this exposed: finishNode takes a node's char end from its LAST KID, which a trailing zero-width $missing row pushes past the last real token - but surgery re-derived ancestor lengths from the token columns, clipping that extension. A node whose token end lies strictly beyond the damage now keeps its end shape (rowLen += chrD: every end-determining coordinate sits in the shifted suffix); only nodes ending at/inside the damage use the token derivation (no zero-width row can end them - zero-width rows live at bars, and damage-adjacent bars were refused). Strict trees take either branch to the same value. 9MB while-broken typing now sits at valid-path parity (~1-1.7ms vs ~1ms valid; surgery additionally applies wherever its container shapes allow). Gates: multi-doc 60 + contract 9/9, incremental-grammars 672/672, incremental-verify 136, recovery pins 9/9, check 33/33, corpus parity 401/401.
Two grammar-derived enrichments of the $missing diagnostics, both resolved
at settle from the tree (zero parse-time cost, adoption/replay-safe):
- PAIR_OPEN: for each literal C, intersect - across every seq occurrence of
C with preceding literals in its sequencing scope (groups inlined;
quantifier/alt contents inherit a copy of the scope's accumulator, since
they physically follow its earlier literals; nothing leaks back) - the
sets of those preceding literals. A unique survivor is C's structural
opener: ')' keeps '(' through if/while/call alike, interior separators
intersect away, and ','/':'/'(' themselves die as ambiguous. The closer's
diagnostic then carries related info pointing at the matched opener leaf
found among its earlier siblings ("expected ')'" / "to match this '('"),
with keyword pairs like 'while'<-'do' falling out for free. shiftDiags
shifts the related anchor on its own coordinates (it can sit on the other
side of the damage from its diagnostic - the surgery path caught this).
- Viable-set messages: for a required literal C in a seq, the literals
PROVABLY still accepted when C's matcher fails - repetitions before C are
always re-enterable so their nullable-prefix-reachable literals stay
viable; nullable one-shot items are crossed but contribute nothing (they
may already have consumed). "expected ',' or ']'" therefore never names
an impossible continuation, unlike a static FIRST union (after `[1, 2` an
expression is not viable) - and unlike tsc, which under-reports the same
position as "')' expected". Registered per call site during emission and
threaded through the literal matchers into the $missing row (rowStart
bits 21+; the row is zero-kid, the slot is free), decoded at settle.
cst.errors entries gain an optional related: {offset, end, message} field.
Pins re-pinned (11/11, exact); gates: incremental-grammars 672/672,
incremental-verify 136, multi-doc 60, check 33/33, corpus parity 401/401,
perf-bench unchanged.
test/head-to-head.ts runs one 9MB TypeScript document through identical single-character edit scripts (warm valid keystrokes, a paren-deleting breaking edit, while-broken typing, the fixing edit) on all three engines, with positions recomputed from the current text so every engine sees byte-identical edits and timers wrapping only the engine call. tsc runs setParentNodes=false; node-tree-sitter caps input strings at 32767 chars, so it reads through its 16KB chunk-callback path. Results (node v24, Apple silicon): Monogram beats tsc on every phase (fresh 177 vs 212ms, valid keystroke 0.37 vs 37ms, while-broken 0.21 vs 13.6ms, fixing 1.0 vs 14.1ms) and beats or matches tree-sitter on fresh (177 vs 458ms) and while-broken typing; tree-sitter wins the two transition edits (0.26 vs 13ms breaking), where the strict-first architecture pays one adoption-assisted strict pass to prove rejection before recovering. Numbers + the two byte-identity guarantees added to the README under 'How it measures up'.
test/recovery-conformance.ts: on every single-file conformance test tsc's PARSER rejects (parseDiagnostics non-empty - the live source of the .errors.txt syntax baselines, with semantic noise excluded by definition), compare Monogram's total-parse cst.errors bidirectionally at +/-8 chars: recall (tsc errors we also report): 530/951 = 55.73% precision (our errors tsc also reports): 580/702 = 82.62% first-error agreement: 203/355 = 57.18% files we accept but tsc rejects: 116 The sample divergences localize the gap classes: the accept side is dominated by tsc's context-parameter checks ([Await]/[Yield] parameter positions, reserved names in declaration slots) plus a few CFG-expressible shapes; the missed side is recovery-policy granularity (one absorbed region vs tsc's several pointed diagnostics).
Two syntactic over-accepts found by the diagnostics comparison against tsc:
- parseTemplateExpr (both engines) treated a template HEAD as committing to
nothing: on EOF or any non-middle/tail token after a substitution it
closed the $template node and returned success, so 'let s = `tpl ${x;'
parsed clean. A head now commits to the full chain - every substitution
must hold an expression and every span must continue (middle) or close
(tail); an unterminated template is a parse failure, not a shorter match.
Also rejects empty substitutions ('`${}`'), matching tsc.
- notReservedExpr gains 'case': the bare-identifier expression fallback
accepted the reserved word, so 'switch (x) { case 1 y(); }' parsed as
three statements through the switch body's Stmt arm (the flat
many(SwitchCase) shape made the missing ':' invisible).
A full accept/reject flip scan over the single-file conformance corpus
shows exactly ONE flip: TemplateExpression1.ts (an intentionally-invalid
error test tsc rejects) now correctly rejects - no valid file regressed.
Error-recovery conformance recall 55.7% -> 59.1%; check 33/33, engine
parity 401/401, all 7 generated outputs byte-identical.
TOTAL-PARSING.md: the formal spine in one place - the totality contract, strict-first two-pass structure, the bar discipline with its determinism theorem (bars are a pure function of the token stream, forcing every ingredient to be adoption-invariant), position-pure recovery actions with commitment semantics, the three structural theorems the generative gates forced (zero-width = synthesis-only; same-position cycles and their taint refinement; exact adoption-invariant watermarks), the window-replay theorem with its three corollaries (recovering adoption, cross-attempt memo survival, recovering surgery) and the one known open caveat (row-level taint), the two lexer-resync soundness conditions, tree-derived diagnostics, and the measured head-to-head numbers. test/exhaustive-edits.ts (CI gate 34/34): over a small bracket-and-list grammar, EVERY document up to 4 chars over the grammar's alphabet x EVERY single-character edit (delete/replace/insert at every position) must parse byte-identically to fresh - tree and errors. Complete within its bound: ~330k steps (EXH_MAXLEN=5 runs the 3.2M-step deep version, also clean). The gate immediately earned its keep: it caught a one-case regression in the day-old surgery length update - a node whose BASE token sits at the damage start (leading trivia inserted at a node's very start) shifts base and end together, leaving the length alone, so rowLen += chrD was wrong exactly where the token derivation is right. keepEnd now also requires the base token to sit strictly before the damage.
Phase-timing the head-to-head's 13ms breaking edit: the strict-fail pass is 0.35ms and the recovery attempts 0.6ms - the cost is lexer-layer suffix bookkeeping on the bench's first-touch 4.5MB cursor jump (a one-time suffix-min allocation plus EOF-relative re-basing of the token columns across the jump). Repeated break/fix transitions at one cursor position settle to ~2ms. README and TOTAL-PARSING.md now say so instead of blaming the strict-first pass.
rowRM becomes bitwise: bit 1 keeps the structural error containment the
diagnostics walk descends; bit 2 marks a CONTEXT-TAINTED result - a frame
whose parse leaned on the cycle sentinel finding an ancestor (its outcome
is a function of the ancestor stack, not the text). The memo stamp alone
only protected the entry; the row adoptSeek can find was still reusable.
Tainted rows now also refuse recovering adoption and run extension,
closing the open caveat documented in TOTAL-PARSING.md. Strict adoption
already required rowRM === 0 and is unchanged.
notReservedExpr gains 'class': a valid class expression always out-matches
the bare-identifier fallback under longest-match, so forbidding the
fallback only rejects broken classes - 'const k = class extends D ;' with
no body parsed as three statements. A zero-flip accept/reject scan over
the whole single-file conformance corpus proves no valid shape regressed;
'extends' stays OUT - it is load-bearing for tsc's tolerated heritage
shapes ('interface I extends { }', 'extends A extends B', 'extends
Foo?.Bar' are all parse-accepted by tsc through the fallback, measured).
Gates: 34/34, corpus parity 401/401, generated outputs byte-identical,
transitions unchanged (~6ms first-touch, ~2ms steady).
The shifted lexer resync's dominant case is a depth-0 candidate (statement boundary), where 'the old suffix never pops an entry open at the candidate' collapses to 'no pop-on-empty beyond the candidate'. The lexer now records the token indices of ')' pops that found an empty paren stack (an ascending doc-level list, almost always empty - a stray closer beyond balance), recomposed by the window splice, shifted by the '>'-split, and persisted on the document register set. The depth-0 check is then one end-of-list comparison instead of an O(suffix) minimum build; only depth > 0 candidates (e.g. the fixing direction of a broken document) still build the suffix minimum, lazily once per edit. Steady-state breaking transitions on 9MB drop ~2.1ms -> ~1.6-1.9ms; the profile now reads strict-fail 0.23ms + attempts 0.46ms + spread bookkeeping, with the raw 7-column suffix memmove measured at 0.07ms - no storage floor in the way. README/TOTAL-PARSING tables refreshed from a fresh head-to-head run, with the cursor-jump amortization stated as what it is (a far jump pays once, proportional to distance; local typing never rewrites the suffix). Gates: 34/34, lexer parity 5695 diff=0, incremental-grammars 672/672, corpus parity, perf-bench under ceiling.
notReservedExpr grows by the statement keywords with no expression role:
break, continue, debugger, do, else, finally, for, if, return, switch,
try, while, with. Bare 'if' parsed as an identifier expression, which let
'namespace if {}' (the namespace arm correctly fails its notReserved name)
fall apart into three accepted identifier statements - the same fallback
family as 'case'/'class'. 'var' stays OUT: tsc parse-accepts 'for (var of
X)' through shapes that need it.
Blocking 'for' exposed a real grammar gap the fallback had been MASKING:
'for (a in b[c] = b[c] || [], d)' previously parsed as a CALL of the
identifier 'for' (the for-statement arm failed, the call parse won). The
for-in OBJECT is a full Expression - comma included - so both ForHead
in-arms gain many(',', Expr); for-of keeps a single AssignmentExpression
(tsc rejects 'for (x of a, b)', and so do we, where we previously
mis-accepted it through the call fallback).
Per-flip tsc verdict over the whole single-file conformance corpus:
7 flips, ALL toward tsc, 0 away. Error-recovery conformance recall
59.1% -> 61.2%, first-error agreement 57.5% -> 59.7%, we-accept files
115 -> 108. Gates 34/34, corpus parity 401/401, tree-sitter generate
clean on all 4 affected grammars, gate:treesitter 96.0%.
The 108 remaining accept-divergences split into the [Await]/[Yield] context class (31 files - needs exclude()-style identifier-text context threading in the engine) and 77 per-shape strictness items, each named with its fix recipe (fix + flip-scan FN=0 proof).
…reject ClassMember modeled decorators as a STANDALONE sibling alternative, which tolerated an orphan '@dec' with no member and (together with the modifier-named-field fallback) any decorator/modifier interleaving. Decorators are now a prefix of the member shape ([many(DecoratorExpr), many(Modifier), ...]) in both grammars, with the static-block arm taking the same prefix ('@dec static {}' is parse-clean for tsc - the decorator there is a semantic error only). Cumulative flip-scan with per-flip tsc adjudication: 7 toward tsc, 0 away (the first attempt rejected the decorated static block - tsc accepts it - and the scan caught it). The 'public @dec method()' sub-case still parses through the modifier-named-field fallback; matching tsc's greedy modifier commitment there needs the fallback's bare-name arm split, recorded in the ROADMAP item. Gates 34/34, corpus parity 401/401, tree-sitter generate clean on all 4 affected grammars, gate:treesitter green.
tsc's measured rule: '@' directly after a property on the SAME LINE binds
to that property ('Decorators must precede the name and all keywords of
property declarations') - 'x @dec y()' and 'x = 1 @dec y()' parse-reject,
while 'x; @dec y()' and a newline before '@' accept. Encoded exactly: the
field tails' no-';' ending carries not([sameLine, Decorator]) in both
grammars (alt([';'], [not([sameLine, Decorator])])). This also closes the
'public @dec method()' shape: the bare 'public' field reading now refuses
the same-line decorator, and the modifier reading correctly fails.
not() now accepts an array as a seq, like everywhere else in the rule DSL
(the NotNode conversion previously threw on arrays).
Cumulative flip-scan with per-flip tsc adjudication: 12 toward tsc, 0
away. Gates 34/34, corpus parity 401/401, tree-sitter generate clean x4,
gate:treesitter green.
The windowed-relex resync aligned candidates on kind/text/offset/end but NOT on the token's flags - yet the gap BEFORE the candidate can sit inside the edit: inserting '42' into '}\n privat' leaves every token byte identical from the candidate on while removing its preceding newline. The old token was adopted with a stale newlineBefore, and anything reading the flag downstream (sameLine assertions, comment-aware folds) diverged from a fresh parse. Found by delta-debugging an edit/fresh divergence to a 690-char repro and diffing full streams including flags; the leaf tilings were identical, which is why tree comparisons alone never caught it. The window lex has already computed the candidate's true flags when the resync fires (it lexed the gap), so the fix is one equality in the resync condition: the pushed candidate's flags must match the old token's. A mismatch just keeps lexing - the next candidate's gap lies beyond the edit, so the flags converge and the regrow terminates. Gates: 34/34, lexer parity 5695 diff=0, incremental-grammars 672/672, corpus parity 401/401.
Lands the full measured tsc class-member ruleset (probes 12/12, flip-scan
3-toward/0-away on top of the decorator-prefix + sameLine work already in):
- class-field ASI: a ';'-less field allows only a same-line '}' — 'x y',
'x = 1 y = 2', 'var x = 1;' parse-reject; newline / ';' / '}' accept.
Tail generalized to alt([';'], [not(sameLine)], [not(not('}'))]).
- modifier-vs-name: a modifier keyword followed by '('/'='/':'/';'/'?'/
'!'/'<'/'{'/'}' is the member NAME, not a modifier ('public() {}',
'static = 1', 'public public() {}').
- parse-tolerated member modifiers: declare (real), export/in/out
(semantic errors tsc's parser accepts) — 'export Foo;', 'in a = 0;'.
- accessors take optional type params ('get x<T>()' parses).
- static-block arm takes a modifier prefix ('async static {}').
The blocker was gen-cst-match: it drops parse-time not() guards and emits
GREEDY repeats, so [many(Modifier), 'static', Block] was destructurer-
ambiguous — the modifier-repeat swallowed the 'static' keyword leaf the
literal needed, and every static block failed to match. Fixed at the root:
a greedy loop / non-required optional now leaves at least minKids(suffix)
children for the required steps that follow it (threaded across nesting).
Proven a no-op on the parser's own trees — count + suffix-consumed = cc and
suffix-consumed >= minKids, so the cap cc-minKids never cuts below the
parser's actual count; it only blocks over-consumption a dropped guard used
to prevent. Verified: generated matchers byte-stable on all 7 grammars
before the recipe (cst-match-totality green), total after.
The js/jsx tmLanguage shift (async/accessor between storage.modifier
buckets) is scope-gap-NEUTRAL (95.7% correct / 77.0% exact / +5.1pt gap,
byte-for-byte identical before/after); ts/tsx tmLanguage unchanged.
Error-recovery conformance: recall 61.2% -> 62.4%, first-error 59.7% ->
62.3%, precision 82.7% -> 83.4%, we-accept 103 -> 100. Gates 34/34, corpus
parity 401/401, tree-sitter generate clean x4, gate:treesitter 96.0%.
tsc parses an interface with REPEATED extends clauses
("interface I extends A extends B {}") — the parser accepts them, the
checker reports the duplicate. Mono's single opt('extends', sep(Type,','))
clause rejected the second extends, so the construct only "parsed" by
splitting into garbage statements. many('extends', sep(Type,',')) mirrors
tsc and produces the correct interface-with-heritage tree
(parserInterfaceDeclaration1-4, interfaceThatInheritsFromItself).
Accept-neutral on the corpus (the split path already accepted these),
gates 34/34, corpus parity 401/401, gate:treesitter 96.0%; also a
prerequisite for statement-level ASI (Task #24), which otherwise rejects
these as a mid-line split.
…l augmentation
tsc's parser accepts a leading modifier before any declaration (the checker
rejects invalid combinations); mono only had piecemeal opt('async') before
function and opt('abstract') before class, so "async class C {}" /
"abstract interface I {}" only "parsed" by splitting into garbage
statements. A modifier-prefix arm [alt('async','abstract'), Decl] tried
after the dedicated arms now produces the correct modifier+declaration
tree while leaving valid "async function" / "abstract class" flat.
Also adds the two declare forms mono was missing: ambient module shorthand
"declare module \"foo\";" (no body — the module arm requires braces) and
"declare global { ... }" (global-scope augmentation; global is a
contextual-keyword block, not a namespace name).
Accept-neutral on the corpus (the old split path already accepted these
invalid-but-parseable shapes), gates 34/34, corpus parity 401/401,
gate:treesitter 96.0%. Value is CST correctness for these constructs and
as prerequisites for statement-level ASI (Task #24) — though that lever
remains a large multi-area round (measured whack-a-mole: with these
companions in place, ASI still leaves ~19 distinct tsc-accepted shapes it
breaks across regex/divide, unique-symbol, import-type-args, protected,
comma-operator, etc., so it does not land incrementally).
… false-rejects The modifier-prefix arm accepted only async/abstract before a declaration, so tsc-clean files leading with another modifier on a declaration (protected class, public interface, static interface, accessor class) were outright rejected — not even split-parsed, since protected/public/etc. are not expression starts. tsc's parser accepts any modifier before any declaration (the checker rejects the invalid combination). Widen the prefix to async/abstract/public/private/protected/readonly/static/ override/accessor. Measured over the single-file conformance corpus: false-rejects (tsc-parser-clean files mono throws on) drop from 19 to ZERO — mono now parses every tsc-clean single-file conformance test. Additive and over-accept-neutral: we-accept stays 100, recall 62.4%, gates 34/34, corpus parity 401/401, gate:treesitter 96.0%.
…essors
Two more tsc-clean shapes mono outright rejected (false-rejects):
- "class C { static const H = 1; }" — tsc parses const as a (semantically
invalid) member modifier; add it to the class-member modifier set, where
the not()-followed-by-name-token guard still treats "const = 1" as a
member NAMED const.
- "var v = { get foo() }" — an object-literal accessor with no body parses
in tsc (error recovery); the accessor body becomes opt(Block).
Both additive and over-accept-neutral: compiler-corpus false-rejects drop
28 -> 24, conformance stays 0, we-accept stays 100, recall 62.4%, gates
34/34, corpus parity 401/401, gate:treesitter 96.0%.
tsc parses index signatures more leniently than mono did (the missing
annotations/commas are checker errors): a class index signature without a
value type ("class C { [x: string]; }") and a trailing comma inside the
bracketed params of a class or type-literal index signature ("type A = {
[key: string,]: string }"). Class index-sig value type becomes optional
with an opt(',') param tail; the type-literal index branch gains the same
opt(',').
Additive, over-accept-neutral: compiler-corpus false-rejects 24 -> 21,
conformance stays 0, we-accept 100, gates 34/34, parity 401/401,
gate:treesitter 96.0%.
The ?. continuation accepted member / call / index / template forms but
not a typed call (a?.<T>(args)) — a valid TS optional-chain instantiation
that mono wrongly rejected. Adds the ['<', sep(Type), '>', '(', sep(Expr),
')'] form. Gates 34/34, parity 401/401, conformance FN stays 0, we-accept
100, gate:treesitter 96.0%.
The infrastructure for [Await]/[Yield] context-sensitive parsing via build-time grammar name-forking (workflow-selected approach C, the only one that survives the node-surgery reuse path — context becomes rule identity, which every reuse predicate already keys on, so a cross-family reuse is structurally unrepresentable rather than guarded). - types.ts: an optional ctxMode on the transparent `group` RuleExpr, and a `canon` field on RuleDecl (a fork's base rule for every derived artifact). - api.ts: awaitCtx / yieldCtx / asyncGenCtx / resetCtx combinators — transparent groups carrying ctxMode; every consumer but the fork transform treats them as plain groups, so no generator marker plumbing is needed. - src/await-yield-fork.ts: withAwaitYield(grammar) — marker-driven multi-family rule closure (the reset boundary is explicit via resetCtx, open question #3), clone + per-family ref reroute, reserved-guard variants that forbid the context keyword, forks appended after base rules (rid/entry-last preserved), canon set. NOT wired into any grammar yet (no marker uses) — a pure no-op: gates 34/34, all 7 generated outputs byte-identical. Verified the transform algorithm standalone on synthetic grammars (correct closure, reroute, guard, no dups, per-family). Next: the emitted-parser/cst-match canon plumbing, then route the TS/JS async/generator bodies.
…ist) The emitted parser gains RULE_DISPLAY (RuleDecl.canon ?? name) used by ruleNameOf and the $missing "expected X" message, while RULE_NAMES stays unique for memo/ adoption rule identity and the entry indexOf. The interpreter parser stamps a node's `rule` field with canon ?? name the same way. So a forked rule (Block$A) reports its base name (Block) on the green node — trees byte-identical to the base grammar — while the distinct rule identity drives the memo/adoption key. Identical to RULE_NAMES when no rule is forked: gates 34/34, emit==interp corpus parity 401/401, all generated outputs byte-identical. The derived-artifact generators (AST/TM/tree-sitter/cst-match) need fork handling only once a grammar actually forks; deferred to the wiring step.
emitParser and the interpreter createParser now apply withAwaitYield to their input grammar, so the [Await]/[Yield] forks live ONLY in the parser rule-identity / memo / adoption space. The derived-artifact generators (AST / TM / tree-sitter / cst-match) keep seeing the base grammar with the transparent-group ctx markers and so need no fork handling for their output — the markers are invisible to them. Verified byte-identical on typescript.ts (no ctx markers ⇒ empty closure ⇒ no forks): the emitted parser is diff-identical before/after, gates 34/34, emit≡interp parity 401/401. cst-match's rid-space agreement + the grammar marker wiring come next.
gen-cst-match applies the same withAwaitYield fork so its rule-id space matches the parser's tree, emits matchers/types for BASE rules only (a fork collapses to its base), and canonicalizes the CHILD side of every rule-id check through a RULE_CANON table (__nodeOf and the first-child dispatch switches), so a base matcher accepts a forked child node. RULE_CANON is the identity map without ctx forks: gates 34/34 including cst-match-totality, generated outputs byte-identical. All [Await]/[Yield] fork infrastructure is now in place and proven non-regressive (markers + transform + parser/cst-match canon, every step a verified byte-identical no-op). Remaining: wire the ctx markers into the JS/TS grammar (split async/non-async arms, mark bodies+params and reset boundaries), then gate strict acceptance + the new function<->async-function generative edit-class.
…en end-to-end)
First REAL behavioral use of the fork. javascript.ts splits the arrow arms into
async / non-async so each routes to the right rule family: an async arrow wraps its
params and body in awaitCtx (await is the operator, no identifier reading), a plain
arrow's body in resetCtx (context resets). The reserve is a `reservable` flag on the
notReservedExpr / notReserved guards (they are inline `not(alt(...))`, not rules, so
the earlier rule-fork path could not reach them); withAwaitYield's rewrite extends a
reservable guard with the family's keyword, so `await`/`yield` lose their identifier
reading inside the context. cst-match's MATCHERS_BY_ID maps a fork rid to its base
matcher, and the expected rids in __nodeOf / dispatch switches are canon-baked.
Verified: async (a = await) => 0 REJECTS, async () => await REJECTS, function f(a =
await){} ACCEPTS, async () => await x ACCEPTS, x => x / async (x) => x ACCEPT — exactly
tsc. And the whole thing holds the incremental guarantees: gates 34/34 (incl
incremental-grammars edit≡fresh over JS, cst-match-totality, emit≡interp parity),
all generated outputs byte-identical. The fork preserves window-replay because the
context IS the rule identity. Async functions / methods / generators (and typescript.ts)
are the same pattern, wired next.
typescript.ts mirrors the javascript.ts arrow split + context markers (async arrow
params/body await-context, plain arrow body reset; type params/annotations stay
plain — they are not [Await]-parameterized). `async (a = await): Promise<void> => {}`
now rejects (await needs an operand), while every valid async arrow (`async (x):
Promise<void> => await x`, `async <T>(x: T) => await x`) and non-async default
(`(a = await) => 0`, `function f(a = await){}`) still parses.
Error-recovery conformance: we-accept 100 -> 91 (the async-arrow over-accepts
cleared), recall 62.4% -> 63.3%, first-error 62.3% -> 64.8%, FN stays 0. Gates 34/34,
emit≡interp parity 401/401, byte-identical generated outputs, tree-sitter generate
clean x4, gate:treesitter 96.0%. Async functions / methods / generators next (same
pattern, more productions).
javascript.ts function-expression production 4-way split (plain / generator / async /
async-generator), each routing its params and body to the right [Await]/[Yield] family.
The await family is now correct for function expressions: `async function(){ let await=1 }`
rejects (await reserved), `async function(){ return await x }` and `async function*(){
yield await x }` parse. Valid JS unaffected (parity 0/0/0, 34/34, gate:treesitter 96.0%).
The yield family is routed (generator bodies -> $Y) but not yet fully reserved: `yield`
is a dedicated Expr arm present in every family, so `yield 1` outside a generator and
`function* g(a = yield){}` still over-accept — fixing that needs a family-conditional-arm
mechanism (next).
…bodyless-fallback escape
Two changes land the [Await]/[Yield] context across all function *declarations*
(JS + TS), not just expressions and arrows:
1. fnArms / tsFnArms helpers generate the four async×generator arms (plain / generator
/ async / async-generator) for every `function` form, routing each arm's params and
body to its family. Applied at all six sites (JS function expr/decl/export-default,
TS the same with type params + return type kept plain). `async` is split out of the
Decl modifier-prefix soup with a `not('function')` guard so `async function` must
take the context-bearing arm instead of being re-accepted as a plain function with
a stray `async` modifier.
2. Root-cause fix for a whole class of error-masking: a TS function declaration body
was `alt(Block, opt(';'))`, so when Block failed (e.g. an [Await]-context violation
like `async function g(){ let await=1 }`) the parser fell through to the bodyless
`;` signature form, parsed a zero-body declaration, and re-parsed the `{...}` as a
separate block statement in plain context — silently accepting the error. Guarding
the bodyless form with `not('{')` makes a present `{` commit to the Block body, so a
body parse error stays an error. Overload/ambient signatures (`function f(): T;`)
still parse (no `{`).
Error-recovery conformance: we-accept 91 -> 81, recall 63.30% -> 64.46%, first-error
64.79% -> 67.32%, FN stays 0. Gates 34/34, parity 0/0/0, byte-identical generated
outputs, tree-sitter generate clean x4, gate:treesitter 96.0%.
`x => …`'s shorthand parameter was a bare `Ident`, so a contextual keyword that lexes as an identifier (`await`/`yield`) slipped through as a parameter name even inside an [Await]/[Yield] context — the parenthesized form already routed through `notReserved` via Param, but the shorthand bypassed it. Guard it with `notReserved` (non-async arm, which inherits the enclosing family so `await => …` rejects nested in async params but parses standalone) and `awaitCtx(notReserved, Ident)` (async arm, always [+Await]). This is exactly the nested-arrow-parameter shape the spec calls out: `async function foo(a = await => await)` and `async (a = await => await) =>` now reject (await is the inner arrow's [+Await] parameter), while `await => await` standalone and every ordinary arrow still parse. `async function f(a = yield => yield)` stays accepted (async is not a generator, so yield is a valid identifier there). Error-recovery conformance: we-accept 81 -> 74, recall 64.46% -> 65.83%, first-error 67.32% -> 69.30%, FN stays 0. Gates 34/34, parity 0/0/0, tree-sitter clean x4, 96.0%.
A class static block's statement list is [+Await] per spec (ClassStaticBlockBody :
ClassStaticBlockStatementList[~Yield, +Await, ~Return]), so `await` is reserved inside
it: `static { await; }`, `static { let await = 1; }` now reject, while a static block
with ordinary statements and a nested non-async `function f(await){}` (whose own
parameters reset to no context) still parse. Wraps the static block's Block body in
awaitCtx; decorators/modifiers on the block keep parsing (they are semantic errors).
Error-recovery conformance: we-accept 74 -> 73, recall 65.83% -> 66.67%, first-error
69.30% -> 69.58%, FN stays 0. Gates 34/34, parity 0/0/0, tree-sitter clean x4, 96.0%.
…lenient
tsc's PARSER accepts await/yield as binding identifiers even inside an async/generator
body (`async function f(){ let await = 1 }`, `function* g(){ function yield(){} }`) —
the "reserved word" rule there is a checker diagnostic, not a parse error. Only at
EXPRESSION position does tsc reject, because `await` must be the operator and so needs
an operand (`await;`, `await =>`, `a = await` -> "Expression expected").
The earlier fork made `notReserved` (the binding guard) reservable too, which
false-rejected those lenient bindings. Drop that: only `notReservedExpr` (the
expression identifier-NUD guard) carries the [Await]/[Yield] reservation, and the
single-identifier arrow parameter now guards with `notReservedExpr` so `await => x`
rejects in an await context via the same operator-needs-operand path tsc uses (it
parses the arrow head as an expression first), while `let await`/`var yield`/named
`function yield(){}` parse everywhere.
Bidirectional over the single-file conformance corpus: false-rejects of tsc-accepted
files drop (the await/yield-binding FN, asyncOrYieldAsBindingIdentifier1, is gone);
over-accepts unchanged (they were always expression-position). recovery-conformance
recall 66.35%, first-error 69.58%, we-accept 73. Gates 34/34, parity 0/0/0, 96.0%.
… async
Class members and object-literal properties now route method params/bodies to their
[Await]/[Yield] family instead of leaking the enclosing context: plain methods,
constructors, accessors and field initializers reset (a method body has its OWN,
non-inherited context — the spec's implicit function boundary), generators yield,
async await, async-generators both. A computed key `[e]` stays OUTSIDE the family (it
is evaluated in the enclosing context), so `class C { [await](){} }` inside async still
rejects while the method bodies don't.
`async` is pulled out of the member modifier soup into dedicated arms (the class analog
of the Decl/arrow fix) so the body gets its await context — but tsc parses `async` as an
ORDER-FREE modifier (`async static m`, `override async m`, `async get x`, `async static
{}` all parse, the checker validates), so each async arm carries its own inner
many(Modifier) run and there are async-accessor / async-static-block arms. The `static`
modifier's `not('{')`-style guard keeps `async static {}` parsing the block, not eating
`static` as a modifier.
This closes the class-body context leak: `async function f(){ class C { m(){ await; } } }`
and `{ x = await }` field initializers now parse (method/initializer reset), matching
tsc's parser; over the single-file conformance corpus the await/yield false-rejects are
gone (FN drops to 2 pre-existing externalModules import-feature cases, unrelated). Async
methods reject `await;`/`await =>` like async functions do.
recovery-conformance unchanged at recall 66.35%, first-error 69.58%, we-accept 73 (the
method await cases were never in the single-file set). Gates 34/34, parity 0/0/0,
byte-identical generated outputs, tree-sitter generate clean x4, gate:treesitter 96.0%.
The random mutator only hits an async/generator toggle by luck, yet that edit is the whole reason the context is a build-time name-fork rather than a runtime flag: flipping `async`/`*` on an enclosing function changes its body's RULE IDENTITY (Block -> Block$A/$Y/$AY), and a runtime flag read by core() but absent from the reuse key would let a stale cross-family row survive. This adds a scripted edit class over hand-authored async/generator documents — drop/re-add `async`, drop a generator `*`, edit an async arrow's params, a yield operand, a class method's async/`*` — interleaved with a surgery-path in-body keystroke, asserting each stays edit≡fresh + self-consistent. 706/706 steps equal+consistent across all 7 grammars: the name-fork preserves the window-replay theorem verbatim under exactly the edits it exists to survive.
A `using` / `await using` declaration binds a plain BindingIdentifier, never a pattern. UsingBinding replaces the pattern-allowing Binding/ForBinding in the using arms, so `using [a] = null` falls through to the expression `using[a] = null` — which is exactly how tsc reads sync `using` in statement position (it is a contextual identifier there), so the tree now matches instead of minting a bogus using-declaration with a pattern. The `await using [a]` parse-error tsc reports is NOT cleared by this alone: it is statement-ASI-gated — mono still splits `await using` off `[a] = null` into two statements (the Task #24 gap), so the over-accept stands until the ASI round, which this identifier-only binding is a prerequisite for (the await-using arm must reject the pattern once ASI stops the split). Accept-neutral: recovery-conformance unchanged (we-accept 73, recall 66.35%, first-error 69.58%), 34/34, parity 0/0/0, tree-sitter clean, gate:treesitter 96.0%.
UsingBinding cleared no over-accept — `await using [a] = null` over-accepts via STATEMENT-SPLITTING (the ASI gap, #24): mono splits `await using` off `[a] = null` into two statements regardless of the binding shape (proven: `x using [a]` splits the same way). So an identifier-only using binding only shuffles trees tsc rejects anyway, and it introduced a tree-sitter GLR conflict (`using x: T <` vs a generic type) — 9c04bc0 committed the stale grammar.js because the `tree-sitter generate` failure was swallowed by the `|| echo FAIL` in the gate chain. The identifier-only using binding + an `await using [` ExpressionStatement commit guard are the correct fix, but they only clear the over-accept once ASI stops the split, so they belong with the ASI round (#24), not as a standalone companion that adds a GLR conflict for zero acceptance gain. Restores Binding in the using arms; 34/34, parity 0/0/0, tree-sitter generate clean x4, gate:treesitter 96.0%.
… new false-rejects
The TS statement terminator becomes asi() = alt([';'], [not(sameLine)], [not(not('}'))])
on every Stmt-level arm (var/let/const, return, throw, break, continue, debugger, using,
expression statement): a statement may end only at ';', a line-terminator before the next
token, or a closing '}'. A same-line non-';'/'}' token can no longer terminate it, so the
mid-line splits mono used to accept by exploiting the optional ';' (`var x = a[]` split
into `var x=a` + `[]`) now stay one statement and reject like tsc.
asi alone false-rejects every tsc-clean construct that legitimately continues a statement
without a ';'. A multi-agent workflow mapped the full set (41 single-file conformance
cases) to 11 companions — each a MISSING production asi merely EXPOSED (base only
"accepted" them via the same split it removes), so every fix lives in the arm asi
exposed, never in asi() itself:
- per-specifier `type` modifier on import/export specifiers, with tsc's multi-token
`{ type as as B }` / `{ type type as foo }` disambiguation
- `export type *` / `export type * as ns from` + ModuleExportName namespace alias
- `import type X = require()` (type-only import-equals; two arms so `import type = …`
keeps `type` as the binding name)
- interface heritage via the shared heritageClauses helper (implements / `extends Foo?.Bar`
/ empty `extends {` / self / repeated)
- leading modifier soup before var/let/const/using (mirrors the decorator-prefix arm)
- nested `new new Foo()()` (recursive NewTarget; + ['new_target'] tree-sitter conflict)
- `export as namespace X` + `export default interface`
- `import<T>(...)` instantiation expression
- regex flag tail = maximal-munch IdentifierPart run (tsc lexes flags leniently)
- non-null `!` is a restricted (no-line-break) postfix, like `++`/`--`
- `unique` as a general prefix type operator (`unique <Type>`)
The workflow's const/var->notReservedExpr companion was MEASURED net-negative (it
regresses `for (var of X)` + `[...x = a]`, both tsc-parse-clean) and dropped; its lone
target (importWithTypeArguments) is covered by the import<T> arm instead.
recovery-conformance: we-accept 73 -> 50 (-23 mid-line-split over-accepts), recall
66.35% -> 69.82%, first-error 69.58% -> 74.37% (precision dips 84% -> 67% as mono now
REPORTS errors on the 23 newly-rejected files at a coarser granularity than tsc — the
known recovery-granularity gap, not new false-rejects: bidirectional FN stays 2, both
pre-existing externalModules import-feature cases). recovery.ts VALID fixture swapped
parserRealSource7 (a tsc PARSE-ERROR file that only passed via the split bug) ->
parserRealSource12. Gates 34/34, parity 0/0/0, tree-sitter generate clean x4,
gate:treesitter 96.0%.
…t type-args Two CFG/lexer-landable over-accepts from the 50-file triage (workflow mapped 49 landable / 2 semantic-ceiling): 1. numeric-literal-lex (10 files): a decimal integer part is a single `0` or a `[1-9]`-led run — `0` immediately followed by a digit (legacy octal `0123`, leading-zero `09`) is not a decimal literal. intPart='0' lets the trailing digit trip numericTailGuard so the token fails and the total lexer rejects it (tsc's scanner behavior). fracTail/expTail/ BigInt keep `digits` (leading zeros legal: `0.012`, `1e007`, `0n`); radix tokens untouched. `0`, `0.5`, `0e1`, `1_000`, `0x1f` stay valid. 2. type-arg-sameLine (1 file): generic type-argument application `T<A>` is newline- sensitive — `T\n<A>` rejects, mirroring the existing `[$, sameLine, '[']` / `!` postfix type arms. recovery-conformance: we-accept 50 -> 39, recall 69.82% -> 72.77%, first-error 74.37% -> 77.75%, precision ~stable. Bidirectional FN 0 (handle API). Gates 34/34, parity 0/0/0, tree-sitter generate clean x4, gate:treesitter 96.0%.
…g separator
Four more CFG-landable over-accepts from the 50-file triage:
- `let [` at statement start commits to a LexicalDeclaration (added to the
expression-statement lookahead guard), so a bad `let [...]` head rejects instead of
parsing as `let`-indexed expression.
- `new <T>Foo()` rejects: a `<` may not directly follow `new` (the operand is a
MemberExpression) — `not('<')` on the `new` arms; post-callee `Foo<T>()` type-args stay.
- a labeled-statement / for-binding-property label is `notReserved` (a reserved word can
never be an Identifier-slot label).
- a class index-signature ends with the asi() member terminator (`; / newline / }`),
not a bare optional `;`, so a same-line adjacent member rejects.
(The type-literal member separator was tried in the same asi() shape but REVERTED: it
regresses `var x: { private y: string }` — tsc reads `private y` as two lenient members
with no separator, which requires TypeMember modifier support, a separate change.)
recovery-conformance we-accept 39 -> 36, FN 0 (handle API). Gates 34/34, parity 0/0/0,
tree-sitter generate clean x4.
…ntextual name)
A type-parameter NAME guards through `notReserved, Ident`. `in` LEXES as an Ident, so an
un-guarded Ident wrongly accepted it as the name — but `in` is a reserved word there:
tsc rejects `<in>` / `<in in>` / `<out in>` / `<in = any>` ("'in' is a reserved word that
cannot be used here") while accepting `<out>` / `<out out>` / `<in out>` (out is a
contextual keyword, a valid name) and every modifier use (`<in T>` / `<in out T>` /
`<const T>` — `in` stays a variance modifier). Guards all three TypeParam arms (the
modifier-soup arm's name too, since `many(mod)` greedily eats trailing `in`s).
test/refactor-guard.ts had codified the old over-accept: its SHOULD-PASS `tp name-in
default` = `interface I<in = any> {}` is a tsc PARSE ERROR — corrected to the valid
`out` analog `interface I<out = any> {}`.
recovery-conformance we-accept 36 -> 35, FN 0. incremental-grammars 706/706 (the tripwire
that rejected the super-primary attempt — this one keeps edit≡fresh). Gates 34/34, refactor
-guard 112/112, tree-sitter generate clean x4.
`npm run check` ran its 35 gates strictly serially (execFileSync in a for-loop), so the wall-clock was the SUM of every gate. Each gate is an independent subprocess that emits its own parser and reads its own corpus, sharing no mutable state and writing DISTINCT /tmp/emitted-*.mjs files — so they parallelize safely. A (cpus-2)-wide worker pool turns the wall-clock into ~max(sum/pool, slowest-gate): measured 19.4s (was minutes), now bound by the single slowest gate (exhaustive-edits ~18s). Results stream as each finishes; the final pass/fail summary prints in gate order and the exit code is unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
parse/editon the handle API are now total: every input produces a tree plus acst.errorsfield ([{ offset, end, message }], empty when clean) — never a throw. API misuse (no changes, out-of-range coordinates, foreign/stale handles) still throws; the module-level strict API is untouched, so the whole conformance/parity gate chain runs unchanged.Design
$errorrow (its leaves keep the text-tiling invariant) up to the element's FIRST set, the enclosing seq's follower literal, or EOF. Expression-internal hooks were measured to cascade (273 errors from one broken identifier) and are deliberately excluded.pos ≤ bar ≤ maxPos ≤ bar+2, stateless so losing longest-match arms cannot consume bars); a failure past the bars aborts the attempt, mints the next bar, and the pass re-runs — adoption keeps re-runs cheap, and the runParse safety net obeys the same gate.$errorrows are found by descending the structurally-propagatedrowRMspine of the final tree (per-pass candidate lists double-count under stateless re-adoption); lexer diagnostics persist as structured entries formatted at settle time with current offsets, maintained by the window splice.'>'splits disable adoption for the rest of the parse (the frozen damage mapping is invalid after a mid-parse index shift).A latent strict-mode hole, found and closed
The last equivalence divergence traced to a watermark contract violation that predates this PR: a Pratt rule's winning row is built before its failed LED extension arms run, so the row's stored lookahead extent under-records the rule's true probes — the memo watermark was always right, but the memo dies with its generation and adoption reads the row. A later edit inside a failed arm's reads could keep a stale row alive (typing the
)that turns a failed call arm into a successful one). Strict sessions never hit it because the triggering texts reject, and the reject was the firewall — total parsing removed the firewall and exposed it. Fix: the rule-level watermark is written back to the row at memo-store time.Gates (32/32)
test/recovery.ts(new): valid corpus ≡ strict byte-identical with empty errors; invalid corpus total + deterministic + spans-in-bounds; a char-by-char typing session through transiently-invalid states — every keystroke ≡ a fresh parse, tree and errors.test/incremental-verify.tsreworked to total semantics: every seeded step (syntax-breaking included) compares tree + errors against a fresh recovering parse — 128 steps, 0 mismatches.test/multi-doc.tsreworked: 60 interleaved steps across two documents including broken text, plus the 9-point handle contract.Numbers
Valid-state keystrokes are unchanged (9MB median 0.05ms warm). Broken-state keystrokes are incremental too (third follow-up commit): ~3–7ms on 9MB, ~100× over the interim full-re-parse design — see the recovering-adoption section below. Break/fix transitions still cost ~0.4s (the absorbed error region re-parses; recorded follow-up).
Closes #39.
Missing-token synthesis (follow-up commit)
Required token matchers in recovering mode synthesize a zero-width
$missingleaf instead of failing, tsc-style:const x = f(1, 2;keeps its Call structure and reportsexpected ')';function g() { return 1;closes the body and reportsexpected '}'. Synthesis is budget-free and position-pure — it fires iff a recovery bar lies in[pos, pos+2], never under probes (not()/optional/separator), never in free-fire — so it is a pure function of (text, bars) and cannot desync the two engines.Two structural consequences, both gated:
pos === beforealone, hooked reps discard + resync, leftRec continuations and Pratt LEDs refuse zero-width wraps. Same-position re-entry of a rule through a synthesized leading token is an unbounded recursion no grammar shape rules out — recovering runs keep a(rule, pos)in-progress set and fail the re-entry (PEG cycle semantics, zero strict-path cost). The sentinel also dissolved the bar+1ladders the recursion crashes were minting: broken-doc recovery in the incremental gate dropped 10.7s → 1.2s.maxPos; the next bar must be a function of (text, bars), not of what got adopted). Three fixes:frameMax, a frame-local advance watermark, makes recorded probe reaches (rowExt/memo watermarks) exact instead of contaminated by earlier-sibling probes — closing the recorded "exact per-frame extents" item at one extra compare on frontier breaches; recovery runs are adoption-free (attempt loop and the lex-recovered first run — a recovering row's reach is bar-history-dependent, so replaying it poisons the next bar); surgery refuses recovery-made trees (a strict splice into kept$error/$missingsiblings was a fake strict success that froze the old text's recovery shape, shifted).New gate:
test/incremental-grammars.ts— all 7 grammars × generative inputs × seeded edit sessions, every stepedit ≡ fresh(tree + errors), self-consistency walk, totality: 672/672.test/recovery.tsgains exact-diagnostic synthesis pins;test/incremental-verify.tsgains the five protocol-pin GLUE pairs. Suite 33/33.Missing-nonterminal synthesis (second follow-up commit)
$missinggeneralizes from tokens to RULES — the tsc "Expression expected" analog:const a = ;→expected Expr(one diagnostic, right offset), same for operator rhs (a + ;,-;), mixfix arms (x ? y : ;), and list elements after a real separator (a, ;,f(1, ;). Hooks at parseRuleEntry's fail exit + the three Pratt rhs sites that bypass rule entries.Placement follows commitment semantics: an optional group / repetition element fails freely while uncommitted (its absence is not a diagnostic), but once it consumes a real token, missing pieces synthesize —
opt('=', Expr)commits at the real=;rep(seq(',', Expr))cannot mint a phantom,but synthesizes the Expr after a real one.not()and separator probes stay absolutely suppressed; FIRST-token call-site guards open under recovering so the rule-entry hook is reachable.The new shapes surfaced two latent bugs, both fixed: the previous commit's frameMax conversion had been double-applied at the 12 token-advance sites by a patch-script composition hole (token consumes never raised maxPos; both engines ran the same wrong bar protocol, so equivalence gates stayed green — a reminder that edit≡fresh cannot see consistently-wrong protocol inputs), and the memo-jump coordinate refresh read
toff(tokN)past the token columns for zero-width rows minted at EOF (stale slots under handle reuse → anexpected Exprat offset 8 in a 5-char document).Synthesis quality pins: 3 → 9 exact-match cases in
recovery.ts. Suite 33/33; 9MB: fresh 438ms, valid keystroke ~0.6–5ms warm, breaking 649ms, while-broken 438ms, fixing 368ms.Recovering adoption under bar purity (third follow-up commit)
Typing in a broken 9MB document: ~440ms → ~3–7ms per keystroke. Recovery runs adopt rows from the previous tree again, made sound by two steps:
recoverArmednow takes the failing element's own frame-local probe reach (stagedframeMax) instead of reading the globalmaxPos— a global frontier parked on a far bar can no longer arm an unrelated loop. With that, every recovery decision (hook arming, token/nonterminal synthesis, the cycle sentinel) is a function of the row's window.[start, reach+2]are identical (shifted) to the bars the build run saw there (lastBarsrides the document register set; strict trees carry[], free-fire treesnull). Window text + window bars then determine the frame's behavior completely — including losing-arm fires and synthesis — so$error-containing rows adopt too (the error region is exactly what stays stable across far edits).The changed fire pattern exposed a latent bug in committed code, now fixed:
collectErrRowsdecoded a$errorrow's first kid as a token leaf unconditionally, but the runParse leftover net builds a wrapper$errorwhose kids are nodes — the message quoted text from a garbage offset, and the two text layers resolved it differently (equal trees, different messages — exactly the kind of impurity only the equivalence gate's error-field comparison can catch).Gates: 672/672 cross-grammar, incremental gate 9.9× vs fresh on mixed sessions, suite 33/33.
The transition cliff, killed (fourth–sixth commits)
The recorded ~0.4s break/fix transitions decomposed into an attempt-re-parse cost and a lexer relex-to-EOF, each closed by a proven mechanism:
recRunningnow carries entry serials, and a refusal leaning on a frame entered before the current one taints its memo entry (own-generation-only, taint propagating on reuse). This is precisely the hole that sank the first survival attempt.trySurgerynow accepts recovery-made trees when the damage and the re-parsed span sit clear of every bar window — the splice then provably commutes with every recovery decision (attempt bar lists are prefixes of the final list, so one check covers all attempts). The suffix bars shift with the splice instead of being cleared.9MB transitions: breaking 335ms → ~6ms, fixing 230ms → ~3ms; valid keystroke ~1ms; while-broken typing at valid-path parity.
perf-benchworst 472ms (was 802ms).Diagnostics: viable sets + paired openers (seventh commit)
expected ',' or ']'— companion literals are provably still accepted at the failure position (repetitions before the failing literal are always re-enterable; one-shot optionals are crossed but contribute nothing, since they may already have consumed). A static FIRST union would name impossible continuations (after[1, 2an expression is not viable); tsc under-reports the same position as')' expected.to match this '('— per-closer structural openers derived by intersecting preceding-literal sets across all seq occurrences ()←(,]←[,while←dofall out; no bracket list is hardcoded), attached as arelated: { offset, end, message }field pointing at the opener leaf.Measured against tsc and tree-sitter (eighth commit)
test/head-to-head.ts, one 9MB document, identical single-character edit scripts (README has the full table): Monogram beats tsc on every phase (valid keystroke 0.37ms vs 37ms; while-broken 0.21ms vs 13.6ms) and beats or matches tree-sitter everywhere except the two break/fix transition edits — profiling attributes those almost entirely to lexer-layer suffix bookkeeping on the bench's first-touch 4.5MB cursor jump (one-time suffix-min allocation + EOF-relative re-basing); the strict-fail pass measures 0.35ms, the attempts 0.6ms, and repeated transitions at one cursor position settle to ~2ms.test/recovery-conformance.tsscores our diagnostics against tsc's parser on every conformance file it rejects: recall 59.1%, precision 82.4%.Two syntactic over-accepts, fixed (ninth commit)
Unterminated templates (
`tpl ${x;) parsed clean — a template head now commits to its middle/tail chain in both engines; and a bare reservedcaseslid through the identifier-expression fallback, lettingswitch (x) { case 1 y(); }parse as three statements. A full accept/reject flip scan over the conformance corpus shows exactly one flip: an intentionally-invalid error test now correctly rejects.The formal spine + a complete gate (tenth commit)
TOTAL-PARSING.mdwrites up the contract, the bar-determinism theorem, the three structural theorems, the window-replay theorem with its three corollaries, the lexer-resync soundness conditions, and the one known open caveat (row-level taint).test/exhaustive-edits.ts(CI, suite 34/34) checks edit ≡ fresh completely within a bound: every document ≤ 4 chars over a small bracket grammar × every single-character edit (~330k steps; the 3.2M-step depth-5 run is also clean). It caught a real one-case regression in the surgery length update on its first run.Ceiling round (eleventh–fourteenth commits)
rowRMis now bitwise — bit 1 keeps the structural error containment the diagnostics walk descends, bit 2 marks a context-tainted result (a frame whose parse leaned on the cycle sentinel finding an ancestor). Tainted rows refuse recovering adoption and run extension; the one open caveat inTOTAL-PARSING.mdis gone.)-pops that found an empty paren stack (almost always empty); the dominant statement-boundary resync case then needs one end-of-list check instead of an O(suffix) minimum build. Steady break/fix transitions settle at ~2 ms (strict-fail pass 0.23 ms, attempts 0.46 ms; the raw 7-column suffix memmove measures 0.07 ms — no storage floor). The head-to-head's 12 ms transition rows measure a one-time 4.5 MB cursor jump (the EOF-relative bias boundary moves with the cursor — exactly the design that makes valid keystrokes 0.37 ms); the docs now attribute this correctly.if/for/return/… parsed as identifier expressions, lettingnamespace if {}fall apart into accepted identifier statements. Blockingforexposed a real masked gap —for (a in b, c)had been parsing as a call of the identifierfor; the for-in object now takes a full comma Expression while for-of correctly keeps one AssignmentExpression. Every accept/reject flip across the conformance corpus was individually adjudicated against tsc: 7 flips, all toward tsc, zero away ('var'and'extends'were caught by this scan leaning on tsc-accepted shapes and stay out). Error-report recall vs tsc's parser: 55.7% → 61.2%; the remaining 108 divergent files are enumerated in the ROADMAP (31 = the[Await]/[Yield]context class, 77 = named per-shape strictness items).