feat(webapp): mollifier API GET read-fallback — synthetic primitives + route wiring#3755
feat(webapp): mollifier API GET read-fallback — synthetic primitives + route wiring#3755d-cs wants to merge 34 commits into
Conversation
🦋 Changeset detectedLatest commit: 645b231 The changes in this PR will be included in the next version bump. This PR includes changesets to release 32 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
Note Currently processing new changes in this PR. This may take a few minutes, please wait... ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (36)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
31f4726 to
b05929b
Compare
1838229 to
af0aeeb
Compare
b05929b to
b89da52
Compare
0919f7a to
f36c576
Compare
74fdf6d to
c6fa61f
Compare
047b240 to
e57bc5e
Compare
Five Devin findings on PR #3755: - ApiRetrieveRunPresenter.synthesiseFoundRunFromBuffer was assigning buffered.friendlyId to the FoundRun.id field. The id column on PG is the internal cuid; downstream log correlation reads taskRun.id and was getting the friendly token instead. Fixed to read the cuid that readFallback already derives via RunId.fromFriendlyId. - api.v1.runs.$runId.trace.ts buffered branch hardcoded isPartial: true. Cancelled is a terminal state — the sibling spans route and syntheticTrace already gate this on !isCancelled. Match. - synthesisePayload helper replaces the silent typeof === "string" coercion. Object-shaped payloads now JSON- stringify (matching how the trigger path would serialise them) with a warn log so format drift is visible. Truly unserialisable inputs fall back to "" with an error log instead of silently dropping. - syntheticRedirectInfo now uses deserialiseMollifierSnapshot (the webapp-side wrapper) instead of deserialiseSnapshot from the redis-worker package directly. Both share the same implementation today, but pinning the wrapper means the two read-side modules can't drift if the snapshot encoding ever changes (e.g. msgpack). - attempts route loader verifies the run belongs to the authenticated environment (PG-first, buffer fallback) before returning the parity-empty list. Other run-scoped endpoints (spans, trace, retrieve) 404 cross-env; matching that closes the exists-vs-doesn't-exist side channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
242ba73 to
6a8404d
Compare
Five Devin findings on PR #3755: - ApiRetrieveRunPresenter.synthesiseFoundRunFromBuffer was assigning buffered.friendlyId to the FoundRun.id field. The id column on PG is the internal cuid; downstream log correlation reads taskRun.id and was getting the friendly token instead. Fixed to read the cuid that readFallback already derives via RunId.fromFriendlyId. - api.v1.runs.$runId.trace.ts buffered branch hardcoded isPartial: true. Cancelled is a terminal state — the sibling spans route and syntheticTrace already gate this on !isCancelled. Match. - synthesisePayload helper replaces the silent typeof === "string" coercion. Object-shaped payloads now JSON- stringify (matching how the trigger path would serialise them) with a warn log so format drift is visible. Truly unserialisable inputs fall back to "" with an error log instead of silently dropping. - syntheticRedirectInfo now uses deserialiseMollifierSnapshot (the webapp-side wrapper) instead of deserialiseSnapshot from the redis-worker package directly. Both share the same implementation today, but pinning the wrapper means the two read-side modules can't drift if the snapshot encoding ever changes (e.g. msgpack). - attempts route loader verifies the run belongs to the authenticated environment (PG-first, buffer fallback) before returning the parity-empty list. Other run-scoped endpoints (spans, trace, retrieve) 404 cross-env; matching that closes the exists-vs-doesn't-exist side channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f4131eb to
d8f6cf7
Compare
6a8404d to
bc9f4e2
Compare
Five Devin findings on PR #3755: - ApiRetrieveRunPresenter.synthesiseFoundRunFromBuffer was assigning buffered.friendlyId to the FoundRun.id field. The id column on PG is the internal cuid; downstream log correlation reads taskRun.id and was getting the friendly token instead. Fixed to read the cuid that readFallback already derives via RunId.fromFriendlyId. - api.v1.runs.$runId.trace.ts buffered branch hardcoded isPartial: true. Cancelled is a terminal state — the sibling spans route and syntheticTrace already gate this on !isCancelled. Match. - synthesisePayload helper replaces the silent typeof === "string" coercion. Object-shaped payloads now JSON- stringify (matching how the trigger path would serialise them) with a warn log so format drift is visible. Truly unserialisable inputs fall back to "" with an error log instead of silently dropping. - syntheticRedirectInfo now uses deserialiseMollifierSnapshot (the webapp-side wrapper) instead of deserialiseSnapshot from the redis-worker package directly. Both share the same implementation today, but pinning the wrapper means the two read-side modules can't drift if the snapshot encoding ever changes (e.g. msgpack). - attempts route loader verifies the run belongs to the authenticated environment (PG-first, buffer fallback) before returning the parity-empty list. Other run-scoped endpoints (spans, trace, retrieve) 404 cross-env; matching that closes the exists-vs-doesn't-exist side channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d8f6cf7 to
796a2c0
Compare
bc9f4e2 to
637e8c0
Compare
Five Devin findings on PR #3755: - ApiRetrieveRunPresenter.synthesiseFoundRunFromBuffer was assigning buffered.friendlyId to the FoundRun.id field. The id column on PG is the internal cuid; downstream log correlation reads taskRun.id and was getting the friendly token instead. Fixed to read the cuid that readFallback already derives via RunId.fromFriendlyId. - api.v1.runs.$runId.trace.ts buffered branch hardcoded isPartial: true. Cancelled is a terminal state — the sibling spans route and syntheticTrace already gate this on !isCancelled. Match. - synthesisePayload helper replaces the silent typeof === "string" coercion. Object-shaped payloads now JSON- stringify (matching how the trigger path would serialise them) with a warn log so format drift is visible. Truly unserialisable inputs fall back to "" with an error log instead of silently dropping. - syntheticRedirectInfo now uses deserialiseMollifierSnapshot (the webapp-side wrapper) instead of deserialiseSnapshot from the redis-worker package directly. Both share the same implementation today, but pinning the wrapper means the two read-side modules can't drift if the snapshot encoding ever changes (e.g. msgpack). - attempts route loader verifies the run belongs to the authenticated environment (PG-first, buffer fallback) before returning the parity-empty list. Other run-scoped endpoints (spans, trace, retrieve) 404 cross-env; matching that closes the exists-vs-doesn't-exist side channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
796a2c0 to
b139391
Compare
637e8c0 to
65219db
Compare
b139391 to
d153042
Compare
65219db to
ccdcd9c
Compare
Five Devin findings on PR #3755: - ApiRetrieveRunPresenter.synthesiseFoundRunFromBuffer was assigning buffered.friendlyId to the FoundRun.id field. The id column on PG is the internal cuid; downstream log correlation reads taskRun.id and was getting the friendly token instead. Fixed to read the cuid that readFallback already derives via RunId.fromFriendlyId. - api.v1.runs.$runId.trace.ts buffered branch hardcoded isPartial: true. Cancelled is a terminal state — the sibling spans route and syntheticTrace already gate this on !isCancelled. Match. - synthesisePayload helper replaces the silent typeof === "string" coercion. Object-shaped payloads now JSON- stringify (matching how the trigger path would serialise them) with a warn log so format drift is visible. Truly unserialisable inputs fall back to "" with an error log instead of silently dropping. - syntheticRedirectInfo now uses deserialiseMollifierSnapshot (the webapp-side wrapper) instead of deserialiseSnapshot from the redis-worker package directly. Both share the same implementation today, but pinning the wrapper means the two read-side modules can't drift if the snapshot encoding ever changes (e.g. msgpack). - attempts route loader verifies the run belongs to the authenticated environment (PG-first, buffer fallback) before returning the parity-empty list. Other run-scoped endpoints (spans, trace, retrieve) 404 cross-env; matching that closes the exists-vs-doesn't-exist side channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0753300 to
a1e1ad8
Compare
…tage in mollifier drainer The mollifier drainer's cancel bifurcation called engine.createCancelledRun without handling its documented conflict contract: when the normal trigger replay path races ahead and materialises a live (non-CANCELED) row, the engine throws a conflict so the caller can "decide between engine.cancelRun() and skipping". The handler did neither — the conflict propagated, isRetryablePgError returned false, and the drainer buffer.fail()'d the entry, silently losing the cancellation while the run kept executing. Now route conflicts to engine.cancelRun() so the cancel actually wins. Separately, when engine.trigger fails non-retryably and the SYSTEM_FAILURE fallback write then fails because PG is transiently unreachable, rethrowing the original non-retryable error made the drainer buffer.fail() the entry — losing the run with no PG row ever landing, and dropping the write error entirely. Rethrow the retryable write error instead so the drainer requeues; the failure row lands once PG recovers. Non-retryable write failures still rethrow the original error as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove plan-tracking shorthand (Q# bifurcation, Phase C1/Q4) from replay-layer mollifier comments and test names; reword to plain English. Comment/test-name only; no behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…labels The mollifier queue moved from ZSET to LIST in an earlier refactor, but two comments still described it as a ZSET (`mollifierStaleSweep.server.ts:9` and `env.server.ts:1104` — both narrating the periodic stale sweep). Update to LIST. Also clean four residual internal plan-doc labels left over from prior cleanup passes: - `createCancelledRun` docstring (`engine/index.ts`) referenced "Q4 mollifier- cancel design" and "F4 bypass" — both dead nomenclature now that the gate's C1/C3/F4 labels have been rewritten. Restate the waitpoint-skip rationale in plain English: the mollifier gate refuses to buffer triggerAndWait children, so a cancelled buffered run never has a waiting parent to unblock. - `createCancelledRun.test.ts` empty-tags regression dropped "Found while running the Phase F challenge suite." — the comment describes the bug itself, which is self-contained. - `mollifierStaleSweep.test.ts` "scans across multiple orgs" rephrased away from "Phase-3 design has org-level fairness"; the prose now states the invariant directly. Comment/docstring-only; no behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous sweep was unbounded along two dimensions: every tick walked
every org and every env (via buffer.listOrgs + listEnvsForOrg). At the
sweep's default per-env entry cap of 1000, an incident-scale fan-out
gave O(orgs * envs * 1000) Redis round-trips per tick — running far
longer than the 5-minute interval and triggering the inFlight guard
to drop every subsequent tick until the slow pass finished.
Shard the work via a durable cursor:
- New file `mollifierStaleSweepState.server.ts` owns three Redis keys
(`mollifier:stale_sweep:{cursor,org_list,counts}`), all under the
mollifier namespace but separated from the buffer's own state. The
state class has its own Redis client; the buffer's existing
`MollifierBuffer` API surface is untouched.
- On `cursor === 0` the org list is rebuilt by snapshotting
`buffer.listOrgs()` into the frozen LIST — the cycle's frozen view.
- Each tick consumes up to `maxOrgsPerPass` orgs (default 100),
processes them, and advances the cursor. When the cursor reaches the
end of the LIST it wraps to 0; the next tick rebuilds and starts the
next cycle.
- The per-env counts HASH is the source of truth for the gauge
snapshot. Visiting an env with zero stale entries HDEL's its hash
field — gauge clears immediately on revisit. Envs not revisited this
tick keep their last-known value (durability across ticks AND across
webapp restarts), accepting a worst-case lag of one full cursor cycle
before a no-longer-stale env clears.
Snapshot contract change: only envs with non-zero stale counts appear
in the reported `Map`. The telemetry layer (`mollifierTelemetry.server.ts`
`reportStaleEntrySnapshot`) sums values, so absence is equivalent to
zero for the gauge — the alert behaviour is unchanged.
Tests:
- New: "shards work across ticks: cursor advances by maxOrgsPerPass and
wraps after a full cycle" — drives a 5-org fixture with cap=2,
asserts the cursor's three-tick progression and wrap.
- New: "clears an env from the durable snapshot on revisit when it has
entries but none currently stale" — same entry flips
stale→not-stale by varying the sweep's `now`, asserts HDEL on
revisit.
- Existing tests updated to inject `state`; one assertion shape
rewritten ("snapshot omits envs that have entries but none stale")
to match the new HDEL semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five gaps in the sharded sweep's test coverage, added 9 new tests covering each: - State durability across process restart: state1 populates the cursor + counts hash, closes its Redis client (simulated webapp restart), state2 constructs against the same Redis and picks up exactly where state1 left off. This is the headline benefit of storing sweep state in Redis instead of process memory; without this test it could silently regress. - Cycle wrap rebuilds the org list: a third org joins between cycles and is visible only in the next cycle's snapshot. Pins the rebuildOrgList-on-cursor=0 contract. - Empty buffer (no orgs) advances cleanly with zero work, empty snapshot, cursor stays at 0 instead of tripping the wrap math. - Buffer-null branch's clearAll: previously asserted only "snapshot is empty"; now also asserts the durable state was actually wiped (cursor=0, counts hash empty) so a re-enable doesn't resume on a stale cursor. - MollifierStaleSweepState direct unit tests (5 tests): readCursor default + corrupt-value tolerance, writeCursor round-trip, rebuildOrgList replace-not-append semantics, setEnvStaleCount HSET vs HDEL, clearAll DELs all three keys. Suite total: 7 existing + 9 new = 16 tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on terminal failure Two Devin review findings on PR #3754, both real and unresolved: 1. Sharded stale sweep's counts hash never cleared for fully-drained envs — gauge stayed permanently elevated, false-alerting the recommended `> 0 for 5m` rule. Root cause: when an env's last buffered entry is popped, the buffer's atomic Lua removes the env from `mollifier:org-envs:${orgId}` (and removes the org from `mollifier:orgs` if it has no other envs). The sweep's inner loop walks `buffer.listEnvsForOrg(orgId)`, so the env disappears from the iteration entirely — `setEnvStaleCount(envId, 0)` (which HDELs the field) is never called, and the counts hash retains the env's last-known stale count forever. Fix (Devin's Approach 2): cycle-bounded reconciliation. Add a Redis SET `mollifier:stale_sweep:visited` that the sweep SADDs into for every env it touches. When the cursor wraps (cycle complete), `reconcileVisited()` does `HKEYS counts → SMEMBERS visited → HDEL the difference → DEL visited`. Pipelined; orphans clear within at most one full cursor cycle of the env going quiet, which matches the sharding contract's existing one-cycle freshness window. Test: "evicts fully-drained envs from the counts hash at cycle wrap" — accepts an entry, sweep flags it stale, pops the entry (env vanishes from listEnvsForOrg), runs another sweep that triggers wrap, asserts the env is HDEL'd from both the snapshot and the underlying counts hash. 2. Drainer handler's terminal SYSTEM_FAILURE write dropped the snapshot's `batch` field. If the buffered run was part of a batch, the failure row wasn't associated with the batch and the batch parent's completion tracking could hang indefinitely waiting on a child that landed but isn't visible to the batch. Fix: extract `snapshot.batch` with structural type guards and pass it through to `createFailedTaskRun`. Same defensive pattern as the other snapshot fields in this code path (the snapshot is typed `Record<string, unknown>` because it came from cjson-decoded buffer payload). Test: "propagates the batch association into createFailedTaskRun" — asserts the call site receives `{ id, index }` from the snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Redis Devin review on PR #3754: `stop()` previously called `deps.state.close()` immediately after `clearInterval`, but `tick()` only checks `stopped` at entry. A tick that was already past that guard would keep making `state.*` calls against an ioredis client that `stop()` had already `quit()`ed — those calls would throw, the tick's own try/catch would swallow them as `mollifier.stale_sweep.failed` warnings, and every graceful shutdown would emit spurious noise. Track the current tick promise as `currentTick`. `stop()` awaits it (if present) before invoking `state.close()`, so the tick's last state call lands BEFORE the Redis client is quit. The tick's own try/catch handles the (unexpected) case where it rejects; the await in `stop()` is solely for ordering. Also drop the `instanceof MollifierStaleSweepState` guard around `state.close()` — `close()` is part of the `StaleSweepStateStore` contract, so unconditional invocation is correct. Test fake states implement `close()` as a no-op. Test: `stop() waits for an in-flight tick to finish before closing the state` — gates a fake state's `readCursor` on a Deferred, kicks off the interval, waits for the tick to start, then races `stop()` against the gate. Asserts the stop promise stays unresolved while the tick is mid-flight and that the tick's final state operation lands before `close()`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three unresolved Devin threads, all addressed in this commit. Committed
locally only — not pushed.
1. `callWithoutTraceEvents` was inheriting the new `emitRunFailedEvent`
default of `true`, so its `createFailedTaskRun` call would fire the
`runFailed` bus emit and the listener would write a ClickHouse
completion event row with empty `traceId`/`spanId` — orphan row,
directly contradicting the method's "without trace events" contract.
Pass `emitRunFailedEvent: false` and enqueue
`PerformTaskRunAlertsService` directly, mirroring the `call()`
pattern so customers' ERROR channels still see the failure.
2. The cjson empty-tags defense lived only on `createCancelledRun`, not
on `engine.trigger`. When the mollifier buffer's mutate-side Lua
re-serialises a payload (e.g. `append_tags` on a buffered run that
never had tags), an empty Lua table encodes as `{}` and decodes
back to a JS object — and the previous `tags.length === 0` check
passes that object straight to Prisma's `String[]` column.
Mirror the same `Array.isArray && tags.length > 0 ? tags : undefined`
guard `createCancelledRun` already uses. The defense is symmetric
with the existing tested case for createCancelledRun, so the same
contract holds for the trigger replay path.
3. `runCancelled` handler's `cancelRunEvent` lookup fails for
buffered-only runs (no primary trace event exists, since the
mollifier gate skipped `repository.traceEvent` for the
not-yet-materialised run). The handler's `tryCatch` swallowed the
error, but the systematic `[runCancelled] Failed to cancel run
event` log fired on every cancelled buffered run.
Add `emitRunCancelledEvent: boolean = true` to `createCancelledRun`
(symmetric with the existing `emitRunFailedEvent` flag on
`createFailedTaskRun`); drainer handler passes `false`. CANCELED PG
row still writes; only the trace-event mirror is skipped.
Tests:
- `RunEngine.createCancelledRun > emitRunCancelledEvent: false
suppresses the bus emit but still writes the CANCELED PG row` —
pins the new flag's semantics.
- `createDrainerHandler > calls createCancelledRun with
emitRunCancelledEvent: false (suppresses orphan trace-event log
noise)` — pins the call site's contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts exhaustion Before: when engine.trigger() threw a retryable PG error (P1001 / P2024 / ECONNRESET) and the drainer exhausted TRIGGER_MOLLIFIER_DRAIN_MAX_ATTEMPTS (default 3), MollifierDrainer.processEntry called buffer.fail() directly, which atomically marks the entry FAILED and DELs the hash. No PG row was ever written, so the customer's run silently disappeared. This fix introduces a new optional `onTerminalFailure` callback on MollifierDrainerOptions that fires before buffer.fail() on any terminal path. The webapp wires it to a wrapper around the existing createFailedTaskRun(...) snapshot block from the non-retryable path — factored into a shared writeMollifierTerminalFailureRow helper so both paths land an identical SYSTEM_FAILURE row. Behaviour: - Non-retryable error: handler writes the row inline (unchanged), then returns normally. Drainer ack()s. - Retryable error, attempts left: drainer requeues (unchanged). - Retryable error, attempts exhausted: drainer invokes onTerminalFailure with cause: "max-attempts-exhausted"; callback writes the SYSTEM_FAILURE row, then drainer calls buffer.fail(). - Callback itself throws retryable: drainer requeues so the next tick gets another shot at the PG write. - Callback throws non-retryable: log and fall through to buffer.fail so a genuinely malformed snapshot doesn't loop forever. Also corrects the stale Lua comment in buffer.ts's failMollifierEntry script that claimed the drainer "has already written a SYSTEM_FAILURE PG row for terminal failures" — only true for the non-retryable path before this change. Includes 5 unit tests in drainer.test.ts covering max-attempts-exhausted, non-retryable, callback-retryable-rerequeue, callback-non-retryable-falls-through, and no-callback back-compat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…failures `writeMollifierTerminalFailureRow` was calling `createFailedTaskRun` without `emitRunFailedEvent: false`. The engine's `runFailed` event handler then called `completeFailedRunEvent`, which looks up the run's primary trace span. Buffered-only runs never had a primary trace event written for them — the mollifier gate intercepts BEFORE `repository.traceEvent` runs — so the lookup always failed, producing a systematic `[runFailed] Failed to complete failed run event` error log per drainer terminal failure. The errors are caught by the handler's `tryCatch`, but the per-failure noise pollutes Sentry and masks real errors. Mirrors what `TriggerFailedTaskService` already does at `triggerFailedTask.server.ts:212` and `:324`: - Pass `emitRunFailedEvent: false` to suppress the trace-event mirror. - Manually enqueue `PerformTaskRunAlertsService.enqueue(run.id)` so the customer's ERROR channel still fires. Best-effort, logged on failure but does not bubble (the SYSTEM_FAILURE row landing is the load-bearing customer-visible outcome). Also tightens the helper's return type from `boolean` to `TaskRun | null` so the alert side has the run id without an extra lookup. Both existing call sites (inline non-retryable + max-attempts-exhausted callback) keep working — the `if (!wrote)` truthy check is identical under `TaskRun | null`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RE fallback on cancel-bifurcation Two fixes from CodeRabbit on PR #3754: 1. `TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED` was inheriting `TRIGGER_MOLLIFIER_ENABLED`, so any deployment already running the mollifier would auto-start the sweep worker on upgrade — defeating the point of having a separate kill switch. Hard-default to "0" so the sweep is an explicit ops decision. 2. In the cancel-bifurcation path, a non-conflict + non-retryable error from `createCancelledRun` rethrew out of the handler. The drainer's `onTerminalFailure` handler gates on `cause === "max-attempts-exhausted"` and skips "non-retryable", so `buffer.fail()` deleted the entry without ever writing a PG row — the cancelled run disappeared silently. Mirror the SYSTEM_FAILURE fallback the non-cancel trigger path already uses: write a terminal row via the shared helper, falling back to the original throw only when the write also fails non-retryably (and re-throw retryable write errors so the drainer requeues). UX trade: the customer initiated a cancel but sees SYSTEM_FAILURE if the cancel write itself was structurally rejected. Terminal-but- mislabelled beats terminal silence — the alternative was the run disappearing from the dashboard entirely. The path is narrow (only non-conflict, non-retryable createCancelledRun failures). Two new unit tests cover the cancel branch: non-retryable creates the SYSTEM_FAILURE row + ACKs the entry; retryable rethrows so the drainer requeues. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… slice
`MollifierStaleSweepState.readOrgListSlice` was logging-and-returning
`{ orgs: [], total: 0 }` on pipeline-abort or per-result Redis errors.
The caller (`runStaleSweepOnce`) treated that as a clean empty cycle:
wrote cursor=0, reconciled visited envs against the empty result, and
cleared the stale-entry gauge — silencing the alerts the sweep exists
to raise.
Re-throw the underlying error instead. The interval wrapper's catch
logs `stale_sweep.failed` for the failed tick; durable cursor + counts
hash stay untouched so the gauge keeps reporting its last-known value
until a healthy tick repopulates it.
New unit test on the caller contract pins this: error propagates,
cursor stays at its seeded value, counts hash retains the seeded env,
no snapshot is reported.
Separately documents a known limitation in the .server-changes entry:
the sweep runs per-webapp-instance, so its stale-entry counter
multiplies by N webapps in HA until a distributed lease lands as a
follow-up. The system is HA-safe (Redis ops are atomic, no torn
state); only the metric output is inflated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ route wiring Synthesise QUEUED/FAILED responses from the mollifier buffer when a TaskRun row hasn't landed in Postgres yet. Wires the synthesis into: - ApiRetrieveRunPresenter - v1 trace GET route - v1 spans GET route - attempts route gains a GET loader (fixes pre-existing Remix 400) Stacked on the trigger-time decisions PR. The readFallback infra itself lives on the trigger PR (consumed by IdempotencyKeyConcern); this PR adds the route-level synthetic-rendering primitives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the ad-hoc \`as Record<string, unknown>\` + \`typeof === "string"\` checks in \`findBufferedRunRedirectInfo\` with a Zod \`safeParse\` against a schema for the subset of fields the redirect needs (envSlug / projectSlug / orgSlug / optional spanId). Wrong-typed or missing fields now collapse into a single parse-fail branch that logs the structured issue list and returns null. Adds a regression test for the structural-vs-typeof distinction: \`environment.slug: 42\` (number) is now rejected, where the previous \`typeof slug === "string"\` chain would silently accept any string- typed value but had no defence against shape drift in other fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l enum \`SyntheticRun.machinePreset\` is a plain string sourced from the mollifier snapshot, but \`SpanRun.machinePreset\` is the typed \`MachinePresetName\` enum (micro / small-1x / small-2x / medium-1x / medium-2x / large-1x / large-2x). The direct assignment failed \`tsc --noEmit\` and CI typecheck. Validate via \`MachinePresetName.safeParse\` and collapse unknown values to \`undefined\` so a stale preset returned by the buffer doesn't bleed into the UI as a typed-but-unknown value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five Devin findings on PR #3755: - ApiRetrieveRunPresenter.synthesiseFoundRunFromBuffer was assigning buffered.friendlyId to the FoundRun.id field. The id column on PG is the internal cuid; downstream log correlation reads taskRun.id and was getting the friendly token instead. Fixed to read the cuid that readFallback already derives via RunId.fromFriendlyId. - api.v1.runs.$runId.trace.ts buffered branch hardcoded isPartial: true. Cancelled is a terminal state — the sibling spans route and syntheticTrace already gate this on !isCancelled. Match. - synthesisePayload helper replaces the silent typeof === "string" coercion. Object-shaped payloads now JSON- stringify (matching how the trigger path would serialise them) with a warn log so format drift is visible. Truly unserialisable inputs fall back to "" with an error log instead of silently dropping. - syntheticRedirectInfo now uses deserialiseMollifierSnapshot (the webapp-side wrapper) instead of deserialiseSnapshot from the redis-worker package directly. Both share the same implementation today, but pinning the wrapper means the two read-side modules can't drift if the snapshot encoding ever changes (e.g. msgpack). - attempts route loader verifies the run belongs to the authenticated environment (PG-first, buffer fallback) before returning the parity-empty list. Other run-scoped endpoints (spans, trace, retrieve) 404 cross-env; matching that closes the exists-vs-doesn't-exist side channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The synthetic SpanRun/trace builders for buffered runs hardcoded non-terminal state, so a CANCELED or FAILED buffered run rendered as a healthy in-progress run: - syntheticSpanRun: FAILED now maps to SYSTEM_FAILURE (matching ApiRetrieveRunPresenter.bufferedStatusToTaskRunStatus); isFinished is true for CANCELED/FAILED; isError is true for FAILED; the error block is synthesised as STRING_ERROR and statusReason carries the message. - syntheticSpanRun: drop the empty-string spanId/taskIdentifier relationship stubs (blank task name + misleading `?span=` jump) since the snapshot only carries friendly IDs. - syntheticTrace: FAILED now renders as an errored, non-partial, "failed" root span instead of executing/partial. CANCELED stays "completed", matching RunPresenter's derivation. - tests: cover the CANCELED and FAILED terminal paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…truction
Addresses the higher-confidence read-fallback review findings:
- attempts GET loader: rebuilt on createLoaderApiRoute so it matches the
sibling read routes — accepts JWTs with run/task/tag/batch resource
scoping (was bare authenticateApiRequest, rejecting PUBLIC_JWT and doing
no scope check), and 404s with `x-should-retry: true` so SDK pollers keep
retrying a not-yet-materialised run instead of giving up.
- batch reconstruction: the snapshot embeds the batch as `{ id, index }`
(engine.trigger shape), but readFallback read a non-existent flat
`batchId`, so SyntheticRun.batchId was always undefined. Read it from
`snapshot.batch.id` (the internal cuid). synthesiseFoundRunFromBuffer now
populates `batch` from it, and the spans/trace buffer-path authorization
pushes the batch resource — so batch-scoped JWTs authorise against
buffered runs and the retrieve response reports the correct batchId.
- metadata: coerce a non-string buffered metadata defensively (JSON
stringify + warn) instead of silently dropping to null, mirroring
synthesisePayload. In practice metadata is always a string, so this is a
no-op guard, but it surfaces format drift to ops.
- tests: cover batchId extraction from the nested batch object and its
absence for non-batched runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comments referenced internal phase labels ("Phase A2", "Phase A5") from
the development plan rather than describing what the code does. Replaced
with self-contained prose; the surrounding explanations were already
correct and are preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…derivation syntheticTrace.server.ts shipped without a test file; this adds one, covering the identity-field passthrough, taskIdentifier-and-spanId defaults, the three rootSpanStatus branches (QUEUED→executing, CANCELED→completed, FAILED→failed) with their isPartial/isError/isCancelled flags, the 1ms duration floor, rootStartedAt mapping, and the single-span trace shape (empty events/timelineEvents, empty linkedRunIdBySpanId, undefined overridesBySpanId/queuedDuration). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…utes + align workerQueue default Addresses three Devin review findings on PR #3755: - api.v1.runs.\$runId.spans.\$spanId.ts and api.v1.runs.\$runId.trace.ts: the buffered-run response branch hardcoded isError:false and only checked CANCELED for isPartial, so a FAILED buffered run rendered as "still in progress" — SDK consumers would poll forever. Now derives both flags from CANCELED and FAILED, matching syntheticTrace.server.ts. - ApiRetrieveRunPresenter.synthesiseFoundRunFromBuffer: workerQueue defaulted to "main" while syntheticSpanRun.server.ts uses "". The API response's `region` is sourced via `run.workerQueue || undefined`, so "main" was advertising a region the run hadn't yet been assigned to. Aligned to "" so unassigned buffered runs coerce to region: undefined. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…drop scaffolding GET on attempts Two related cleanups on the mollifier read surface: 1. Extract the buffered-run response bodies for the spans-detail and trace endpoints into pure helpers (apps/webapp/app/v3/mollifier/syntheticApiResponses.server.ts: buildSyntheticSpanDetailBody, buildSyntheticTraceBody). The route bodies were carrying the only copy of the terminal-state derivation (CANCELED / FAILED → isError / isPartial / isCancelled) with no unit coverage; extracting them lets us pin the contract directly. The route files now just authenticate, resolve, validate the spanId, and forward — no body shape logic in routes. 2. Drop the GET loader on api.v1.runs.\$runParam.attempts.ts. It was added in this PR solely to fix a pre-existing Remix "no loader" 400 on a URL no SDK consumer was actually calling, and to give the mollifier-parity script a stable assertion target. The detailed attempt list lives on the v3 retrieve endpoint — the GET was scaffolding rather than product surface, and Devin's review flagged it as such. Reverted to action-only. Tests: 16 cases in apps/webapp/test/mollifierSyntheticApiResponses.test.ts covering QUEUED / CANCELED / FAILED for each body, plus identity and default-field passthrough. Pins the FAILED-terminal-state regression that shipped briefly with isPartial:true. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pin synthesise contract with regression tests synthesiseFoundRunFromBuffer hardcoded `scheduleId: null`, which dropped the schedule field from the retrieve-API response for any scheduled trigger that landed in the mollifier buffer. Scheduled triggers go through the same TriggerTaskService path as API triggers and the gate doesn't bypass them, so the snapshot does carry scheduleId; the synthesis was just throwing it away. Forward `buffered.scheduleId ?? null` so resolveSchedule() can hydrate the schedule object from PG (the Schedule row exists before the trigger fires). Exported synthesiseFoundRunFromBuffer + the FoundRun type from the presenter and added apps/webapp/test/mollifierSynthesiseFoundRun.test.ts (16 cases). The test file pins the snapshot→FoundRun mapping that previously had no direct coverage — the new scheduleId forwarding plus earlier-session regressions (batch reconstruction, workerQueue default "", FAILED→SYSTEM_FAILURE status mapping, STRING_ERROR shape, defensive metadata coercion, idempotency defaults, execution-state zero defaults). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ns in events endpoint `api.v1.runs.$runId.events.ts` calls `ApiRetrieveRunPresenter.findRun`, which now returns a buffered fallback. The route then proceeds to `getRunEvents` against ClickHouse with `run.traceId` (empty string for buffered runs) and `run.taskEventStore` (`"taskEvent"` default) — a guaranteed-empty round trip, since the mollifier gate intercepts BEFORE any trace event is written. Devin's analysis on PR #3755 flagged this as a wasted ClickHouse round trip per request hitting the events endpoint with a buffered runId. Add an `isBuffered: boolean` flag to `FoundRun`. The PG branch of `findRun` sets `isBuffered: false` on the spread Prisma row; the buffered branch's `synthesiseFoundRunFromBuffer` sets `isBuffered: true`. The events route short-circuits on the flag and returns `{ events: [] }` with status 200, matching the semantically-correct behaviour without the ClickHouse hop. Gating on the explicit flag (rather than probing surrogates like `traceId === ""`) keeps the contract clear for future callers that need the same "PG-only" gate — they can introspect the same field. Includes a unit test in `mollifierSynthesiseFoundRun.test.ts` pinning `isBuffered: true` on the synth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PG-resident SYSTEM_FAILURE rows always have completedAt populated by the engine, so SDK consumers that stop polling when isCompleted=true and then read finishedAt see a real timestamp. The buffer-synth path was returning completedAt=null for FAILED (only the CANCELED branch filled it from buffered.cancelledAt), producing isCompleted:true + finishedAt:undefined — exactly the inconsistency the synth shape exists to prevent. Fall back to buffered.createdAt for the SYSTEM_FAILURE case (the buffer entry has no separate failedAt; createdAt is the best-available proxy for when the terminal state landed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetric fix to the just-applied `ApiRetrieveRunPresenter` change. PG-resident SYSTEM_FAILURE rows always have `completedAt` populated; the synthetic span run was returning null for FAILED, so the run-detail panel and any caller using `isFinished && completedAt` saw a finished run with no completion timestamp. Fall back to `run.createdAt` when `isFailed` is true (the buffer entry has no separate failedAt — createdAt is the best proxy for when the terminal state landed). Regression locked in `mollifierSyntheticSpanRun.test.ts` — the FAILED test now asserts `completedAt === NOW`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| }, | ||
| }, | ||
| }); | ||
|
|
||
| if (pgRow) return { ...pgRow, isBuffered: false }; | ||
|
|
||
| // Postgres miss → fall back to the mollifier buffer. When the gate | ||
| // diverted a trigger, the run lives in Redis until the drainer replays | ||
| // it through engine.trigger. Synthesise the FoundRun shape so call() | ||
| // returns a `QUEUED` (or `FAILED`) response with empty output, no | ||
| // attempts, no relations. | ||
| const buffered = await findRunByIdWithMollifierFallback({ | ||
| runId: friendlyId, | ||
| environmentId: env.id, | ||
| organizationId: env.organizationId, | ||
| }); | ||
|
|
||
| if (!buffered) return null; | ||
|
|
||
| return synthesiseFoundRunFromBuffer(buffered); | ||
| } |
There was a problem hiding this comment.
🚩 Implicit behavioral change: findRun now returns buffered runs for all call sites
The PR modified ApiRetrieveRunPresenter.findRun to include a mollifier buffer fallback. This affects ALL callers of findRun, not just the routes modified in this PR. The api.v3.runs.$runId.ts:19 and api.v1.runs.$runParam.reschedule.ts:66 routes also call findRun but were not updated. For api.v3.runs.$runId.ts, the presenter.call() method handles the synthesised FoundRun correctly (PENDING status skips output processing, resolveSchedule does a PG lookup that may miss but returns undefined gracefully). For the reschedule route, the downstream service would receive a synthetic FoundRun and attempt operations on a run that doesn't exist in PG yet — worth confirming that the reschedule service handles this gracefully (e.g., fails with a clear error rather than corrupting state).
(Refers to lines 107-160)
Was this helpful? React with 👍 or 👎 to provide feedback.
| try { | ||
| await PerformTaskRunAlertsService.enqueue(failedRun.id); | ||
| } catch (alertsError) { | ||
| logger.warn("TriggerFailedTaskService: alert enqueue failed", { | ||
| taskId: request.taskId, | ||
| friendlyId: failedRun.friendlyId, | ||
| error: | ||
| alertsError instanceof Error ? alertsError.message : String(alertsError), | ||
| }); | ||
| } |
There was a problem hiding this comment.
🚩 TriggerFailedTaskService now enqueues alerts where it previously didn't
Both call() (line 226) and callWithoutTraceEvents() (line 332) in TriggerFailedTaskService now call PerformTaskRunAlertsService.enqueue(failedRun.id) after createFailedTaskRun. Previously, failed task runs through this service never triggered customer alert channels — the runFailed event was never emitted, and no manual alert enqueue existed. This is an intentional feature addition (the comments explain the reasoning), but it changes the observable behavior: batch trigger failures that exceed queue limits, environment-not-found errors, etc. will now fire ERROR alert channels for customers who have them configured. If rollout needs to be gradual, this could be gated behind a feature flag.
Was this helpful? React with 👍 or 👎 to provide feedback.
`.server-changes/mollifier-reads.md` mentioned an "attempts" endpoint but no `api.v1.runs.$runParam.attempts.ts` changes exist in this PR (it's a POST/action route for creating attempts, not a read endpoint). The actual added endpoints are retrieve, trace, spans, and events — fix the changelog to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Synthesise QUEUED/FAILED responses from the mollifier buffer when a TaskRun row hasn't landed in Postgres yet. Wires the synthesis into:
ApiRetrieveRunPresenterThe
readFallbackinfra itself lives on the trigger PR (consumed byIdempotencyKeyConcern); this PR adds the route-level synthetic-rendering primitives.Stacked on the replay PR.
Test plan