
fix(laminar): drain OTel processors before process.exit() to stop losing final agent spans #50

Merged

Alezander9 merged 1 commit into main from fix/laminar-flush-tracing-on-exit on May 9, 2026

Conversation

@Alezander9 (Member)

Summary

Laminar traces were missing the final agent LLM span on ~27% of overnight eval runs. Eddie's measurements:

| eval | n | missing-final | rate |
| --- | ---: | ---: | ---: |
| gemini-3-flash-preview | 99 | 75 | 75.8% |
| glm-5.1 (3 seeds pooled) | 300 | 102 | 34.0% |
| mimo-v2.5-pro | 100 | 14 | 14.0% |
| gpt-5.5 | 100 | 1 | 1.0% |
| claude-opus-4-7 | 100 | 0 | 0.0% |
| all | 699 | 192 | 27.5% |

Score impact is nil (the runner reads agent steps from --format json stdout, not from spans), but it makes the Laminar UI useless for anyone digging into a glm/gemini trace: the closing text-emit LLM call is simply not there.

Root cause

Three problems compose:

  1. Plugin event hooks are fire-and-forget. packages/opencode/src/plugin/index.ts:249 does void hook["event"]?.({event}). The bcode-laminar session.idle handler ends the turn span and await processor.forceFlush(), but its Promise is discarded by the bus subscriber.
  2. process.exit() is unconditional. packages/opencode/src/index.ts:252 calls it in the top-level finally. Node's event loop is killed mid-flight; any in-flight gRPC export from the BatchSpanProcessor is severed.
  3. BSP defaults. maxExportBatchSize=512, scheduledDelayMillis=5000ms. The very last spans of a run sit in the queue waiting either for the 5 s timer or our forceFlush — and forceFlush loses the race to process.exit().
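
A self-contained repro of the race (illustrative only; fakeExport and onSessionIdle are stand-ins, not names from the codebase):

```ts
// fakeExport stands in for the BatchSpanProcessor's pending gRPC export that
// the bcode-laminar session.idle handler kicks off via forceFlush().
async function fakeExport(): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 200));
  console.log("export finished"); // never prints
}

function onSessionIdle(): void {
  // Mirrors `void hook["event"]?.({event})`: the handler's Promise is discarded.
  void fakeExport();
}

onSessionIdle();
// Mirrors the unconditional exit in the top-level finally: the event loop is
// torn down before the pending export settles, so the last spans are lost.
process.exit(0);
```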

Why it's model-dependent

The bug only ever bites the very last LLM span. Models split into two camps:

  • Tool-call+text-in-same-step (claude-opus, gpt-5.5): they reason aloud while choosing the next action, so the final answer text rides on a step that also had a tool call earlier in the run. That doStream span ended many seconds before idle and was already flushed in an earlier batch. → 0–1% miss.
  • Tool-only-then-final-text (glm-5.1, gemini-3-flash): every middle turn is pure tool calls, then ONE extra LLM round emits only text with finishReason=stop. That lone final doStream span ends 50–200 ms before idle, so it's the freshest unflushed thing in the BSP queue when process.exit() fires. → 34–76% miss. Gemini-3-flash is the worst because its TTFB is fastest, so the gap between span-end and idle is the smallest.
  • Mimo at 14% sits in between (mixed behavior).

It's a clean continuum on "fraction of runs where the closing text is in a separate, late, lone span."

What this PR does

In the top-level finally of packages/opencode/src/index.ts, before process.exit():

  1. Fetch the global OTel TracerProvider via @opentelemetry/api (already a dep of packages/opencode).
  2. Duck-check for forceFlush (the API-level type doesn't declare it; BasicTracerProvider from @opentelemetry/sdk-trace-base, which NodeSDK.start() registers, implements it).
  3. Promise.race against a 3 s timeout so a wedged exporter cannot hang bcode on exit.

Generic OTel-API approach with no laminar import — benefits any future tracing plugin equally (DataDog, New Relic, OTLP collector, etc.).
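
A minimal sketch of that drain step, assuming the provider registered by NodeSDK.start() exposes forceFlush (as BasicTracerProvider does); drainTracing is a hypothetical helper name, not the literal code in the patch:

```ts
import { trace } from "@opentelemetry/api";

// Duck-typed view: neither method is declared on the API-level TracerProvider type.
type Flushable = { forceFlush?: () => Promise<void>; getDelegate?: () => unknown };

async function drainTracing(timeoutMs = 3_000): Promise<void> {
  let provider = trace.getTracerProvider() as Flushable;
  // The API may hand back a ProxyTracerProvider; unwrap it to reach the real provider.
  if (typeof provider.getDelegate === "function") {
    provider = provider.getDelegate() as Flushable;
  }
  if (typeof provider.forceFlush !== "function") return; // noop provider: nothing to drain
  const timeout = new Promise<void>((resolve) => setTimeout(resolve, timeoutMs));
  // A wedged exporter can delay exit by at most timeoutMs.
  await Promise.race([provider.forceFlush(), timeout]);
}
```

In the finally block this runs right before process.exit(); when the duck-check fails (no provider registered, or the API noop provider), it returns immediately and the exit path is unchanged.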

What this PR does NOT do

  • Does not change scheduledDelayMillis. The default 5 s exists to batch many spans into one gRPC call. Reducing it to 1 s would multiply gRPC traffic during long runs (~5×) without addressing the root race; once we drain at exit, the queue is fully flushed regardless of the timer, so the BSP defaults stay as-is.
  • Does not fix the underlying fire-and-forget event subscriber (src/plugin/index.ts:249's void hook["event"]?.(...)). That is the proper upstream fix and should land as a PR to anomalyco/opencode (sketch below); this patch is a tactical drain at the only place we control unconditionally: the exit point.
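
For reference, the upstream change would look roughly like this (the Hooks shape is assumed for illustration; the real types live in anomalyco/opencode):

```ts
// Assumed shapes for illustration only.
type BusEvent = { type: string; properties?: unknown };
type Hooks = { event?: (input: { event: BusEvent }) => Promise<void> };

async function publish(hook: Hooks, event: BusEvent): Promise<void> {
  // Today the subscriber does `void hook["event"]?.({event})`, dropping the Promise,
  // so the session.idle handler's forceFlush() can be severed by process.exit().
  // The upstream fix is to await it so handlers finish before the bus moves on.
  await hook.event?.({ event });
}
```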

Risk / blast radius

+12 lines in index.ts. One Yellow-zone touch (logged in EXCEPTIONS.md § "Drain OTel processors before process.exit()").

  • No OTel provider registered: the duck-check fails and the exit path is unchanged.
  • LMNR_PROJECT_API_KEY unset (the laminar plugin no-ops): the global TracerProvider is the API noop provider with no forceFlush, so the duck-check fails and the exit path is unchanged.
  • Exporter wedged: bounded 3 s wait, then process.exit() fires anyway.

Expected effect on the missing-final rate: ~27% → near-zero across all models.

Verify

  • bun typecheck clean (filtered + per-package).
  • Manual eval run on glm-5.1 / gemini-3-flash recommended to confirm rate drops. (Eddie can re-run a small slice with the patched binary.)

fix(laminar): drain OTel processors before process.exit() to stop losing final agent spans

Eval traces were missing the final LLM span on ~27% of runs (gemini-3-flash
76%, glm-5.1 34%, mimo 14%, gpt-5.5 1%, claude 0%). Root cause is a
process-exit race: the plugin event hook in packages/opencode/src/plugin/index.ts:249
is invoked fire-and-forget (`void hook["event"]?.(...)`), so the bcode-laminar
`session.idle` handler's `processor.forceFlush()` Promise is discarded, and
the unconditional `process.exit()` in the top-level `finally` (index.ts:252)
kills in-flight gRPC exports.

Model-dependence comes from emit shape: tool-only-then-final-text models
(glm-5.1, gemini-3-flash) make one extra tool-less LLM round at the end whose
lone `ai.streamText.doStream` span ends 50-200ms before idle and is the
freshest unflushed thing in the BatchSpanProcessor queue at exit. Tool-call+
text-in-same-step models (claude-opus, gpt-5.5) fold the final answer into a
step that ended seconds earlier and was already flushed in a prior batch.

Fix: in the top-level `finally` of index.ts, before `process.exit()`, fetch
the global OTel TracerProvider via `@opentelemetry/api`, duck-check
`forceFlush`, and race it against a 3 s timeout so a wedged exporter cannot
hang bcode on exit. Generic to any OTel-based plugin.

Does not touch the deeper bug (the fire-and-forget `event` hook); that is
the proper upstream fix to anomalyco/opencode.

@cubic-dev-ai (bot) left a comment:

No issues found across 1 file

@Alezander9 merged commit ff90631 into main on May 9, 2026
3 checks passed
