fix(webapp): restore Postgres fallback for non-ClickHouse OTLP spans#3803
fix(webapp): restore Postgres fallback for non-ClickHouse OTLP spans#3803matt-aitken wants to merge 4 commits into
Conversation
Environments with taskEventStore = 'taskEvent' / 'taskEventPartitioned' (Postgres-backed runs not yet migrated to ClickHouse) were hitting a throw in buildEventRepository when the org-scoped ClickHouse factory received those store values. The throw happened inside the grouping loop of #exportEvents, causing the entire OTLP request to return HTTP 500 — which the OpenTelemetry collector treats as non-retryable, silently dropping the full batch. Fix: guard the getEventRepositoryForOrganizationSync call in #exportEvents to only invoke the ClickHouse factory for clickhouse/clickhouse_v2 stores. All other store values (taskEvent, taskEventPartitioned, postgres) route directly to the existing Postgres eventRepository, mirroring the pattern already used by resolveEventRepositoryForStore and getEventRepositoryForStore in eventRepository/index.server.ts. Also wrap the factory call in a try/catch that falls back to Postgres so any future unexpected store values in an OTLP batch degrade gracefully instead of failing the whole request.
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
🧰 Additional context used📓 Path-based instructions (7)**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
{packages/core,apps/webapp}/**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.ts📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
Files:
apps/webapp/**/*.{ts,tsx}📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Files:
apps/webapp/**/*.server.ts📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
Files:
**/*.{js,ts,jsx,tsx,json,md,yml,yaml}📄 CodeRabbit inference engine (AGENTS.md)
Files:
🧠 Learnings (8)📚 Learning: 2026-03-22T13:26:12.060ZApplied to files:
📚 Learning: 2026-03-22T19:24:14.403ZApplied to files:
📚 Learning: 2026-05-18T08:21:27.694ZApplied to files:
📚 Learning: 2026-05-18T08:21:27.694ZApplied to files:
📚 Learning: 2026-03-29T19:16:28.864ZApplied to files:
📚 Learning: 2026-05-05T09:38:02.512ZApplied to files:
📚 Learning: 2026-05-12T21:04:05.815ZApplied to files:
📚 Learning: 2026-05-14T08:21:07.614ZApplied to files:
🔇 Additional comments (1)
WalkthroughexportEvents now resolves its destination repository based on taskEventStore: for "clickhouse" and "clickhouse_v2" it uses ClickHouseFactory.getEventRepositoryForOrganizationSync(...); for any other value it uses the server-side eventRepository with routing key "postgres:default". The PR also adds a changelog note describing OTLP ingest HTTP 500 behavior for Postgres-backed task event stores. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
As pointed out in code review: the guard on line 126 already screens out all non-ClickHouse store values, so the only values reaching getEventRepositoryForOrganizationSync are 'clickhouse' and 'clickhouse_v2' — both valid. The broad catch was unnecessary and harmful: a real ClickHouse resolution failure would have been silently routed to Postgres, writing spans where they'd be invisible on the ClickHouse read path.
Problem
On environments where runs carry a Postgres-backed
taskEventStorevalue (taskEventortaskEventPartitioned), OTLP ingest endpoints (POST /otel/v1/tracesand/otel/v1/logs) were returning HTTP 500.Root cause: The org-scoped ClickHouse factory introduced in a recent PR routes all OTLP spans through
getEventRepositoryForOrganizationSync→buildEventRepository. That function only handles"clickhouse"and"clickhouse_v2"store values and throwsUnknown ClickHouse event repository store: <value>for anything else. The throw occurred inside the grouping loop of#exportEvents, unwinding the entire method and returning 500 for the whole batch.The OpenTelemetry collector's
otlphttpexporter treats HTTP 500 as non-retryable and drops the batch — causing real span loss.Fix: Guard the
getEventRepositoryForOrganizationSynccall in#exportEventsso it is only invoked forclickhouse/clickhouse_v2store values. All other values are routed directly to the PostgreseventRepository, matching the guard pattern already present inresolveEventRepositoryForStoreandgetEventRepositoryForStoreineventRepository/index.server.ts.The ClickHouse factory call is also wrapped in a try/catch that falls back to Postgres so any unexpected store value in a future OTLP batch degrades gracefully instead of failing the whole request.
Changes
apps/webapp/app/v3/otlpExporter.server.ts— add Postgres routing guard and try/catch fallback in#exportEventsTesting
The
eventRepository/index.server.tsmodule already has the same guard pattern thoroughly covered. The fix brings#exportEventsinto alignment with that existing, tested pattern. Manual verification: confirm OTLP batches containing Postgres-store spans return 200 and route to the correct repository.