feat: OpenTelemetry support to export metrics and tracing to an Otel collector #2807

Open
Ignacio-Vidal wants to merge 8 commits into stoplightio:master from Ignacio-Vidal:feature/otel-telemetry-phase3

@Ignacio-Vidal Ignacio-Vidal commented Jun 14, 2026

Summary

Adds OpenTelemetry support to the Prism HTTP service: distributed tracing, metrics, and log↔trace correlation for both mock and proxy, exported to any OTLP-compatible collector (Jaeger, Grafana Tempo, Datadog, the OpenTelemetry Collector, grafana/otel-lgtm, etc.).

Telemetry is opt-in and off by default — when disabled, Prism behaves exactly as before and the OTel SDK is never started (zero overhead, no behavior change).

What's included

  • Tracing — each request produces a server span (METHOD /path) carrying request method, URL path, response status code, and recorded exceptions on errors. In proxy mode, outbound upstream calls are traced via HTTP auto-instrumentation.
  • Metrics — request count + latency histogram (tagged with method/path/status), plus Node.js VM metrics (event loop delay/utilization, GC duration, V8 heap) via @opentelemetry/instrumentation-runtime-node. Enabled together with tracing by the single --otel-telemetry flag.
  • Log ↔ trace correlation: createLogger injects the active span's trace_id/span_id into every log record (via a pino mixin), and the CLI terminal output appends [trace=<id>]. This lets you pivot between a log line and its trace in the backend.
  • Configurable exporter — OTLP endpoint, service name, and transport (http/protobuf or grpc) are all configurable, each with the standard OTEL_* env-var fallback.
  • Graceful shutdown — traces/metrics buffered by the batch processors are flushed on SIGINT/SIGTERM (single- and multi-process), so the last batch isn't dropped on exit.
  • Multiprocess support — telemetry initializes in the cluster worker; the primary forwards shutdown so the worker flushes before exit.
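
The log correlation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SpanContextLike`, `makeTraceMixin`, and the injected `getActiveSpanContext` callback are hypothetical names. In a real setup the callback would presumably wrap `trace.getActiveSpan()?.spanContext()` from `@opentelemetry/api`, and the returned function would be passed as pino's `mixin` option.

```typescript
// Minimal shape of the span context fields used for correlation (hypothetical type).
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

// Builds a pino-style mixin: a function whose return value is merged into
// every log record. The span lookup is injected so this sketch runs without
// the OTel SDK installed.
function makeTraceMixin(
  getActiveSpanContext: () => SpanContextLike | undefined
): () => Record<string, string> {
  return () => {
    const ctx = getActiveSpanContext();
    // With no active span (or telemetry disabled), add nothing to the record.
    return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  };
}

// Usage with a stand-in for the active-span lookup.
let activeSpanContext: SpanContextLike | undefined;
const traceMixin = makeTraceMixin(() => activeSpanContext);

const fieldsOutsideSpan = traceMixin();
activeSpanContext = {
  traceId: "0af7651916cd43dd8448eb211c80319c",
  spanId: "b7ad6b7169203331",
};
const fieldsInsideSpan = traceMixin();
```

Because the mixin is evaluated per log call, records written outside a request carry no trace fields, while records written during a request carry the ids of that request's span.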

Configuration

| Concern | CLI flag | Env var fallback | Default |
| --- | --- | --- | --- |
| Enable | --otel-telemetry | PRISM_TELEMETRY | false |
| Collector URL | --otel-exporter-url | OTEL_EXPORTER_OTLP_ENDPOINT | exporter default |
| Service name | --otel-service-name | OTEL_SERVICE_NAME | prism |
| Transport | --otel-exporter-protocol | OTEL_EXPORTER_OTLP_PROTOCOL | http/protobuf |

A single --otel-telemetry flag enables tracing and metrics together. CLI flags take precedence over env vars.
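
The precedence rule (CLI flag, then env var, then default) could be expressed along these lines. This is a sketch under assumptions: `TelemetryFlags`, `TelemetryConfig`, and `resolveTelemetryConfig` are hypothetical names, flags that were not passed are assumed to be `undefined`, and treating only the literal string "true" as enabling `PRISM_TELEMETRY` is a guess rather than something the PR states.

```typescript
// Hypothetical shapes; the flag and env-var names come from the table above.
interface TelemetryFlags {
  otelTelemetry?: boolean;
  otelExporterUrl?: string;
  otelServiceName?: string;
  otelExporterProtocol?: string;
}

interface TelemetryConfig {
  enabled: boolean;
  exporterUrl: string | undefined; // undefined: leave it to the exporter's own default
  serviceName: string;
  protocol: string;
}

type EnvVars = Record<string, string | undefined>;

// Resolve each setting as: CLI flag, else env var, else documented default.
function resolveTelemetryConfig(flags: TelemetryFlags, envVars: EnvVars): TelemetryConfig {
  return {
    // Assumption: only the literal string "true" switches telemetry on via env.
    enabled: flags.otelTelemetry ?? (envVars["PRISM_TELEMETRY"] === "true"),
    exporterUrl: flags.otelExporterUrl ?? envVars["OTEL_EXPORTER_OTLP_ENDPOINT"],
    serviceName: flags.otelServiceName ?? envVars["OTEL_SERVICE_NAME"] ?? "prism",
    protocol:
      flags.otelExporterProtocol ?? envVars["OTEL_EXPORTER_OTLP_PROTOCOL"] ?? "http/protobuf",
  };
}

// Usage: the flag wins for the service name, env enables telemetry,
// and the protocol falls through to its default.
const resolvedConfig = resolveTelemetryConfig(
  { otelServiceName: "prism-mock" },
  { PRISM_TELEMETRY: "true", OTEL_SERVICE_NAME: "from-env" }
);
```

Taking the environment as a parameter instead of reading it globally keeps the resolution logic easy to unit test.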

prism mock api.oas.yaml \
  --otel-telemetry \
  --otel-exporter-url http://localhost:4318/v1/traces \
  --otel-service-name prism-mock

For metrics, the OTLP /v1/traces URL is automatically rewritten to /v1/metrics, so the one --otel-exporter-url covers both signals.
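
A rewrite of this kind might look like the sketch below. `toMetricsUrl` is a hypothetical helper, and the choice to leave any URL that does not end in /v1/traces untouched is an assumption; the PR description does not say how other paths or query strings are handled.

```typescript
// Hypothetical helper: derive the metrics endpoint from a traces endpoint.
function toMetricsUrl(tracesUrl: string): string {
  // Rewrite only a trailing /v1/traces segment (with or without a final slash);
  // any other URL is returned unchanged.
  return tracesUrl.replace(/\/v1\/traces\/?$/, "/v1/metrics");
}

const derivedMetricsUrl = toMetricsUrl("http://localhost:4318/v1/traces");
const untouchedUrl = toMetricsUrl("http://localhost:4318/custom");
```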

Benefits

  • See request-level traces, latency, status codes, and Node runtime health from Prism in your existing observability stack — previously the only signal was stdout logs.
  • Log lines carry the trace id, so you can jump from a log to its distributed trace and back.
  • Vendor-neutral via OTLP; no dependency on any specific backend.
  • Zero overhead and no behavior change when telemetry is off (the default).

Also included: a standalone bug fix

fix(cli): correct node:cluster import for multiprocess mode — multiprocess mode (--multiprocess, the default when NODE_ENV=production, e.g. in the Docker image) crashed on startup with Cannot read properties of undefined (reading 'isPrimary'). The node:cluster default import resolved to undefined at runtime because the project builds CommonJS without esModuleInterop. This is required for multiprocess telemetry to work and fixes multiprocess mode in general.
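
The failure mode can be reproduced in miniature without Node's cluster module. In this sketch, `clusterModule` is a stand-in object (an assumption for illustration) shaped like what a CommonJS require of node:cluster returns: the API sits on the module object and there is no `default` key.

```typescript
// Stand-in for the object returned by require('node:cluster') under CommonJS.
interface ClusterLike {
  isPrimary: boolean;
}
const clusterModule: ClusterLike & { default?: ClusterLike } = { isPrimary: true };

// Without esModuleInterop, `import cluster from 'node:cluster'` compiles to
// roughly `require('node:cluster').default`, which is undefined here, so
// reading `.isPrimary` from it would throw at startup.
const viaDefaultImport: ClusterLike | undefined = clusterModule.default;

// `import cluster = require('node:cluster')` compiles to a plain require call
// and therefore yields the module object itself.
const viaImportEquals: ClusterLike = clusterModule;

const primaryFlag = viaImportEquals.isPrimary;
```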

Implementation notes

  • Telemetry is initialized before the CLI logger is created so the active-span lookup in the logger mixin is meaningful from the first request.
  • Metric instruments are created inside createServer (after the MeterProvider is registered), not at module load, so they bind to the real meter rather than a no-op.
  • Log correlation uses a pino mixin rather than @opentelemetry/instrumentation-pino: pino is required at module load, before the flag-driven SDK init runs, so the auto-instrumentation's require-hook can't patch it.
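
The instrument-timing note can be illustrated with a simplified model of a global meter registry. Nothing here is the real `@opentelemetry/api`; `Meter`, `Counter`, `getMeter`, and `registeredMeter` are stand-ins that mimic the behavior the note relies on: before a provider is registered, lookups return a no-op meter, and instruments created from it stay no-ops.

```typescript
// Simplified stand-ins for the metrics API (not the real OTel types).
interface Counter {
  add(value: number): void;
}
interface Meter {
  createCounter(counterName: string): Counter;
}

const recordedPoints: string[] = [];

// Before registration, lookups yield a meter whose instruments discard data.
const noopMeter: Meter = { createCounter: () => ({ add: () => {} }) };
const realMeter: Meter = {
  createCounter: (counterName) => ({
    add: (value) => {
      recordedPoints.push(`${counterName}+${value}`);
    },
  }),
};

let registeredMeter: Meter = noopMeter;
const getMeter = (): Meter => registeredMeter;

// Created "at module load": permanently bound to the no-op meter.
const earlyCounter = getMeter().createCounter("http.server.request.count");

// Stand-in for initTelemetry registering the MeterProvider.
registeredMeter = realMeter;

// Created "inside createServer": bound to the real meter.
const lateCounter = getMeter().createCounter("http.server.request.count");

earlyCounter.add(1); // silently dropped
lateCounter.add(1); // recorded
```

Only the counter created after registration produces data, which is why the instruments are created inside createServer rather than at import time.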

Testing

  • Unit tests for telemetry init (enabled/disabled, HTTP/gRPC, metrics) and the shutdown-flush handler.
  • End-to-end harness test asserting --otel-telemetry does not change served responses.
  • Manually validated end to end against a grafana/otel-lgtm collector (also added as a runnable demo): traces visible in Tempo, request + Node VM metrics in Prometheus, and trace_id in logs — over both HTTP and gRPC, single- and multi-process.

Checklist

  • The basics
    • I tested these changes manually in my local or dev environment
  • Tests
    • Added or updated
  • Event Tracking
    • N/A
  • Error Reporting
    • N/A

Add OpenTelemetry tracing to the Prism HTTP service, toggled via a
configuration flag with a configurable OTLP/HTTP exporter.

- New telemetry module (initTelemetry) sets up the OTel NodeSDK with an
  OTLP/HTTP trace exporter and HTTP auto-instrumentation; no-op when disabled.
- Micri request handler is wrapped in a server span when enabled, recording
  method, path, response status code, and exceptions.
- CLI flags --telemetry, --otel-exporter-url, --otel-service-name on both
  mock and proxy, with PRISM_TELEMETRY / OTEL_* env-var fallbacks.
- Flush and shut down the OTel SDK on SIGINT/SIGTERM so spans buffered by the
  BatchSpanProcessor are exported instead of dropped on exit.
- Unit tests for the telemetry module and the shutdown-flush handler.

Single-process only; multiprocess hardening and metrics deferred to later phases.

Refs stoplightio#2804

Verifies that enabling OpenTelemetry tracing via --telemetry does not affect
the served response: the server span wraps the request handler transparently
and the mock response is returned as normal.

Refs stoplightio#2804

Multiprocess mode crashed on startup with "Cannot read properties of undefined
(reading 'isPrimary')". The default import of node:cluster compiled to a
`.default` access, which is undefined at runtime because the project builds
CommonJS without esModuleInterop and node:cluster's API lives on the module
object itself. Use an import-equals require and reconcile the @types/node
default-export typing with the runtime shape.

Build on the Phase 1 tracing support with the remaining Phase 2 items.

- gRPC OTLP exporter: new --otel-exporter-protocol flag (http/protobuf | grpc),
  with OTEL_EXPORTER_OTLP_PROTOCOL env-var fallback. initTelemetry selects the
  OTLP/gRPC or OTLP/HTTP trace exporter accordingly.
- Multiprocess support: the cluster worker already initializes telemetry via the
  shared server path; the primary now forwards a graceful SIGTERM to the worker
  on shutdown so buffered spans are flushed before exit.
- Unit test for gRPC exporter selection.

Metrics remain deferred to Phase 3.

Refs stoplightio#2804

Add metrics export alongside traces, gated by a new --otel-metrics flag
(requires --telemetry).

- Prism request metrics: http.server.request.count (counter) and
  http.server.request.duration (latency histogram), tagged with method,
  path, and response status code.
- Node.js VM metrics via @opentelemetry/instrumentation-runtime-node:
  event loop delay/utilization, GC duration, and V8 heap usage.
- Metrics export through a PeriodicExportingMetricReader using the same
  HTTP/gRPC protocol selection as traces; the OTLP /v1/traces URL is
  rewritten to /v1/metrics so a single --otel-exporter-url covers both.
- Instruments are created in createServer (after the MeterProvider is
  registered by initTelemetry), not at module load, so they bind to the
  real meter rather than a no-op.
- Unit tests for the metrics and gRPC-metrics paths.

Refs stoplightio#2804
@Ignacio-Vidal Ignacio-Vidal requested a review from a team as a code owner June 14, 2026 10:01
@Ignacio-Vidal Ignacio-Vidal changed the title Feature/Add OpenTelemetry support for metrics and tracing and exporting to an Otel collector feat - OpenTelemetry support for metrics and tracing and exporting to an Otel collector Jun 14, 2026
@Ignacio-Vidal Ignacio-Vidal changed the title feat - OpenTelemetry support for metrics and tracing and exporting to an Otel collector feat: OpenTelemetry support for metrics and tracing and exporting to an Otel collector Jun 14, 2026
@Ignacio-Vidal Ignacio-Vidal changed the title feat: OpenTelemetry support for metrics and tracing and exporting to an Otel collector feat: OpenTelemetry support to export metrics and tracing to an Otel collector Jun 14, 2026

- Replace --telemetry and --otel-metrics with a single --otel-telemetry
  boolean that enables tracing and metrics together. Simpler surface; one
  switch for the whole OTel pipeline.
- Add log <-> trace correlation: createLogger now injects the active span's
  trace_id/span_id into every log record via a pino mixin, and the CLI
  terminal output appends [trace=<id>] when present. A mixin is used (rather
  than @opentelemetry/instrumentation-pino) because pino is required at
  module load, before the flag-driven SDK init runs, so the auto-
  instrumentation's require-hook would never patch it.
- Initialize telemetry before the CLI logger is created so the mixin's
  active-span lookup is meaningful from the first request.

Refs stoplightio#2804
@Ignacio-Vidal Ignacio-Vidal force-pushed the feature/otel-telemetry-phase3 branch from 0269629 to bbec166 Compare June 14, 2026 10:51
…ry flag

- Drop explanatory comments from createServer.ts.
- Update the telemetry harness spec to the renamed --otel-telemetry flag.

Refs stoplightio#2804