feat: OpenTelemetry support to export metrics and tracing to an Otel collector #2807
Open
Ignacio-Vidal wants to merge 8 commits into
Add OpenTelemetry tracing to the Prism HTTP service, toggled via a configuration flag with a configurable OTLP/HTTP exporter.

- New telemetry module (initTelemetry) sets up the OTel NodeSDK with an OTLP/HTTP trace exporter and HTTP auto-instrumentation; no-op when disabled.
- Micri request handler is wrapped in a server span when enabled, recording method, path, response status code, and exceptions.
- CLI flags --telemetry, --otel-exporter-url, --otel-service-name on both mock and proxy, with PRISM_TELEMETRY / OTEL_* env-var fallbacks.
- Flush and shut down the OTel SDK on SIGINT/SIGTERM so spans buffered by the BatchSpanProcessor are exported instead of dropped on exit.
- Unit tests for the telemetry module and the shutdown-flush handler.

Single-process only; multiprocess hardening and metrics deferred to later phases.

Refs stoplightio#2804
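The flush-on-signal behavior described in this commit can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the PR's code: `Flushable`, `SignalSource`, and `registerShutdownFlush` are hypothetical names, and the choice to exit with code 0 even when the flush fails is an assumption.

```typescript
// Hypothetical minimal shapes so the sketch needs no OpenTelemetry or Node typings.
interface Flushable {
  shutdown(): Promise<void>; // e.g. an SDK object that flushes buffered spans
}

interface SignalSource {
  once(signal: string, listener: () => void): unknown;
}

// Register one-shot handlers that flush telemetry before the process exits.
function registerShutdownFlush(
  sdk: Flushable,
  signals: SignalSource,
  exit: (code: number) => void
): void {
  let shuttingDown = false;

  const handler = (): void => {
    if (shuttingDown) return; // ignore a second signal while a flush is in progress
    shuttingDown = true;
    sdk
      .shutdown()
      .catch(() => undefined) // assumption: a failed export should not block exit
      .then(() => exit(0));
  };

  for (const signal of ['SIGINT', 'SIGTERM']) {
    signals.once(signal, handler);
  }
}
```

In a real Node process the `signals` argument would presumably be `process` and `exit` a thin wrapper around `process.exit`; both are passed in here only so the sketch can be exercised without killing the process.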
Verifies that enabling OpenTelemetry tracing via --telemetry does not affect the served response: the server span wraps the request handler transparently and the mock response is returned as normal. Refs stoplightio#2804
Multiprocess mode crashed on startup with "Cannot read properties of undefined (reading 'isPrimary')". The default import of node:cluster compiled to a `.default` access, which is undefined at runtime because the project builds CommonJS without esModuleInterop and node:cluster's API lives on the module object itself. Use an import-equals require and reconcile the @types/node default-export typing with the runtime shape.
Build on the Phase 1 tracing support with the remaining Phase 2 items.

- gRPC OTLP exporter: new --otel-exporter-protocol flag (http/protobuf | grpc), with OTEL_EXPORTER_OTLP_PROTOCOL env-var fallback. initTelemetry selects the OTLP/gRPC or OTLP/HTTP trace exporter accordingly.
- Multiprocess support: the cluster worker already initializes telemetry via the shared server path; the primary now forwards a graceful SIGTERM to the worker on shutdown so buffered spans are flushed before exit.
- Unit test for gRPC exporter selection.

Metrics remain deferred to Phase 3.

Refs stoplightio#2804
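As a rough illustration of the flag-then-env-var resolution for the exporter protocol, a sketch might look like the following. The helper names are hypothetical, the env object is passed in explicitly to keep the example self-contained, and rejecting unknown values with an error is an assumption rather than something this PR states.

```typescript
// Hypothetical helper: names and error behavior are illustrative, not taken from the PR.
type OtlpProtocol = 'http/protobuf' | 'grpc';

const DEFAULT_PROTOCOL: OtlpProtocol = 'http/protobuf';

function isOtlpProtocol(value: string): value is OtlpProtocol {
  return value === 'http/protobuf' || value === 'grpc';
}

// Resolve the protocol: CLI flag first, then the env var, then the default.
function resolveOtlpProtocol(
  flag: string | undefined,
  env: Record<string, string | undefined>
): OtlpProtocol {
  const candidate = flag ?? env['OTEL_EXPORTER_OTLP_PROTOCOL'];
  if (candidate === undefined) return DEFAULT_PROTOCOL;
  if (!isOtlpProtocol(candidate)) {
    // Assumption: unsupported values are rejected instead of silently falling back.
    throw new Error(`Unsupported OTLP protocol: ${candidate}`);
  }
  return candidate;
}
```

The resolved value would then decide which exporter (OTLP/gRPC or OTLP/HTTP) gets constructed.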
Add metrics export alongside traces, gated by a new --otel-metrics flag (requires --telemetry).

- Prism request metrics: http.server.request.count (counter) and http.server.request.duration (latency histogram), tagged with method, path, and response status code.
- Node.js VM metrics via @opentelemetry/instrumentation-runtime-node: event loop delay/utilization, GC duration, and V8 heap usage.
- Metrics export through a PeriodicExportingMetricReader using the same HTTP/gRPC protocol selection as traces; the OTLP /v1/traces url is rewritten to /v1/metrics so a single --otel-exporter-url covers both.
- Instruments are created in createServer (after the MeterProvider is registered by initTelemetry), not at module load, so they bind to the real meter rather than a no-op.
- Unit tests for the metrics and gRPC-metrics paths.

Refs stoplightio#2804
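The /v1/traces to /v1/metrics rewrite mentioned above could be sketched like this. `toMetricsUrl` is a hypothetical name, and leaving URLs without a traces suffix untouched is an assumption about the intended behavior.

```typescript
// Illustrative only: `toMetricsUrl` is a hypothetical name, not the PR's function.
function toMetricsUrl(exporterUrl: string): string {
  const url = new URL(exporterUrl);
  // Rewrite only when the path ends in the OTLP traces suffix (optional trailing slash);
  // any query string is preserved because only the pathname is touched.
  url.pathname = url.pathname.replace(/\/v1\/traces\/?$/, '/v1/metrics');
  return url.toString();
}

// toMetricsUrl('http://localhost:4318/v1/traces') → 'http://localhost:4318/v1/metrics'
```

Going through `URL` rather than a plain string replace keeps the rewrite scoped to the path, at the cost of normalizing the input (a bare host gains a trailing slash).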
- Replace --telemetry and --otel-metrics with a single --otel-telemetry boolean that enables tracing and metrics together. Simpler surface; one switch for the whole OTel pipeline.
- Add log <-> trace correlation: createLogger now injects the active span's trace_id/span_id into every log record via a pino mixin, and the CLI terminal output appends [trace=<id>] when present. A mixin is used (rather than @opentelemetry/instrumentation-pino) because pino is required at module load, before the flag-driven SDK init runs, so the auto-instrumentation's require-hook would never patch it.
- Initialize telemetry before the CLI logger is created so the mixin's active-span lookup is meaningful from the first request.

Refs stoplightio#2804
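A minimal sketch of the mixin idea, kept free of pino and OpenTelemetry imports so it runs on its own. `SpanContextLike`, `createTraceMixin`, and `withTraceSuffix` are hypothetical names; in real code the active span context would presumably come from the OpenTelemetry API, and the returned function would be passed as pino's `mixin` option.

```typescript
// Minimal stand-in for the span context fields used here (hypothetical type).
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

// Build a pino-style mixin: a function returning extra fields merged into each log record.
function createTraceMixin(
  getActiveSpanContext: () => SpanContextLike | undefined
): () => Record<string, string> {
  return () => {
    const ctx = getActiveSpanContext();
    // No active span (or telemetry disabled): add nothing to the record.
    return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  };
}

// Append the trace marker to a terminal line only when a trace id is present.
function withTraceSuffix(message: string, fields: Record<string, string>): string {
  const traceId = fields['trace_id'];
  return traceId ? `${message} [trace=${traceId}]` : message;
}
```

Because the lookup happens each time the mixin is called, it does not matter that the logger module was loaded before the SDK was initialized, which is the property the commit relies on.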
…ry flag

- Drop explanatory comments from createServer.ts.
- Update the telemetry harness spec to the renamed --otel-telemetry flag.

Refs stoplightio#2804
Summary
Adds OpenTelemetry support to the Prism HTTP service: distributed tracing, metrics, and log↔trace correlation for both `mock` and `proxy`, exported to any OTLP-compatible collector (Jaeger, Grafana Tempo, Datadog, the OpenTelemetry Collector, `grafana/otel-lgtm`, etc.).

Telemetry is opt-in and off by default: when disabled, Prism behaves exactly as before and the OTel SDK is never started (zero overhead, no behavior change).
What's included
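- Tracing: a server span per request (named `METHOD /path`) carrying request method, URL path, response status code, and recorded exceptions on errors. In proxy mode, outbound upstream calls are traced via HTTP auto-instrumentation.
- Metrics: request count (`http.server.request.count`) and request duration (`http.server.request.duration`), plus Node.js VM metrics via `@opentelemetry/instrumentation-runtime-node`. Enabled together with tracing by the single `--otel-telemetry` flag.
- Log↔trace correlation: `createLogger` injects the active span's `trace_id`/`span_id` into every log record (via a pino mixin), and the CLI terminal output appends `[trace=<id>]`. Lets you pivot between a log line and its trace in the backend.
- Exporter configuration: the exporter URL, service name, and protocol (`http/protobuf` or `grpc`) are all configurable, each with the standard `OTEL_*` env-var fallback.

Configuration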
| Flag | Env var | Default |
| --- | --- | --- |
| `--otel-telemetry` | `PRISM_TELEMETRY` | `false` |
| `--otel-exporter-url` | `OTEL_EXPORTER_OTLP_ENDPOINT` | |
| `--otel-service-name` | `OTEL_SERVICE_NAME` | `prism` |
| `--otel-exporter-protocol` | `OTEL_EXPORTER_OTLP_PROTOCOL` | `http/protobuf` |

A single `--otel-telemetry` flag enables tracing and metrics together. CLI flags take precedence over env vars.

For metrics, the OTLP `/v1/traces` URL is automatically rewritten to `/v1/metrics`, so the one `--otel-exporter-url` covers both signals.

Benefits
Also included: a standalone bug fix
`fix(cli): correct node:cluster import for multiprocess mode`: multiprocess mode (`--multiprocess`, the default when `NODE_ENV=production`, e.g. in the Docker image) crashed on startup with `Cannot read properties of undefined (reading 'isPrimary')`. The `node:cluster` default import resolved to `undefined` at runtime because the project builds CommonJS without `esModuleInterop`. This is required for multiprocess telemetry to work and fixes multiprocess mode in general.

Implementation notes
- Metric instruments are created in `createServer` (after the MeterProvider is registered), not at module load, so they bind to the real meter rather than a no-op.
- Log correlation uses a pino mixin rather than `@opentelemetry/instrumentation-pino`: pino is required at module load, before the flag-driven SDK init runs, so the auto-instrumentation's require-hook can't patch it.

Testing
- Harness spec verifying that enabling `--otel-telemetry` does not change served responses.
- Run against a `grafana/otel-lgtm` collector (also added as a runnable demo): traces visible in Tempo, request + Node VM metrics in Prometheus, and `trace_id` in logs, over both HTTP and gRPC, single- and multi-process.

Checklist