feat(openfeature): emit server-side EVP flagevaluation #11639
Draft
leoromanovsky wants to merge 8 commits into
…EVP writer
- Tests for identical-event aggregation (count 2, first<=last min/max)
- Test for type-tagged canonical key distinguishing int vs string
- Tests for global/degraded cap overflow and drop-counted overflow
- Test for absent variant -> runtimeDefaultUsed
- Tests for 256-field/256-char context pruning
- Tests for flush posting to 'flagevaluations' with required JSON fields
…nWriterImpl two-tier EVP writer
- FlagEvalEvent (bootstrap): lightweight data record for hook->writer channel
- FlagEvaluationWriter (bootstrap): interface with enqueue/start/close
- FlagEvaluationWriterImpl (lib): two-tier aggregation with frozen contract constants
- GLOBAL_CAP=131072 / PER_FLAG_CAP=10000 / DEGRADED_CAP=32768
- Canonical context key: sorted, type-tagged, length-delimited (no hash; reviewer concern #3)
- Context pruning: <=256 fields, string values <=256 chars (reviewer concern #1)
- Min/max first/last eval time (reviewer concern #4)
- Absent variant -> runtimeDefaultUsed (reviewer concern #5)
- Drop-counted overflow (reviewer concern #8)
- Posts to 'flagevaluations' via BackendApiFactory(EVENT_PLATFORM)
- Flush interval 10s (differs from exposure 1s)
- Add FEATURE_FLAG_EVALUATION_PROCESSOR thread to AgentThreadFactory
- Register/deregister writer with FeatureFlaggingGateway
- FlagEvaluationWriterImplTest: all 10 unit tests GREEN
- Test: enqueue called with flagKey, variant, reason (lowercased), allocationKey
- Test: evalTimeMs from dd.eval.timestamp_ms metadata (reviewer concern #4)
- Test: evalTimeMs fallback to hook-fire time when metadata absent
- Test: null value -> null variant (runtime default; reviewer concern #5)
- Test: only enqueue called, no inline aggregation (reviewer concern #7)
- Test: writer=null is no-op (killswitch off)
- Test: targetingKey extracted from evaluation context
- FlagEvalEVPHook: Hook<T> with finallyAfter() doing cheap capture only
- Reads allocationKey from metadata.getString('allocationKey')
- Reads eval-time from metadata.getLong('dd.eval.timestamp_ms'), fallback to hook-fire time
- Null value -> null variant (runtime default; reviewer concern #5)
- Lowercases reason string
- Resolves writer lazily from FeatureFlaggingGateway (test: injected directly)
- No aggregation on hook thread (reviewer concern #7)
- DDEvaluator: stamp dd.eval.timestamp_ms in flag metadata at resolution point
- Provider.getProviderHooks(): returns [OTel FlagEvalHook, EVP FlagEvalEVPHook]
- OTel path preserved byte-for-byte (PRES-01 non-regression)
- FeatureFlaggingSystem: create + start FlagEvaluationWriterImpl behind killswitch
- DD_FLAGGING_EVALUATION_COUNTS_ENABLED=false disables EVP path only
- Default: enabled (EVP path on)
- ProviderTest: updated to expect 2 hooks (OTel + EVP)
- FlagEvalEVPHookTest: all 8 unit tests GREEN
- Apply project-required code style (google-java-format) to all new files in this plan before submitting
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.
… pruning, shutdown drain, and schema validation
- variant sourced from details.getVariant() (not the evaluated value)
- error object captured from evaluation details, schema {"message":...}
- killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED via the tracer config system
- deterministic context pruning (sort before cut), stored pruned attrs
- observable queue-overflow drop counter; close() drains and final-flushes
- JMH hot-path benchmark; structural validation against vendored worker schema
Motivation
Server-side flag evaluations are currently invisible in the Datadog EVP `flagevaluation` track for Java. This PR makes `dd-trace-java` emit schema-conformant, aggregated EVP `flagevaluation` payloads, mirroring the Go reference implementation (#4886) across the cross-SDK fan-out, while leaving the existing OTel `feature_flag.evaluations` behavior unchanged (no regression). It builds on the proof-of-concept ideas from #4874 and resolves that PoC's 8 reviewer concerns.
What we learned
Recording `flagevaluation` is not like emitting the existing OTel metric. The metric is fire-and-forget; `flagevaluation` must aggregate per evaluation: bucket by flag + variant + reason + allocation + context, which means prune (256/256) → canonical-context key → update the tiered maps on every evaluation. The OpenFeature Java SDK runs hooks synchronously on the caller's evaluation thread (`finallyAfter` fires inline), so doing that aggregation inside the hook would charge the user's flag evaluation directly. The work is structural (the aggregation is the feature), so the fix is architectural, not a hot-loop tweak.
Flowchart (left to right): flagevaluation hook needs per-eval aggregation (prune / canonical key / bucket) → OpenFeature runs hooks synchronously on the evaluation thread (`finallyAfter`) → inline aggregation would charge the user's evaluation directly → resolution: async, with a cheap snapshot in the hook and aggregation on a background agent thread.
- `flagevaluation` aggregates counts per bucket on every evaluation; that is the feature and cannot be optimized away.
- The `finallyAfter` stage runs on the caller's evaluation thread, so any cost it incurs lands on the user's `client.getBooleanValue(...)`.
- The hook takes only a cheap snapshot (`details.getVariant()`, reason, error, targeting key) plus a non-blocking bounded `offer()`, and a single background agent thread owns prune/key/aggregate/flush. Measured hot-path cost stays flat (see Performance).
Transferable lesson for the fan-out: telemetry that aggregates per evaluation belongs off the synchronous hook path. Each SDK should verify its runtime's hook-execution model and offload aggregation to a worker accordingly.
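To make the hand-off concrete, here is a minimal sketch of the "cheap snapshot plus non-blocking bounded `offer()`" pattern. It is not the code in this PR: the class and method names are hypothetical, and it substitutes the JDK's `ArrayBlockingQueue` for the `MessagePassingBlockingQueue` the PR uses. The worker side is shown as a single drain call; a real worker would run it in a loop on a background thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; names are illustrative, not the dd-trace-java classes.
final class AsyncHandOffSketch {
  /** Immutable scalar snapshot captured on the evaluation thread. */
  static final class Snapshot {
    final String flagKey;
    final String variant;
    final String reason;
    final long evalTimeMs;

    Snapshot(String flagKey, String variant, String reason, long evalTimeMs) {
      this.flagKey = flagKey;
      this.variant = variant;
      this.reason = reason;
      this.evalTimeMs = evalTimeMs;
    }
  }

  private final BlockingQueue<Snapshot> queue;
  private final AtomicLong droppedQueueOverflow = new AtomicLong();
  private final AtomicLong aggregated = new AtomicLong();

  AsyncHandOffSketch(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Hot path, called from the hook: never blocks; drops and counts when the queue is full. */
  void capture(String flagKey, String variant, String reason, long evalTimeMs) {
    if (!queue.offer(new Snapshot(flagKey, variant, reason, evalTimeMs))) {
      droppedQueueOverflow.incrementAndGet();
    }
  }

  /** Worker side: drains everything currently queued and returns how many events it handled. */
  int drainOnce() {
    int handled = 0;
    while (queue.poll() != null) {
      aggregated.incrementAndGet(); // stand-in for prune / canonical key / bucket update
      handled++;
    }
    return handled;
  }

  long dropped() {
    return droppedQueueOverflow.get();
  }

  long aggregatedCount() {
    return aggregated.get();
  }
}
```

With a capacity of 2, a third `capture(...)` before any drain is dropped and counted instead of blocking the caller, which is the behavior the lesson above asks each SDK to preserve.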
Design Decisions
- `FlagEvaluationWriterImpl` posts to `/evp_proxy/v2/api/v2/flagevaluations` through the `BackendApiFactory(EVENT_PLATFORM)` channel (the direct analog of `ExposureWriterImpl`), adding only a new aggregation layer on a ~10s flush timer (plus a final drain-and-flush on shutdown). Emission is aggregated counts per bucket, matching the worker's `evaluation_count` / `first_evaluation` / `last_evaluation` schema.
- The `finallyAfter` hook does only cheap scalar extraction plus a non-blocking `offer()` onto a bounded `MessagePassingBlockingQueue` (capacity `2^16`); a single background agent thread owns prune/key/aggregate/flush. No thread-per-evaluation.
- The scalar bucket fields (`flag`, `variant`, `allocationKey`, `reason`, `targetingKey`) are exact strings. The pruned context is encoded into a comparable canonical (type-tagged, length-delimited) string that becomes part of the map key, so distinct contexts always land in distinct buckets: no FNV digest, no collisions, no misattribution. Context pruning is deterministic (sort before cut): keys are sorted first, then the first 256 non-oversized values are kept, and the pruned attributes (not the raw attrs) are what gets aggregated and serialized.
- Two tiers, then drop: full (`globalCap=131072`, `perFlagCap=10000`/flag; context pruned 256/256) → degraded (`degradedCap=32768`; drops targeting key + context) → drop (counted). No third "ultra-degraded" tier.
- The variant comes from `details.getVariant()`, not the evaluated value. The variant is the OpenFeature variant key (same source as the OTel `FlagEvalHook`); an absent variant means a runtime default (`runtime_default_used=true`), detected from the missing variant rather than the reason.
- A single `finallyAfter` hook covers the success/error/default paths; the existing OTel metrics hook (`FlagEvalHook`/`FlagEvalMetrics`) is unchanged. The error path serializes the error message (falling back to the error code name) as the schema `{"message": ...}` object.
- Eval time is read from the `dd.eval.timestamp_ms` metadata, falling back to hook-fire time when absent; `first_evaluation` / `last_evaluation` are the min/max across events in a bucket (no wall-clock assumptions).
- `DD_FLAGGING_EVALUATION_COUNTS_ENABLED` (default on) gates only the EVP path, read through the tracer config system (`FeatureFlaggingConfig.FLAGGING_EVALUATION_COUNTS_ENABLED`) and registered in `supported-configurations.json` (config-inversion gate passes).
High-load behavior: when do we drop?
Two layers, two policies:
1. Aggregator (worker-side) — counts are preserved; only dimensional fidelity degrades. Every evaluation the worker processes contributes exactly one count to some tier (or the bounded drop counter).
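The "exactly one count to some tier" rule can be sketched as follows. This is an illustration under assumptions, not the PR's `FlagEvaluationWriterImpl`: the class, method, and key names are hypothetical, keys are plain strings, and only the cap values quoted in this description (131072 / 10000 / 32768) come from the source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of full -> degraded -> drop routing; not the PR's implementation.
final class TieredAggregatorSketch {
  private final int globalCap;
  private final int perFlagCap;
  private final int degradedCap;
  private final Map<String, Long> full = new HashMap<>();
  private final Map<String, Integer> fullBucketsPerFlag = new HashMap<>();
  private final Map<String, Long> degraded = new HashMap<>();
  private long droppedDegradedOverflow;

  TieredAggregatorSketch(int globalCap, int perFlagCap, int degradedCap) {
    this.globalCap = globalCap;
    this.perFlagCap = perFlagCap;
    this.degradedCap = degradedCap;
  }

  /** Caps as quoted in this PR description. */
  static TieredAggregatorSketch withQuotedCaps() {
    return new TieredAggregatorSketch(131072, 10000, 32768);
  }

  /**
   * Adds one count and returns the tier that absorbed it: "full", "degraded" or "dropped".
   * fullKey is assumed to be degradedKey plus targeting key and canonical context.
   */
  String record(String flag, String degradedKey, String fullKey) {
    if (full.containsKey(fullKey)) {
      full.merge(fullKey, 1L, Long::sum); // existing full bucket: always accepted
      return "full";
    }
    int bucketsForFlag = fullBucketsPerFlag.getOrDefault(flag, 0);
    if (full.size() < globalCap && bucketsForFlag < perFlagCap) {
      full.put(fullKey, 1L); // room for a new full bucket
      fullBucketsPerFlag.put(flag, bucketsForFlag + 1);
      return "full";
    }
    if (degraded.containsKey(degradedKey) || degraded.size() < degradedCap) {
      degraded.merge(degradedKey, 1L, Long::sum); // fidelity lost, count kept
      return "degraded";
    }
    droppedDegradedOverflow++; // count is lost, but the loss itself is counted
    return "dropped";
  }

  long droppedDegradedOverflow() {
    return droppedDegradedOverflow;
  }
}
```

Every call lands in exactly one branch, so the sum of both tiers plus the drop counter always equals the number of events the worker processed.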
- Full tier: capped at `globalCap=131,072` total and `perFlagCap=10,000`/flag.
- Degraded tier: capped at `degradedCap=32,768`; entered when a flag reaches `perFlagCap` distinct full buckets, or `globalCap` is full.
- Drop (counted): when `degradedCap` is full.
The degraded key is exactly the OTel `feature_flag.evaluations` cardinality (flag × variant × allocation × reason). Over-cap counts increment an observable `droppedDegradedOverflow` counter (surfaced as a warning on flush) rather than cascading into a lossy third tier.
2. Async hand-off (hot-path): explicit best-effort backpressure. The hook offers onto a bounded queue (capacity `2^16`). If the worker can't keep up and the queue fills, the hook drops the event and increments an observable `droppedQueueOverflow` counter (logged each flush) rather than blocking the evaluation. Never silent, never blocking.
Performance
Recording is asynchronous, so the number that matters is the hot-path cost the evaluation actually pays in the synchronous `finallyAfter` hook. The JMH benchmark `FlagEvaluationHotPathBenchmark` isolates the eval-thread snapshot+enqueue stage from the off-thread worker aggregation stage:
- `evalThreadCapture` (snapshot + non-blocking enqueue; what the evaluation thread pays): ~5.6 ns
- `workerAggregate` (prune + canonical key + two-tier aggregation; off the eval thread): ~736 ns
The expensive aggregation work (~736 ns) runs on the background agent thread and never blocks the user's evaluation; the evaluation thread pays only the flat ~5.6 ns snapshot+enqueue. Run: `./gradlew :products:feature-flagging:feature-flagging-lib:jmh`.
Validation
Proven locally against the `ffe-dogfooding` mock-intake (`app-java`, port 8084):
- EVP (`GET /logs/api/v2/flagevaluations`): baseline `count == 0`, then `count > 0` after evaluations with the expected `flag.key` / `variant.key` / `service`. Validated with `WAIT_SECONDS=75` because Java's OTel export interval is ~60s; the default 15s window would false-fail the OTel non-regression check.
- OTel non-regression: `feature_flag.evaluations` is still observed after the change; the existing `FlagEvalHook`/`FlagEvalMetrics` path is unchanged.
- `app-java` reaches `PROVIDER_READY`.
- `spotlessCheck` and `checkConfigurations` (config registry) pass.
Resolution of the 8 PoC (#4874) reviewer concerns
- `flageval-worker` schema: structural validation (`FlagEvalSchema`) over the vendored `batchedflagevaluations.json`; both tiers conform via optional-field omission.
- `{"key":...}` / `{"message":...}` object shapes.
- `first/last_evaluation` via min/max: taken from the `dd.eval.timestamp_ms` metadata, no wall-clock assumption.
- Runtime default: detected from the absent variant (`details.getVariant() == null`), not from the reason; sets `runtime_default_used=true`.
- Hot-path cost: `FlagEvaluationHotPathBenchmark` separates eval-thread snapshot (~5.6 ns) from off-thread aggregation (~736 ns); async keeps the eval path bounded.
- Single `finallyAfter` hook covering success/error/default; error message captured (falls back to error code name).
- Bounded caps (global `131072` / per-flag `10000` / degraded `32768`) + observable drop-and-count beyond degraded + bounded async-queue backpressure drop-and-count.
🚧 Draft: open for review ahead of the cross-SDK fan-out.
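Several of the resolutions above rest on the deterministic pruning and the hash-free canonical context key. A minimal sketch of that idea, assuming non-null attribute keys and assuming oversized strings are skipped rather than truncated, might look like this. The class name, tag characters, and encoding layout are hypothetical; only the 256/256 limits and the "sorted, type-tagged, length-delimited" properties come from this PR.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the tag letters and layout are illustrative, not the PR's wire format.
final class CanonicalContextKeySketch {
  static final int MAX_FIELDS = 256;
  static final int MAX_STRING_CHARS = 256;

  /** Sort before cut: iterate keys in sorted order, skip oversized strings, keep at most MAX_FIELDS. */
  static Map<String, Object> prune(Map<String, Object> attrs) {
    Map<String, Object> kept = new TreeMap<>();
    for (Map.Entry<String, Object> e : new TreeMap<>(attrs).entrySet()) {
      if (kept.size() >= MAX_FIELDS) {
        break;
      }
      Object v = e.getValue();
      if (v instanceof String && ((String) v).length() > MAX_STRING_CHARS) {
        continue; // assumption: oversized values are dropped, not truncated
      }
      kept.put(e.getKey(), v);
    }
    return kept;
  }

  /** Encodes sorted entries as tag + length + ':' + text, so no separator can be ambiguous. */
  static String canonicalKey(Map<String, Object> pruned) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Object> e : new TreeMap<>(pruned).entrySet()) {
      append(sb, 'k', e.getKey());
      Object v = e.getValue();
      char tag = v instanceof String ? 's' : v instanceof Boolean ? 'b' : v instanceof Number ? 'n' : 'o';
      append(sb, tag, String.valueOf(v));
    }
    return sb.toString();
  }

  private static void append(StringBuilder sb, char tag, String text) {
    sb.append(tag).append(text.length()).append(':').append(text);
  }
}
```

Because the key is the full encoded string rather than a digest, two different contexts cannot collide, and the type tag keeps the integer `1` apart from the string `"1"`.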