feat(openfeature): emit server-side EVP flagevaluation by leoromanovsky · Pull Request #11639 · DataDog/dd-trace-java

leoromanovsky · 2026-06-12T21:44:59Z

Motivation

Server-side flag evaluations are currently invisible in the Datadog EVP flagevaluation track for Java. This PR makes dd-trace-java emit schema-conformant, aggregated EVP flagevaluation payloads, mirroring the Go reference implementation (#4886) as part of the cross-SDK fan-out, while leaving the existing OTel feature_flag.evaluations behavior unchanged (no regression). It builds on the proof-of-concept (PoC) ideas from #4874 and resolves the 8 reviewer concerns raised on that PoC.

What we learned

Recording flagevaluation is not like emitting the existing OTel metric. The metric is fire-and-forget; flagevaluation must aggregate on every evaluation, bucketing by flag + variant + reason + allocation + context. Each evaluation therefore has to prune the context (at most 256 fields, string values at most 256 chars), build the canonical-context key, and update the tiered maps. The OpenFeature Java SDK runs hooks synchronously on the caller's evaluation thread (finallyAfter fires inline), so doing that aggregation inside the hook would charge its cost directly to the user's flag evaluation. The work is structural (the aggregation is the feature), so the fix is architectural, not a hot-loop tweak.

flowchart LR
    A["flagevaluation hook needs per-eval aggregation<br/>(prune / canonical key / bucket)"] --> B["OpenFeature runs hooks synchronously<br/>on the evaluation thread (finallyAfter)"]
    B --> C["Inline aggregation would charge<br/>the user's evaluation directly"]
    C --> D["Resolution: async — cheap snapshot in the hook,<br/>aggregate on a background agent thread"]

The work is inherent. flagevaluation aggregates counts per bucket on every evaluation; that is the feature and cannot be optimized away.
Hooks are synchronous. The OpenFeature finallyAfter stage runs on the caller's evaluation thread — any cost it incurs lands on the user's client.getBooleanValue(...).
Resolution. The hook does only a bounded, cheap snapshot — scalar extraction (details.getVariant(), reason, error, targeting key) plus a non-blocking bounded offer() — and a single background agent thread owns prune/key/aggregate/flush. Measured hot-path cost stays flat (see Performance).

Transferable lesson for the fan-out: telemetry that aggregates per evaluation belongs off the synchronous hook path. Each SDK should verify its runtime's hook-execution model and offload aggregation to a worker accordingly.
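The hand-off described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `EvalSnapshot` and `SnapshotChannel` are hypothetical names, and `java.util.concurrent.ArrayBlockingQueue` stands in for the bounded `MessagePassingBlockingQueue` so the example needs only the JDK.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Immutable snapshot of the scalars captured on the evaluation thread (hypothetical shape). */
final class EvalSnapshot {
  final String flagKey;
  final String variant; // null means no variant was reported (runtime default)
  final String reason;
  final String targetingKey;
  final long evalTimeMs;

  EvalSnapshot(
      String flagKey, String variant, String reason, String targetingKey, long evalTimeMs) {
    this.flagKey = flagKey;
    this.variant = variant;
    this.reason = reason;
    this.targetingKey = targetingKey;
    this.evalTimeMs = evalTimeMs;
  }
}

/** Hot-path side of the hand-off: bounded, non-blocking, drop-and-count on overflow. */
final class SnapshotChannel {
  // Assumption: a JDK queue stands in for the bounded queue used by the tracer.
  private final BlockingQueue<EvalSnapshot> queue;
  private final AtomicLong droppedQueueOverflow = new AtomicLong();

  SnapshotChannel(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Called from the hook. Never blocks; returns false when the event was dropped. */
  boolean offer(EvalSnapshot snapshot) {
    if (queue.offer(snapshot)) {
      return true;
    }
    droppedQueueOverflow.incrementAndGet();
    return false;
  }

  /** Called from the single worker thread; returns null when the queue is empty. */
  EvalSnapshot poll() {
    return queue.poll();
  }

  long droppedQueueOverflow() {
    return droppedQueueOverflow.get();
  }
}
```

In this sketch the hook would build an `EvalSnapshot` and call `offer(...)`; pruning, keying, aggregation, and flushing would all happen on whichever thread calls `poll()`.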

Design Decisions

Reuse the existing EVP transport used by the exposure path. FlagEvaluationWriterImpl posts to /evp_proxy/v2/api/v2/flagevaluations through the BackendApiFactory(EVENT_PLATFORM) channel — the direct analog of ExposureWriterImpl — adding only a new aggregation layer on a ~10s flush timer (plus a final drain-and-flush on shutdown). Emission is aggregated counts per bucket, matching the worker's evaluation_count / first_evaluation / last_evaluation schema.
Asynchronous, best-effort recording (the headline design, per What we learned). The finallyAfter hook does only cheap scalar extraction + a non-blocking offer() onto a bounded MessagePassingBlockingQueue (capacity 2^16); a single background agent thread owns prune/key/aggregate/flush. No thread-per-evaluation.
Bucket identity = comparable canonical-context key, NOT a hash. The enumerable dimensions (flag, variant, allocationKey, reason, targetingKey) are exact strings. The pruned context is encoded into a comparable canonical (type-tagged, length-delimited) string that becomes part of the map key, so distinct contexts always land in distinct buckets — no FNV digest, no collisions, no misattribution. Context pruning is deterministic (sort before cut): keys are sorted first, then the first 256 non-oversized values are kept, and the pruned attributes (not the raw attrs) are what gets aggregated and serialized.
Two-tier degradation with explicit caps (worker-side): full (globalCap=131072, perFlagCap=10000/flag; context pruned 256/256) → degraded (degradedCap=32768; drops targeting key + context) → drop (counted). No third "ultra-degraded" tier.
Variant from details.getVariant(), not the evaluated value. The variant is the OpenFeature variant key (same source as the OTel FlagEvalHook); an absent variant means a runtime default (runtime_default_used=true), detected from the missing variant rather than the reason.
Separate finallyAfter hook covering success/error/default paths; the existing OTel metrics hook (FlagEvalHook / FlagEvalMetrics) is unchanged. The error path serializes the error message (falling back to the error code name) as the schema {"message": ...} object.
Eval timestamp is stamped at eval-entry time via flag metadata key dd.eval.timestamp_ms, falling back to hook-fire time when absent; first_evaluation/last_evaluation are the min/max across events in a bucket (no wall-clock assumptions).
Killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED (default on) gates only the EVP path, read through the tracer config system (FeatureFlaggingConfig.FLAGGING_EVALUATION_COUNTS_ENABLED) and registered in supported-configurations.json (config-inversion gate passes).
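To make the pruning and bucket-identity decisions concrete, here is a small sketch of sort-before-cut pruning and a type-tagged, length-delimited key. The class name `CanonicalContext`, the tag letters, and the `tag + length + ':' + text` layout are assumptions chosen for illustration; the encoding actually used by FlagEvaluationWriterImpl may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class CanonicalContext {
  static final int MAX_FIELDS = 256;
  static final int MAX_STRING_CHARS = 256;

  /**
   * Deterministic prune: sort keys first, then keep the first MAX_FIELDS values that are not
   * oversized strings. Assumes non-null attribute keys.
   */
  static Map<String, Object> prune(Map<String, Object> attrs) {
    Map<String, Object> pruned = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : new TreeMap<>(attrs).entrySet()) {
      if (pruned.size() == MAX_FIELDS) {
        break;
      }
      Object v = e.getValue();
      if (v instanceof String && ((String) v).length() > MAX_STRING_CHARS) {
        continue; // oversized strings are skipped, not truncated
      }
      pruned.put(e.getKey(), v);
    }
    return pruned;
  }

  /** Comparable key: each field is written as tag + length + ':' + text, in sorted key order. */
  static String canonicalKey(Map<String, Object> pruned) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Object> e : new TreeMap<>(pruned).entrySet()) {
      appendField(sb, 'k', e.getKey());
      Object v = e.getValue();
      appendField(sb, typeTag(v), String.valueOf(v));
    }
    return sb.toString();
  }

  // Hypothetical tag letters; 'o' falls back to toString(), which is only as unique as the type.
  private static char typeTag(Object v) {
    if (v == null) return 'n';
    if (v instanceof String) return 's';
    if (v instanceof Boolean) return 'b';
    if (v instanceof Integer || v instanceof Long) return 'i';
    if (v instanceof Number) return 'd';
    return 'o';
  }

  private static void appendField(StringBuilder sb, char tag, String text) {
    sb.append(tag).append(text.length()).append(':').append(text);
  }
}
```

Under this illustrative encoding, a context of `{"age": 42}` yields `k3:agei2:42` while `{"age": "42"}` yields `k3:ages2:42`, so an integer and a string with the same text land in different buckets without any hashing.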

Notes for reviewers: the payload schema check in FlagEvaluationWriterImplTest uses a small structural validator (FlagEvalSchema) — asserting required fields, integral types, {"key":...} / {"message":...} object shapes, and additionalProperties:false — rather than a JSON-schema library, because there is no json-schema validator on the test classpath. It validates against the real vendored worker schema (src/test/resources/flagevaluation/batchedflagevaluations.json). A reviewer may prefer adding networknt:json-schema-validator as a testImplementation dependency to validate directly against the schema file instead.

High-load behavior — when do we drop?

Two layers, two policies:

1. Aggregator (worker-side) — counts are preserved; only dimensional fidelity degrades. Every evaluation the worker processes contributes exactly one count to some tier (or the bounded drop counter).

| Tier | Key dimensions | Bound | Entered when |
| --- | --- | --- | --- |
| Full | flag, variant, allocationKey, reason, targetingKey, canonical context (+ pruned context payload) | `globalCap=131,072` total and `perFlagCap=10,000`/flag | default; context pruned 256/256 |
| Degraded | flag, variant, allocationKey, reason | `degradedCap=32,768` | a flag reaches `perFlagCap` distinct full buckets, or `globalCap` is full |
| Drop (counted) | (none) | n/a (a single observable counter) | `degradedCap` is full |

The degraded key is exactly the OTel feature_flag.evaluations cardinality (flag × variant × allocation × reason). Over-cap counts increment an observable droppedDegradedOverflow counter (surfaced as a warning on flush) rather than cascading into a lossy third tier.
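The tier selection can be illustrated with a short sketch. It assumes the caller has already built the full and degraded key strings, and the names `TwoTierAggregator` and `Bucket` are hypothetical; it shows the cap checks and the min/max timestamps, not the data structures used in this PR.

```java
import java.util.HashMap;
import java.util.Map;

final class TwoTierAggregator {
  /** Per-bucket state: evaluation count plus min/max evaluation time. */
  static final class Bucket {
    long count;
    long firstEvalMs = Long.MAX_VALUE;
    long lastEvalMs = Long.MIN_VALUE;

    void record(long evalTimeMs) {
      count++;
      firstEvalMs = Math.min(firstEvalMs, evalTimeMs);
      lastEvalMs = Math.max(lastEvalMs, evalTimeMs);
    }
  }

  private final int globalCap;
  private final int perFlagCap;
  private final int degradedCap;
  private final Map<String, Bucket> full = new HashMap<>();
  private final Map<String, Integer> fullBucketsPerFlag = new HashMap<>();
  private final Map<String, Bucket> degraded = new HashMap<>();
  private long droppedDegradedOverflow;

  TwoTierAggregator(int globalCap, int perFlagCap, int degradedCap) {
    this.globalCap = globalCap;
    this.perFlagCap = perFlagCap;
    this.degradedCap = degradedCap;
  }

  /** Single-threaded by design: only the worker thread is expected to call this. */
  void record(String flagKey, String degradedKey, String fullKey, long evalTimeMs) {
    Bucket bucket = full.get(fullKey);
    if (bucket == null) {
      int perFlag = fullBucketsPerFlag.getOrDefault(flagKey, 0);
      if (full.size() < globalCap && perFlag < perFlagCap) {
        bucket = new Bucket();
        full.put(fullKey, bucket);
        fullBucketsPerFlag.put(flagKey, perFlag + 1);
      }
    }
    if (bucket == null) { // full tier is capped: fall back to the degraded key
      bucket = degraded.get(degradedKey);
      if (bucket == null && degraded.size() < degradedCap) {
        bucket = new Bucket();
        degraded.put(degradedKey, bucket);
      }
    }
    if (bucket == null) { // degraded tier is capped too: count the drop
      droppedDegradedOverflow++;
      return;
    }
    bucket.record(evalTimeMs);
  }

  Bucket fullBucket(String fullKey) {
    return full.get(fullKey);
  }

  Bucket degradedBucket(String degradedKey) {
    return degraded.get(degradedKey);
  }

  long droppedDegradedOverflow() {
    return droppedDegradedOverflow;
  }
}
```

Every call either increments exactly one bucket or the drop counter, which is the "counts are preserved, only dimensional fidelity degrades" property described above.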

2. Async hand-off (hot-path) — explicit best-effort backpressure. The hook offers onto a bounded queue (capacity 2^16). If the worker can't keep up and the queue fills, the hook drops the event and increments an observable droppedQueueOverflow counter (logged each flush) rather than blocking the evaluation. Never silent, never blocking.

Performance

Recording is asynchronous, so the number that matters is the hot-path cost the evaluation actually pays in the synchronous finallyAfter hook. The JMH benchmark FlagEvaluationHotPathBenchmark isolates the eval-thread snapshot+enqueue stage from the off-thread worker aggregation stage:

| Stage | Cost |
| --- | --- |
| `evalThreadCapture` (snapshot + non-blocking enqueue; what the evaluation thread pays) | ~5.6 ns/op |
| `workerAggregate` (prune + canonical key + two-tier aggregation; off the eval thread) | ~736 ns/op |

The expensive aggregation work (~736 ns) runs on the background agent thread and never blocks the user's evaluation; the evaluation thread pays only the flat ~5.6 ns snapshot+enqueue. Run: ./gradlew :products:feature-flagging:feature-flagging-lib:jmh.

Validation

Proven locally against ffe-dogfooding mock-intake (app-java, port 8084):

Negative control → green (GET /logs/api/v2/flagevaluations): baseline count == 0, then count > 0 after evaluations with the expected flag.key / variant.key / service. Validated with WAIT_SECONDS=75 because Java's OTel export interval is ~60s — the default 15s window would false-fail the OTel non-regression check.
OTel non-regression: feature_flag.evaluations still observed after the change; the existing FlagEvalHook / FlagEvalMetrics path is unchanged.
Provider readiness: app-java reaches PROVIDER_READY.
Unit tests green: feature-flagging-api 28/0, feature-flagging-lib 16/0; spotlessCheck and checkConfigurations (config registry) pass.

Resolution of the 8 PoC (#4874) reviewer concerns

| # | Concern | Addressed |
| --- | --- | --- |
| 1 | Context bounds before buffering | 256-field / 256-char prune, deterministic sort-before-cut; oversized strings skipped; pruned view is what's aggregated/serialized |
| 2 | Tiers validated vs `flageval-worker` schema | Structural validator (`FlagEvalSchema`) over the vendored `batchedflagevaluations.json`; both tiers conform via optional-field omission |
| 3 | Bucket identity not FNV-1a-alone | Comparable canonical-context string key (type-tagged, length-delimited) as part of the map key; no digest, no collisions; payload uses `{"key":...}` / `{"message":...}` object shapes |
| 4 | `first/last_evaluation` via min/max | min/max across events in a bucket; eval time from `dd.eval.timestamp_ms` metadata, no wall-clock assumption |
| 5 | Runtime-default from absence of variant | Detected from missing variant (`details.getVariant() == null`), not from the reason; sets `runtime_default_used=true` |
| 6 | Benchmark vs base eval cost | JMH `FlagEvaluationHotPathBenchmark` separates eval-thread snapshot (~5.6 ns) from off-thread aggregation (~736 ns); async keeps the eval path bounded |
| 7 | Hook counts error/default paths | Separate `finallyAfter` hook covering success/error/default; error message captured (falls back to error code name) |
| 8 | Explicit bounds on degraded/overflow | Per-tier caps (full `131072` / per-flag `10000` / degraded `32768`) + observable drop-and-count beyond degraded + bounded async-queue backpressure drop-and-count |

🚧 Draft — open for review ahead of the cross-SDK fan-out.

…EVP writer
- Tests for identical-event aggregation (count 2, first<=last min/max)
- Test for type-tagged canonical key distinguishing int vs string
- Tests for global/degraded cap overflow and drop-counted overflow
- Test for absent variant -> runtimeDefaultUsed
- Tests for 256-field/256-char context pruning
- Tests for flush posting to 'flagevaluations' with required JSON fields

…nWriterImpl two-tier EVP writer
- FlagEvalEvent (bootstrap): lightweight data record for hook->writer channel
- FlagEvaluationWriter (bootstrap): interface with enqueue/start/close
- FlagEvaluationWriterImpl (lib): two-tier aggregation with frozen contract constants
- GLOBAL_CAP=131072 / PER_FLAG_CAP=10000 / DEGRADED_CAP=32768
- Canonical context key: sorted, type-tagged, length-delimited (no hash; reviewer concern #3)
- Context pruning: <=256 fields, string values <=256 chars (reviewer concern #1)
- Min/max first/last eval time (reviewer concern #4)
- Absent variant -> runtimeDefaultUsed (reviewer concern #5)
- Drop-counted overflow (reviewer concern #8)
- Posts to 'flagevaluations' via BackendApiFactory(EVENT_PLATFORM)
- Flush interval 10s (differs from exposure 1s)
- Add FEATURE_FLAG_EVALUATION_PROCESSOR thread to AgentThreadFactory
- Register/deregister writer with FeatureFlaggingGateway
- FlagEvaluationWriterImplTest: all 10 unit tests GREEN

- Test: enqueue called with flagKey, variant, reason (lowercased), allocationKey
- Test: evalTimeMs from dd.eval.timestamp_ms metadata (reviewer concern #4)
- Test: evalTimeMs fallback to hook-fire time when metadata absent
- Test: null value -> null variant (runtime default; reviewer concern #5)
- Test: only enqueue called, no inline aggregation (reviewer concern #7)
- Test: writer=null is no-op (killswitch off)
- Test: targetingKey extracted from evaluation context

- FlagEvalEVPHook: Hook<T> with finallyAfter() doing cheap capture only
- Reads allocationKey from metadata.getString('allocationKey')
- Reads eval-time from metadata.getLong('dd.eval.timestamp_ms'), fallback to hook-fire time
- Null value -> null variant (runtime default; reviewer concern #5)
- Lowercases reason string
- Resolves writer lazily from FeatureFlaggingGateway (test: injected directly)
- No aggregation on hook thread (reviewer concern #7)
- DDEvaluator: stamp dd.eval.timestamp_ms in flag metadata at resolution point
- Provider.getProviderHooks(): returns [OTel FlagEvalHook, EVP FlagEvalEVPHook]
- OTel path preserved byte-for-byte (PRES-01 non-regression)
- FeatureFlaggingSystem: create + start FlagEvaluationWriterImpl behind killswitch
- DD_FLAGGING_EVALUATION_COUNTS_ENABLED=false disables EVP path only
- Default: enabled (EVP path on)
- ProviderTest: updated to expect 2 hooks (OTel + EVP)
- FlagEvalEVPHookTest: all 8 unit tests GREEN

datadog-datadog-prod-us1-2 · 2026-06-12T21:47:23Z

⚠️ Warnings

🚦 2 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | check_base

DataDog/apm-reliability/dd-trace-java | test_smoke: [17, 2/8]

This comment will be updated automatically if new data arrives.

Commit SHA: c8c95a5

dd-octo-sts · 2026-06-12T22:06:16Z

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
| --- | --- |
| Startup | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

| Scenario | Candidate | master | Δ (95% CI of mean) |
| --- | --- | --- | --- |
| startup:insecure-bank:iast:Agent | 13.95 s | 13.93 s | [-0.5%; +0.8%] (no difference) |
| startup:insecure-bank:tracing:Agent | 12.82 s | 13.01 s | [-2.2%; -0.7%] (maybe better) |
| startup:petclinic:appsec:Agent | 16.86 s | 16.53 s | [+1.0%; +2.9%] (maybe worse) |
| startup:petclinic:iast:Agent | 16.87 s | 16.82 s | [-0.6%; +1.1%] (no difference) |
| startup:petclinic:profiling:Agent | 16.74 s | 16.86 s | [-1.5%; +0.1%] (no difference) |
| startup:petclinic:sca:Agent | 16.84 s | 16.47 s | [+1.4%; +3.1%] (significantly worse) |
| startup:petclinic:tracing:Agent | 16.01 s | 15.52 s | [-1.2%; +7.4%] (no difference) |

Commit: c8c95a56 · CI Pipeline · Benchmarking Platform UI

Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

… pruning, shutdown drain, and schema validation
- variant sourced from details.getVariant() (not the evaluated value)
- error object captured from evaluation details, schema {"message":...}
- killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED via the tracer config system
- deterministic context pruning (sort before cut), stored pruned attrs
- observable queue-overflow drop counter; close() drains and final-flushes
- JMH hot-path benchmark; structural validation against vendored worker schema

leoromanovsky added 5 commits June 12, 2026 15:31

style(02-05): apply google-java-format via spotlessApply

fcc884e

- Apply project-required code style (google-java-format) to all new files in this plan before submitting

chore(openfeature): remove internal review annotations from EVP flage…

78bbaba

…valuation code

leoromanovsky added 2 commits June 13, 2026 11:27

chore(openfeature): remove internal planning annotations

c8c95a5

leoromanovsky changed the title ~~[FFL-2446] dd-trace-java: emit EVP flagevaluation (Phase 2 fan-out)~~ feat(openfeature): emit server-side EVP flagevaluation Jun 14, 2026

leoromanovsky wants to merge 8 commits into master from leo.romanovsky/ffl-2446-evp-flagevaluation-java
