feat(openfeature): emit server-side EVP flagevaluation #11639

Draft
leoromanovsky wants to merge 8 commits into master from leo.romanovsky/ffl-2446-evp-flagevaluation-java

leoromanovsky (Contributor) commented Jun 12, 2026

Motivation

Server-side flag evaluations are currently invisible in the Datadog EVP flagevaluation track for Java. This PR makes dd-trace-java emit schema-conformant, aggregated EVP flagevaluation payloads, mirroring the Go reference implementation (#4886) as part of the cross-SDK fan-out. The existing OTel feature_flag.evaluations behavior is left unchanged (no regression). The PR builds on the proof-of-concept ideas from #4874 and resolves the 8 reviewer concerns raised on that PoC.

What we learned

Recording flagevaluation is not like emitting the existing OTel metric. The metric is fire-and-forget; flagevaluation must aggregate per evaluation, bucketing by flag + variant + reason + allocation + context. On every evaluation that means: prune the context (at most 256 fields, string values of at most 256 characters) → build the canonical-context key → update the tiered maps. The OpenFeature Java SDK runs hooks synchronously on the caller's evaluation thread (finallyAfter fires inline), so doing that aggregation inside the hook would charge its cost directly to the user's flag evaluation. The work is structural (the aggregation is the feature), so the fix is architectural, not a hot-loop tweak.

flowchart LR
    A["flagevaluation hook needs per-eval aggregation<br/>(prune / canonical key / bucket)"] --> B["OpenFeature runs hooks synchronously<br/>on the evaluation thread (finallyAfter)"]
    B --> C["Inline aggregation would charge<br/>the user's evaluation directly"]
    C --> D["Resolution: async — cheap snapshot in the hook,<br/>aggregate on a background agent thread"]
  • The work is inherent. flagevaluation aggregates counts per bucket on every evaluation; that is the feature and cannot be optimized away.
  • Hooks are synchronous. The OpenFeature finallyAfter stage runs on the caller's evaluation thread — any cost it incurs lands on the user's client.getBooleanValue(...).
  • Resolution. The hook does only a bounded, cheap snapshot — scalar extraction (details.getVariant(), reason, error, targeting key) plus a non-blocking bounded offer() — and a single background agent thread owns prune/key/aggregate/flush. Measured hot-path cost stays flat (see Performance).
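The "cheap snapshot" step can be sketched roughly as follows. This is a minimal illustration, not the PR's FlagEvalEVPHook: DetailsView is a hypothetical stand-in for the fields read from the OpenFeature evaluation details, and Snapshot and SnapshotExtractor are names invented for this example. It shows the scalar-only extraction described above: an absent variant marks a runtime default, the reason is lowercased, the error message falls back to the error code name, and the evaluation time falls back to hook-fire time when dd.eval.timestamp_ms is missing.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the fields read from the OpenFeature evaluation details.
final class DetailsView {
  final String variant;      // null when no variant was resolved
  final String reason;       // may be null
  final String errorMessage; // may be null
  final String errorCode;    // may be null
  final Map<String, Object> metadata;

  DetailsView(String variant, String reason, String errorMessage, String errorCode,
      Map<String, Object> metadata) {
    this.variant = variant;
    this.reason = reason;
    this.errorMessage = errorMessage;
    this.errorCode = errorCode;
    this.metadata = metadata;
  }
}

// Hypothetical immutable snapshot handed to the background worker.
final class Snapshot {
  final String variant;
  final boolean runtimeDefaultUsed;
  final String reason;
  final String error;
  final long evalTimeMs;

  Snapshot(String variant, boolean runtimeDefaultUsed, String reason, String error,
      long evalTimeMs) {
    this.variant = variant;
    this.runtimeDefaultUsed = runtimeDefaultUsed;
    this.reason = reason;
    this.error = error;
    this.evalTimeMs = evalTimeMs;
  }
}

final class SnapshotExtractor {
  static final String TIMESTAMP_KEY = "dd.eval.timestamp_ms";

  // Scalar extraction only: no pruning, no key building, no map updates.
  static Snapshot capture(DetailsView d, long hookFireTimeMs) {
    Object stamped = d.metadata.get(TIMESTAMP_KEY);
    long evalTimeMs =
        stamped instanceof Number ? ((Number) stamped).longValue() : hookFireTimeMs;
    String reason = d.reason == null ? null : d.reason.toLowerCase(Locale.ROOT);
    String error = d.errorMessage != null ? d.errorMessage : d.errorCode;
    // A missing variant, not the reason, signals that a runtime default was used.
    return new Snapshot(d.variant, d.variant == null, reason, error, evalTimeMs);
  }
}
```

Everything here is constant-time field access, which is the property the hook needs; the expensive steps stay with the worker.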

Transferable lesson for the fan-out: telemetry that aggregates per evaluation belongs off the synchronous hook path. Each SDK should verify its runtime's hook-execution model and offload aggregation to a worker accordingly.

Design Decisions

  • Reuse the existing EVP transport used by the exposure path. FlagEvaluationWriterImpl posts to /evp_proxy/v2/api/v2/flagevaluations through the BackendApiFactory(EVENT_PLATFORM) channel — the direct analog of ExposureWriterImpl — adding only a new aggregation layer on a ~10s flush timer (plus a final drain-and-flush on shutdown). Emission is aggregated counts per bucket, matching the worker's evaluation_count / first_evaluation / last_evaluation schema.
  • Asynchronous, best-effort recording (the headline design, per What we learned). The finallyAfter hook does only cheap scalar extraction + a non-blocking offer() onto a bounded MessagePassingBlockingQueue (capacity 2^16); a single background agent thread owns prune/key/aggregate/flush. No thread-per-evaluation.
  • Bucket identity = comparable canonical-context key, NOT a hash. The enumerable dimensions (flag, variant, allocationKey, reason, targetingKey) are exact strings. The pruned context is encoded into a comparable canonical (type-tagged, length-delimited) string that becomes part of the map key, so distinct contexts always land in distinct buckets — no FNV digest, no collisions, no misattribution. Context pruning is deterministic (sort before cut): keys are sorted first, then the first 256 non-oversized values are kept, and the pruned attributes (not the raw attrs) are what gets aggregated and serialized.
  • Two-tier degradation with explicit caps (worker-side): full (globalCap=131072, perFlagCap=10000/flag; context pruned 256/256) → degraded (degradedCap=32768; drops targeting key + context) → drop (counted). No third "ultra-degraded" tier.
  • Variant from details.getVariant(), not the evaluated value. The variant is the OpenFeature variant key (same source as the OTel FlagEvalHook); an absent variant means a runtime default (runtime_default_used=true), detected from the missing variant rather than the reason.
  • Separate finallyAfter hook covering success/error/default paths; the existing OTel metrics hook (FlagEvalHook / FlagEvalMetrics) is unchanged. The error path serializes the error message (falling back to the error code name) as the schema {"message": ...} object.
  • Eval timestamp is stamped at eval-entry time via flag metadata key dd.eval.timestamp_ms, falling back to hook-fire time when absent; first_evaluation/last_evaluation are the min/max across events in a bucket (no wall-clock assumptions).
  • Killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED (default on) gates only the EVP path, read through the tracer config system (FeatureFlaggingConfig.FLAGGING_EVALUATION_COUNTS_ENABLED) and registered in supported-configurations.json (config-inversion gate passes).
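The sort-before-cut pruning and the comparable canonical key can be illustrated with a small sketch. The class name ContextKeys, the single-character type tags, and the exact `tag + length + ':' + text` encoding are assumptions made for this example; the PR states only that the key is type-tagged and length-delimited, so the real encoding may differ.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; the tags and the encoding format are illustrative only.
final class ContextKeys {

  // Sort keys first, then keep the first maxFields values that are not oversized strings.
  static TreeMap<String, Object> prune(Map<String, Object> attrs, int maxFields, int maxChars) {
    TreeMap<String, Object> pruned = new TreeMap<>();
    for (Map.Entry<String, Object> e : new TreeMap<>(attrs).entrySet()) {
      if (pruned.size() >= maxFields) {
        break;
      }
      Object v = e.getValue();
      if (v instanceof String && ((String) v).length() > maxChars) {
        continue; // oversized strings are skipped, not truncated
      }
      pruned.put(e.getKey(), v);
    }
    return pruned;
  }

  // Encode each key and value as tag + length + ':' + text, in sorted key order.
  static String canonicalKey(TreeMap<String, Object> pruned) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Object> e : pruned.entrySet()) {
      appendToken(sb, 'k', e.getKey());
      Object v = e.getValue();
      char tag =
          v instanceof String ? 's' : v instanceof Boolean ? 'b' : v instanceof Number ? 'n' : 'o';
      appendToken(sb, tag, String.valueOf(v));
    }
    return sb.toString();
  }

  private static void appendToken(StringBuilder sb, char tag, String text) {
    sb.append(tag).append(text.length()).append(':').append(text);
  }
}
```

Because every token carries its type tag and length, the integer 1 and the string "1" produce different keys, and no choice of attribute names or values can make two different pruned contexts concatenate to the same string. The key is compared by equality rather than hashed to a digest, which is what rules out misattribution.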

Notes for reviewers: the payload schema check in FlagEvaluationWriterImplTest uses a small structural validator (FlagEvalSchema) — asserting required fields, integral types, {"key":...} / {"message":...} object shapes, and additionalProperties:false — rather than a JSON-schema library, because there is no json-schema validator on the test classpath. It validates against the real vendored worker schema (src/test/resources/flagevaluation/batchedflagevaluations.json). A reviewer may prefer adding networknt:json-schema-validator as a testImplementation dependency to validate directly against the schema file instead.

High-load behavior — when do we drop?

Two layers, two policies:

1. Aggregator (worker-side) — counts are preserved; only dimensional fidelity degrades. Every evaluation the worker processes contributes exactly one count to some tier (or the bounded drop counter).

| Tier | Key dimensions | Bound | Entered when |
|---|---|---|---|
| Full | flag, variant, allocationKey, reason, targetingKey, canonical context (+ pruned context payload) | globalCap=131,072 total and perFlagCap=10,000/flag | default; context pruned 256/256 |
| Degraded | flag, variant, allocationKey, reason | degradedCap=32,768 | a flag reaches perFlagCap distinct full buckets, or globalCap is full |
| Drop (counted) | n/a | (a single observable counter) | degradedCap is full |

The degraded key is exactly the OTel feature_flag.evaluations cardinality (flag × variant × allocation × reason). Over-cap counts increment an observable droppedDegradedOverflow counter (surfaced as a warning on flush) rather than cascading into a lossy third tier.
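The tier cascade can be sketched as below. This is a simplified, single-threaded illustration under stated assumptions: TwoTierAggregator and Bucket are hypothetical names, and the caller passes pre-built string keys, whereas the PR's FlagEvaluationWriterImpl presumably uses structured keys. The sketch shows the policy described above: an existing or admissible full bucket absorbs the count, otherwise the degraded bucket does, otherwise the drop counter is incremented. Each bucket tracks first/last evaluation as min/max over the events it has seen.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical aggregated bucket: a count plus min/max evaluation times.
final class Bucket {
  long count;
  long firstEvaluationMs = Long.MAX_VALUE;
  long lastEvaluationMs = Long.MIN_VALUE;

  void add(long evalTimeMs) {
    count++;
    firstEvaluationMs = Math.min(firstEvaluationMs, evalTimeMs);
    lastEvaluationMs = Math.max(lastEvaluationMs, evalTimeMs);
  }
}

// Hypothetical single-threaded aggregator owned by the worker thread.
final class TwoTierAggregator {
  private final int globalCap;
  private final int perFlagCap;
  private final int degradedCap;
  private final Map<String, Bucket> full = new HashMap<>();
  private final Map<String, Integer> fullBucketsPerFlag = new HashMap<>();
  private final Map<String, Bucket> degraded = new HashMap<>();
  private long droppedDegradedOverflow;

  TwoTierAggregator(int globalCap, int perFlagCap, int degradedCap) {
    this.globalCap = globalCap;
    this.perFlagCap = perFlagCap;
    this.degradedCap = degradedCap;
  }

  // Returns the tier that absorbed the evaluation: "full", "degraded", or "drop".
  String record(String flag, String degradedKey, String fullKey, long evalTimeMs) {
    Bucket b = full.get(fullKey);
    if (b == null) {
      int perFlag = fullBucketsPerFlag.getOrDefault(flag, 0);
      if (full.size() < globalCap && perFlag < perFlagCap) {
        b = new Bucket();
        full.put(fullKey, b);
        fullBucketsPerFlag.put(flag, perFlag + 1);
      }
    }
    if (b != null) {
      b.add(evalTimeMs);
      return "full";
    }
    Bucket d = degraded.get(degradedKey);
    if (d == null && degraded.size() < degradedCap) {
      d = new Bucket();
      degraded.put(degradedKey, d);
    }
    if (d != null) {
      d.add(evalTimeMs);
      return "degraded";
    }
    droppedDegradedOverflow++; // counted, never silent
    return "drop";
  }

  Bucket fullBucket(String fullKey) {
    return full.get(fullKey);
  }

  long droppedDegradedOverflow() {
    return droppedDegradedOverflow;
  }
}
```

With the caps from the table this would be constructed as `new TwoTierAggregator(131072, 10000, 32768)`. Existing buckets keep accepting counts after a cap is reached; the caps limit only the creation of new buckets.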

2. Async hand-off (hot-path) — explicit best-effort backpressure. The hook offers onto a bounded queue (capacity 2^16). If the worker can't keep up and the queue fills, the hook drops the event and increments an observable droppedQueueOverflow counter (logged each flush) rather than blocking the evaluation. Never silent, never blocking.
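The drop-and-count hand-off can be sketched with the standard library alone. BoundedHandOff is a hypothetical name, and java.util.concurrent.ArrayBlockingQueue stands in for the MessagePassingBlockingQueue named in the PR. The stand-in takes a short internal lock, so this illustrates only the backpressure policy, not the real queue's performance characteristics.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical hand-off: the evaluation thread offers, the worker thread drains.
final class BoundedHandOff<T> {
  private final BlockingQueue<T> queue;
  private final AtomicLong droppedQueueOverflow = new AtomicLong();

  BoundedHandOff(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  // Evaluation thread: offer() returns immediately; a full queue drops and counts.
  boolean offer(T event) {
    if (queue.offer(event)) {
      return true;
    }
    droppedQueueOverflow.incrementAndGet();
    return false;
  }

  // Worker thread: move everything currently queued into sink without waiting.
  int drainTo(List<T> sink) {
    return queue.drainTo(sink);
  }

  // Read on flush so that overflow is observable rather than silent.
  long droppedQueueOverflow() {
    return droppedQueueOverflow.get();
  }
}
```

The important choice is offer() rather than put(): put() would wait for space and stall the user's evaluation, while offer() turns overload into a counted drop.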

Performance

Recording is asynchronous, so the number that matters is the hot-path cost the evaluation actually pays in the synchronous finallyAfter hook. The JMH benchmark FlagEvaluationHotPathBenchmark isolates the eval-thread snapshot+enqueue stage from the off-thread worker aggregation stage:

| Stage | Cost |
|---|---|
| evalThreadCapture (snapshot + non-blocking enqueue; what the evaluation thread pays) | ~5.6 ns/op |
| workerAggregate (prune + canonical key + two-tier aggregation; off the eval thread) | ~736 ns/op |

The expensive aggregation work (~736 ns) runs on the background agent thread and never blocks the user's evaluation; the evaluation thread pays only the flat ~5.6 ns snapshot+enqueue. Run: ./gradlew :products:feature-flagging:feature-flagging-lib:jmh.

Validation

Proven locally against ffe-dogfooding mock-intake (app-java, port 8084):

  • Negative control → green (GET /logs/api/v2/flagevaluations): baseline count == 0, then count > 0 after evaluations with the expected flag.key / variant.key / service. Validated with WAIT_SECONDS=75 because Java's OTel export interval is ~60s — the default 15s window would false-fail the OTel non-regression check.
  • OTel non-regression: feature_flag.evaluations still observed after the change; the existing FlagEvalHook / FlagEvalMetrics path is unchanged.
  • Provider readiness: app-java reaches PROVIDER_READY.
  • Unit tests green: feature-flagging-api 28/0, feature-flagging-lib 16/0; spotlessCheck and checkConfigurations (config registry) pass.

Resolution of the 8 PoC (#4874) reviewer concerns

| # | Concern | Addressed |
|---|---|---|
| 1 | Context bounds before buffering | 256-field / 256-char prune, deterministic sort-before-cut; oversized strings skipped; pruned view is what's aggregated/serialized |
| 2 | Tiers validated vs flageval-worker schema | Structural validator (FlagEvalSchema) over the vendored batchedflagevaluations.json; both tiers conform via optional-field omission |
| 3 | Bucket identity not FNV-1a-alone | Comparable canonical-context string key (type-tagged, length-delimited) as part of the map key; no digest, no collisions; payload uses {"key":...} / {"message":...} object shapes |
| 4 | first/last_evaluation via min/max | min/max across events in a bucket; eval time from dd.eval.timestamp_ms metadata, no wall-clock assumption |
| 5 | Runtime-default from absence of variant | Detected from missing variant (details.getVariant() == null), not from the reason; sets runtime_default_used=true |
| 6 | Benchmark vs base eval cost | JMH FlagEvaluationHotPathBenchmark separates eval-thread snapshot (~5.6 ns) from off-thread aggregation (~736 ns); async keeps the eval path bounded |
| 7 | Hook counts error/default paths | Separate finallyAfter hook covering success/error/default; error message captured (falls back to error code name) |
| 8 | Explicit bounds on degraded/overflow | Per-tier caps (full 131072 / per-flag 10000 / degraded 32768) + observable drop-and-count beyond degraded + bounded async-queue backpressure drop-and-count |

🚧 Draft — open for review ahead of the cross-SDK fan-out.

…EVP writer

- Tests for identical-event aggregation (count 2, first<=last min/max)
- Test for type-tagged canonical key distinguishing int vs string
- Tests for global/degraded cap overflow and drop-counted overflow
- Test for absent variant -> runtimeDefaultUsed
- Tests for 256-field/256-char context pruning
- Tests for flush posting to 'flagevaluations' with required JSON fields
…nWriterImpl two-tier EVP writer

- FlagEvalEvent (bootstrap): lightweight data record for hook->writer channel
- FlagEvaluationWriter (bootstrap): interface with enqueue/start/close
- FlagEvaluationWriterImpl (lib): two-tier aggregation with frozen contract constants
  - GLOBAL_CAP=131072 / PER_FLAG_CAP=10000 / DEGRADED_CAP=32768
  - Canonical context key: sorted, type-tagged, length-delimited (no hash; reviewer concern #3)
  - Context pruning: <=256 fields, string values <=256 chars (reviewer concern #1)
  - Min/max first/last eval time (reviewer concern #4)
  - Absent variant -> runtimeDefaultUsed (reviewer concern #5)
  - Drop-counted overflow (reviewer concern #8)
  - Posts to 'flagevaluations' via BackendApiFactory(EVENT_PLATFORM)
  - Flush interval 10s (differs from exposure 1s)
- Add FEATURE_FLAG_EVALUATION_PROCESSOR thread to AgentThreadFactory
- Register/deregister writer with FeatureFlaggingGateway
- FlagEvaluationWriterImplTest: all 10 unit tests GREEN
- Test: enqueue called with flagKey, variant, reason (lowercased), allocationKey
- Test: evalTimeMs from dd.eval.timestamp_ms metadata (reviewer concern #4)
- Test: evalTimeMs fallback to hook-fire time when metadata absent
- Test: null value -> null variant (runtime default; reviewer concern #5)
- Test: only enqueue called, no inline aggregation (reviewer concern #7)
- Test: writer=null is no-op (killswitch off)
- Test: targetingKey extracted from evaluation context
- FlagEvalEVPHook: Hook<T> with finallyAfter() doing cheap capture only
  - Reads allocationKey from metadata.getString('allocationKey')
  - Reads eval-time from metadata.getLong('dd.eval.timestamp_ms'), fallback to hook-fire time
  - Null value -> null variant (runtime default; reviewer concern #5)
  - Lowercases reason string
  - Resolves writer lazily from FeatureFlaggingGateway (test: injected directly)
  - No aggregation on hook thread (reviewer concern #7)
- DDEvaluator: stamp dd.eval.timestamp_ms in flag metadata at resolution point
- Provider.getProviderHooks(): returns [OTel FlagEvalHook, EVP FlagEvalEVPHook]
  - OTel path preserved byte-for-byte (PRES-01 non-regression)
- FeatureFlaggingSystem: create + start FlagEvaluationWriterImpl behind killswitch
  - DD_FLAGGING_EVALUATION_COUNTS_ENABLED=false disables EVP path only
  - Default: enabled (EVP path on)
- ProviderTest: updated to expect 2 hooks (OTel + EVP)
- FlagEvalEVPHookTest: all 8 unit tests GREEN
- Apply project-required code style (google-java-format) to all new files
  in this plan before submitting

datadog-datadog-prod-us1-2 (Bot, Contributor) commented Jun 12, 2026

Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | check_base

DataDog/apm-reliability/dd-trace-java | test_smoke: [17, 2/8]

Commit SHA: c8c95a5

dd-octo-sts (Bot, Contributor) commented Jun 12, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
|---|---|
| Startup | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
| Scenario | Candidate | master | Δ (95% CI of mean) |
|---|---|---|---|
| startup:insecure-bank:iast:Agent | 13.95 s | 13.93 s | [-0.5%; +0.8%] (no difference) |
| startup:insecure-bank:tracing:Agent | 12.82 s | 13.01 s | [-2.2%; -0.7%] (maybe better) |
| startup:petclinic:appsec:Agent | 16.86 s | 16.53 s | [+1.0%; +2.9%] (maybe worse) |
| startup:petclinic:iast:Agent | 16.87 s | 16.82 s | [-0.6%; +1.1%] (no difference) |
| startup:petclinic:profiling:Agent | 16.74 s | 16.86 s | [-1.5%; +0.1%] (no difference) |
| startup:petclinic:sca:Agent | 16.84 s | 16.47 s | [+1.4%; +3.1%] (significantly worse) |
| startup:petclinic:tracing:Agent | 16.01 s | 15.52 s | [-1.2%; +7.4%] (no difference) |

Commit: c8c95a56 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

… pruning, shutdown drain, and schema validation

- variant sourced from details.getVariant() (not the evaluated value)
- error object captured from evaluation details, schema {"message":...}
- killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED via the tracer config system
- deterministic context pruning (sort before cut), stored pruned attrs
- observable queue-overflow drop counter; close() drains and final-flushes
- JMH hot-path benchmark; structural validation against vendored worker schema
leoromanovsky changed the title from "[FFL-2446] dd-trace-java: emit EVP flagevaluation (Phase 2 fan-out)" to "feat(openfeature): emit server-side EVP flagevaluation" on Jun 14, 2026