Skip to content

AE control auth + CodeRabbit #53 followup#68

Open
entlein wants to merge 56 commits into
mainfrom
ae-followup-auth
Open

AE control auth + CodeRabbit #53 followup#68
entlein wants to merge 56 commits into
mainfrom
ae-followup-auth

Conversation

@entlein

@entlein entlein commented Jun 21, 2026

Copy link
Copy Markdown

Summary: AE control surface gains bearer-JWT auth + TLS, the trigger gets a PollLimit-saturation escape, fork-cloud config restored after the main merge, and CodeRabbit Go-code findings on the umbrella PR #53 are all addressed (controller rollback, NaN-finite gate, MaxBytesReader, streaming retain-on-fail, etc.).

Test Plan: build + 14/14 AE go test packages green locally on cb81ecd / c2642da; CI-image lint (tools/linters/lint_like_ci.sh) reports 0 Errors 0 Warnings; vizier-release built aeprod22 (cb81ecd) successfully, aeprod23 (c2642da) in flight; trigger differential oracle added (oracle_test.go) pins watermark advancement + boundary dedup + PollLimit-at-capacity + the documented overflow trade-off.

Type of change: /kind feature

AE control auth + CodeRabbit #53 followup

Followup to #53 (merge #53 first — this stacks on ae-prod; I'll retarget to main once #53 lands). Captured all 48 inline + 8 review + 12 issue CodeRabbit comments to biz/PoC/log4j/ae_pr53_coderabbit_feedback/ (JSON + DIGEST.md) so nothing is lost.

✅ Done in this PR

  • Bearer-JWT auth on the control surface — uses the same shared lib the broker/PEM use (px.dev/pixie/src/shared/services/utils); SetAuth verifies via jwtutils.ParseToken (signature + audience), middleware on Handler(), /healthz exempt. dx already mints this exact service JWT — it just attaches it. No new secret/crypto. Flag-gated (CONTROL_REQUIRE_AUTH, default off) for safe incremental rollout. Unit test mints a real JWT and asserts 401/pass/healthz. (CodeRabbit server.go: protect control endpoints.)
  • Input validation — reject invalid t_end (≤0) and query windows (start≥end, non-positive). (server.go: reject invalid t_end/window.)

Companion change (separate, entlein/dx)

dx aeclient.go must attach Authorization: Bearer <minted JWT> on its AE control POSTs. Then flip CONTROL_REQUIRE_AUTH=true + mount PL_JWT_SIGNING_KEY on AE. (Tracked; small.)

Already resolved on ae-prod (verified — no action)

  • http.Server timeouts (main.go) — already added (CodeRabbit r3379377432).
  • ~9 harness Majors (ae_config/ae_vs_all/deploy_ae/exp_datavolume_extreme/exp_log4j_reps/… fail-fast) — those scripts were deleted in the harness consolidation; moot.
  • The http_events=0 "bug" — self-retracted by the author (misdiagnosis).

Remaining AE-code Majors — triaged (to address as commits here)

Safe / clear:

  • streaming/filter.go:196 — timer leak on deltaCh close.
  • passthrough/passthrough.go:147 — per-table query/write timeouts.
  • pixieapi/pixieapi.go:89 — direct-mode TLS-skip uses wrong option (→ WithDirectTLSSkipVerify).
  • stats.py:57 — guard false "EXACT" when zero PASS rows.
  • schema.sql:485 — retention on ae_reconcile (unbounded growth).
  • controller.go:347 — don't fan out Pixie rows when attribution persistence fails.

Risky / needs careful testing before a presentation — defer with care:

  • controller.go:255 — rehydrated active windows never restart fan-out (state-machine path).
  • trigger/clickhouse.go:303 — inclusive-watermark paging (this is the F8 watermark area — change only with the F8 fixtures).
  • schema.sql:430adaptive_attribution ORDER BY key width (schema change — migration risk).
  • sink/clickhouse.go:433Record blocks hot path on CH (async refactor — perf-sensitive).

Entlein and others added 30 commits June 21, 2026 22:49
Signed-off-by: entlein <einentlein@gmail.com>
New env-gated background loop that runs the same PxL shape AE's
anomaly-gated path uses, but with an empty Target (no ns/pod predicate)
and over a configurable rolling window. Writes via the existing sink so
the byte-shape of forensic_db rows is comparable between the
PASSTHROUGH=1 phase (EVERYTHING) and the PASSTHROUGH=0 phase
(AE-FILTER). One-shot A/B that yields the per-table capture fraction of
the adaptive write path.

- internal/passthrough/passthrough.go — Loop + Config; defaults to
  30s window / 30s refresh / clickhouse.PixieTables() table list.
- internal/passthrough/passthrough_test.go — 6 tests; the load-bearing
  one is TestLoop_EmitsEmptyTargetPxL (asserts neither df.namespace nor
  df.pod predicates appear in the emitted PxL).
- cmd/main.go: ADAPTIVE_PASSTHROUGH + _WINDOW_SEC + _REFRESH_SEC env
  knobs. Adapter is constructed unconditionally when passthrough is on
  (joins the existing PushPixie / streaming construction path so the
  same pxapi grpc stream is reused). Loop is registered with the
  shutdown WaitGroup so SIGTERM waits for the in-flight tick.
- cmd/BUILD.bazel: drop @px// load (other AE BUILD.bazel files use
  //bazel — sticking out as the only one with @px is a leftover from a
  prior gazelle run; align). Add passthrough dep.

Signed-off-by: entlein <einentlein@gmail.com>
Stand-alone workflow that builds entlein/dx (private Active-Diagnosis
Framework) into ghcr.io/k8sstormcenter/dx-daemon. Separates the dx image
publish from the bazel-based vizier_release pipeline; the dx repo ships
its own Dockerfile.dxd (Go cross-compile + distroless final stage) so it
doesn't need to live as a submodule inside src/vizier/services/dx.

Triggers:
- tag push 'release/dx/v*' on this repo cuts a release build, image tag
  derived from the tag suffix (release/dx/v0.1.0 -> image tag 0.1.0).
- workflow_dispatch lets us build any dx ref on demand with a custom tag
  (default: short sha of the resolved dx commit).

Pulls dx via DX_ENTLEIN_PAT (already configured on the repo). Multi-arch
build (linux/amd64, linux/arm64); Dockerfile.dxd cross-compiles in the
native BUILDPLATFORM stage and the final stage is COPY-only, so target
emulation isn't required.

Signed-off-by: entlein <einentlein@gmail.com>
The dx image build pipeline lives in entlein/dx itself (PR #53,
branch feat/bazel-release): bazel-based with @px external pin to
pixie's ae-prod tip, pushes to docker.io/entlein/dx-daemon on a
release/dx/v* tag in the dx repo. The pixie-side buildx workflow
this reverts duplicated that intent in the wrong repo + the wrong
build system (docker buildx instead of bazel + pl_go_image macros)
+ the wrong registry (ghcr.io/k8sstormcenter instead of
docker.io/entlein).

Signed-off-by: entlein <einentlein@gmail.com>
…affordances

Fixes the silent-halt bug: the trigger gated on a RAW event_time high-water-mark,
so a single anomaly in a larger unit (ms/ns) drove the watermark past all real
seconds rows and AE stopped processing forever (data still on Pixie). Normalize
event_time to canonical nanoseconds in the poll SELECT filter+order and in the
in-memory/persisted cursor, boundary-dedup, and maxSeen (normalizeEventTimeNanos
+ chNormEventTimeNanos). Validated at the data layer: vs a poisoned watermark the
raw filter returns 0 rows, the normalized filter recovers all 60.

Also adds ADAPTIVE_PUSH_REFRESH_SEC (negative = single-shot pull) for the
reproducible load-test harness, an in-package trigger unit test, and an e2e
hermetic load test (mock PixieQuerier, exact rows+bytes).

Signed-off-by: entlein <einentlein@gmail.com>
…ness

Consolidate the adaptive_export load-test harness under src/e2e_test/ (matching
vzconn_loadtest / px_cluster conventions). Control-plane experiments (E1-E4, E6,
E8 sustained) are proven exactly-reproducible on a live rig; the data-plane
experiments (E5, E8 data-mode) are authored and pending live validation on a
vizier-registered rig (status documented in README + FINDINGS_AND_BACKLOG).

- harness/: shell + python (inject, exp_control, exp_e8, stats, ...) + lib helpers
- fixtures/EXPERIMENTS.md: curated kubescape_logs data-set catalog + expected outputs
- k8s/: isolated sinks + per-rep generator pod (no probes)
- tools/loadgen/: cleanloadgen + httpsink (docker-built test tool; .bazelignore'd
  pending a bazel target — lib/pq is already vendored in the module)
- FINDINGS_AND_BACKLOG.md: F8 watermark-poison bug + the fix + AE backlog

The AE Go unit/e2e tests live with the service (internal/{trigger,e2e}).

Signed-off-by: entlein <einentlein@gmail.com>
…C14) + diagrams

Signed-off-by: entlein <einentlein@gmail.com>
…iagram; gen sustained-DNS mode

C15 = AE must keep re-pulling+writing an active pod until t_end or DX stop (the
contract DX steers on; last week's 'wrote then stopped' is its violation). Add
DX-steering sequence diagram. Generator gains SUSTAIN_SEC (distinct-DNS trickle,
a Pixie-traced protocol) + configurable SETTLE_PRE_MS warm-up for fresh-pod
capture; harness wires GEN_SUSTAIN_SEC/GEN_SETTLE_MS.

Signed-off-by: entlein <einentlein@gmail.com>
…lisation

The 700821d trigger unit-normalisation wrapped event_time in a
multiIf(...) inside both the WHERE filter and the ORDER BY. Three
existing tests in watermark_test.go + one in clickhouse_test.go pinned
the raw 'event_time >= N' substring and broke at HEAD.

Update each test's expected substring to match the new normalized form
(') >= <ns-scaled N>' — the closing paren of multiIf, then the value
in canonical nanoseconds). Per-test ns-scaling:

  watermark_test.go:94  1744000000000000000  already ns -> unchanged
  watermark_test.go:125 InitialWatermark=42  < 1e10 sec -> * 1e9
  watermark_test.go:156 InitialWatermark=7   < 1e10 sec -> * 1e9
  watermark_test.go:297 event_time='5000'    < 1e10 sec -> * 1e9
  clickhouse_test.go:82 ref.T=1744..303e9 ns already ns -> unchanged

go test ./src/vizier/services/adaptive_export/... all green.

Signed-off-by: entlein <einentlein@gmail.com>
…future-stamp watermark poison)

Signed-off-by: entlein <einentlein@gmail.com>
Records one forensic_db.ae_reconcile row per data-plane pull (read_count vs
wrote_count, window, ns/pod) across ALL three write paths — controller fan-out
(filter), passthrough firehose, and streaming scanner — so a reconcile run
localizes loss to query (R5: read<PEM) vs sink (R6: wrote<read) and quantifies
re-pull dup (C8). Counts alone (write >= read) were proven insufficient.

- new internal/reconcile leaf package (Row, Recorder, Nop) — no import cycle
- sink.Record: CH-backed recorder (INSERT INTO forensic_db.ae_reconcile)
- ae_reconcile table: schema.sql + KnownTables + OperatorOwnedTables (synced);
  not a pixie table (absent from PixieTables, so VerifyPixieSchema ignores it)
- wired: passthrough.tick, controller.pushPixieRows (deferred, all return
  paths), streaming.scanner.Run; gated by ADAPTIVE_RECONCILE=true, else Nop
- unit test proves read/wrote capture incl. the sink-drop read>wrote shape
- fixed apply_test trailing-tables guard for the new operator table
- harness: exp_row_reconcile.sh (row-level PEM<->CH), ae_vs_all.sh, exp_datavolume_extreme.sh

Signed-off-by: entlein <einentlein@gmail.com>
px -o json empty result previously printed one blank line → counted as a phantom
LOSS=1. Guard: empty set → 0-byte keys file; drop all-empty-field keys. Confirmed
against the controlled log4j run (backend http 14/14 exact, conn 66>=12, no loss).

Signed-off-by: entlein <einentlein@gmail.com>
…log4shell

Reliable BY CONSTRUCTION against bob#140 (stateful/unreliable exploit on re-fire):
fresh-JVM backend (delete pod) + attacker-before-backend + the WORKING resolvable FQDN
attacker.<ns>.svc.cluster.local:1389 (NOT the bare attacker-ns.svc which NXDOMAINs and
gets dropped), then VERIFY the actual backend->:1389 LDAP egress in forensic_db.conn_stats
and RETRY until confirmed (the validity gate). Never assumes the exploit fired. Node-side.

Signed-off-by: entlein <einentlein@gmail.com>
…tion)

Reword from offensive 'exploit' to detection-signal-generation language: this validates
the kubescape->DX->AE detection chain. No logic change.

Signed-off-by: entlein <einentlein@gmail.com>
… http2

- pxl.CompilePassthrough/Render: precompile per-table PxL once (fixed
  window => constant relative start_time), only the two time_ bounds are
  stamped per tick. Rendered output is byte-identical to QueryFor with an
  empty Target (TestCompilePassthrough_MatchesQueryFor), so this is a
  structural change, not a capture change. upid->pod/ns stays in PxL.
- passthrough: tickConcurrent fans every table out at once (was a serial
  loop); shared pull() helper. Sink/recorder are pool/HTTP-backed and
  already called concurrently elsewhere.
- drop http2_messages.beta from the firehose set (not materialised on
  every cluster => ""Table not found"" every tick); shared PixieTables/DDL
  lists untouched.
- toggle ADAPTIVE_PASSTHROUGH_COMPILED (default on; =false reverts to the
  legacy serial QueryFor path).

Signed-off-by: entlein <einentlein@gmail.com>
…e.go

Fixes "missing strict dependencies: import of .../internal/reconcile" that
broke the AE image build for passthrough, sink, streaming, controller, cmd
(pre-existing since the ADAPTIVE_RECONCILE commit added the package + imports
without bazel deps; never CI-built). Also wires the new pxl/compile.go srcs
+ passthrough/pxl test srcs (compiled_test.go, reconcile_test.go, compile_test.go).

- new internal/reconcile/BUILD.bazel (go_library, stdlib-only)
- +//internal/reconcile dep: passthrough, sink, streaming, controller, cmd
- pxl go_library +compile.go; pxl_test +compile_test.go; passthrough_test +compiled_test.go,reconcile_test.go (+reconcile dep)

Signed-off-by: entlein <einentlein@gmail.com>
F1 RCA: Pixie caps every px.display at max_output_rows_per_table (default
10000, query_flags.go) — the planner add_limit_to_batch_result_sink_rule
silently truncates wide firehose windows / busy pods at the READ (write path
is clean: ae_reconcile shows read==wrote). Fix uses Pixie own native knob —
prepend `#px:set max_output_rows_per_table=1000000` to every generated PxL
(QueryFor + CompilePassthrough) so all pull paths are uncapped. Validated on
rig: a 14208-row window returned 10000 (capped) vs 14298 (with flag). No
pagination loop, no extra round-trips. See memory project-ae-passthrough-10k-cap.

Signed-off-by: entlein <einentlein@gmail.com>
Consolidates the recurring content_type silent-drop incident class
into one default-suite test gate (6 tests, ~15ms):

  I1 TestContract_ContentTypeIsInt64InSchema
  I2 TestContract_FastEncodeContentTypeAsInt
  I3 TestContract_SilentDropDetected
  I3.b TestContract_SilentDropNotTriggeredOnSuccess
  I3.c TestContract_SilentDropToleratesMissingSummaryHeader
  I4 TestContract_HTTPEventsRoundTrip

Top-of-file docstring chronicles the incident timeline so future
operators can grep their way to the contract.

Signed-off-by: entlein <einentlein@gmail.com>
Measures the AdaptiveExport value prop: datavolume REDUCTION of DX-steered AE
(rev-3 streaming, AE writes only DX-steered activeSet pods over the control
surface) vs saving ALL data (passthrough firehose). Two arms, same fixed load,
forensic_db active-part deltas (rows+bytes) per table; reduction = 1 - DX/ALL.
Uses the canonical resolvable JNDI FQDN (attacker.attacker-ns.svc.cluster.local)
so the chain fires + DX can classify (a malformed host → NXDOMAIN → no steer).
Successor to ae_vs_all.sh, whose AE arm used the rev-2 controller gate + stale JNDI.

Signed-off-by: entlein <einentlein@gmail.com>
Measures all AE non-functional requirements under steady load on the rig:
throughput (rows+bytes/sec), capture completeness (AE read vs broker count =
F1 cap proof), write fidelity (read==wrote + write-error count), end-to-end
freshness latency (now - max(time_) in CH), resource footprint (AE pod cpu/mem
idle vs loaded), per-cycle cadence. Emits a structured report; companion to
exp_dx_steering_reduction.sh. Real-data only.

Signed-off-by: entlein <einentlein@gmail.com>
…ring + live-pod guard)

Run-1 reported false 100% reduction because stale adaptive_attribution windows
rehydrated DEAD pods (deleted loaders) into the activeSet → AE streamed dead pods
→ 0 rows. Clear adaptive_attribution before the DX arm so the activeSet only gets
freshly-steered LIVE pods; add a guard that prints the steered pods + marshalsec
fire count + live log4j-poc pods so a dead-arm result is caught, not reported as
a reduction.

Signed-off-by: entlein <einentlein@gmail.com>
lag query used now()-DateTime64 (type error -> na); use dateDiff(second,...).
Capture-completeness vs broker was window-misaligned (623% artifact) -> drop it;
report tot_read vs tot_wrote (read==wrote) + errs instead. The 10k-cap/completeness
proof is the dedicated F1 test (max_read>10000 vs broker for the SAME window).

Signed-off-by: entlein <einentlein@gmail.com>
…ary)

Run-2 byte-delta reduction came out negative because system.parts byte delta is
compaction-noisy (merges land mid-window). Report rows reduction as primary
(actual captured-row count, noise-free); keep bytes as secondary context.

Signed-off-by: entlein <einentlein@gmail.com>
Eviction-RCA finding (PR #63 NFR campaign): AE had NO memory limit (only cpu
300m) and was CPU-pinned at 300m under concurrent passthrough. AE measured tiny
(16-38Mi steady), but the raised 1M-row passthrough cap can spike, so cap at 1Gi
so AE can never memory-pressure a node; raise cpu limit 300m->1 core (was
throttling). NOTE node evictions were NOT AE/OOM — node-01 went NotReady
(network/heartbeat); the memory consumer is PEM (1365Mi, OOMs at the 2Gi default).

Signed-off-by: entlein <einentlein@gmail.com>
Root cause of the recurring "AE unauthenticated / writes 0 / crashloop" reverts:
kustomization.yaml bundled adaptive_export_secrets.yaml (placeholder pixie-api-key)
with the role+deployment, so EVERY infra re-apply (make log4j) clobbered the real
key that ae-auth had written. Separation of concerns: remove the secret from the
kustomization — infra (role+deployment) stays re-appliable; the secret holds real
creds and is owned solely by `make ae-auth`, created once, never touched by infra
re-applies. Secret manifest kept as a hand-applied seed-only template (documented).

Signed-off-by: entlein <einentlein@gmail.com>
The DX-steered arm was failing because the harness only fired stage-1 (JNDI/LDAP).
That generates ldap-egress but NO kubescape R0001 → backend never flagged → DX no
case → indeterminate → AE steers wrong/no pods. R0001 comes from stage-2 (post-
exploitation exec). fire() now does stage-1 (JNDI) + stage-2 (whoami/shadow/token/
getent in the backend) → kubescape R0001+R0006 → DX rules backend MALIGNANT →
backend enters AE activeSet → reduction is measurable. Verified live: DX evidence
unexpected-spawn+sensitive-file-read → verdict ruled_in generic=MALIGNANT.

Signed-off-by: entlein <einentlein@gmail.com>
…dd DX-steering diagnostics

Standing terminology rule: allowlist/blocklist, never whitelist/blacklist.
Pure rename (no behavior change) of the rev-3 streaming filter:
  FilterModeWhitelist  → FilterModeAllowlist
  MaxWhitelistSize     → MaxAllowlistSize
  ADAPTIVE_STREAM_MAX_WHITELIST → ADAPTIVE_STREAM_MAX_ALLOWLIST  (env)
  mode=whitelist log string → mode=allowlist
plus all comments/identifiers/tests in streaming, activeset, cmd/main.

DX-steering diagnostics (the reason DX-arm-writes-0 has been hard to RCA —
we could not tell "empty ActiveSet" from "broker returned 0 rows"):
  - scanner: log the empty-allowlist short-circuit (was silent) so an
    empty ActiveSet is visible in logs, distinct from "query completed rows=0".
  - FilterUpdater: emitted-filter log Debug→Info so the steered pod count
    per ActiveSet change is visible without debug logging.

NOTE: ADAPTIVE_STREAM_MAX_WHITELIST env renamed → tooling that sets the old
name must switch to ADAPTIVE_STREAM_MAX_ALLOWLIST.

Signed-off-by: entlein <einentlein@gmail.com>
The Pixie dx_evidence_graph UI reads dx_attack_graph via px.DataFrame
clickhouse_dsn, whose query template hardcodes event_time + hostname and
ORDER BY event_time. A table without those columns fails 'Unknown
identifier event_time'; a table created by hand (local, not via the
operator) isn't globally registered. Fix: make AE own it like the other
forensic tables.

- schema.sql: dx_attack_graph DDL with event_time(UInt64 nanos) + hostname,
  edge columns, fromUnixTimestamp64Nano partition/TTL (nanos-correct).
- KnownTables + OperatorOwnedTables: register it so Apply creates it at boot.
- apply_test: assert last-applied DDL == last OperatorOwnedTables entry
  (robust to appended operator tables) instead of hardcoding trigger_watermark.

go test ./.../clickhouse green.

Signed-off-by: entlein <einentlein@gmail.com>
Pixie's clickhouse_dsn type mapper reads UInt8 as BOOLEAN and does not
handle UInt16/UInt32/Float32 -> px fails with 'Column[N] given incorrect
type' rendering the dx_evidence_graph. weight/max_severity/num_findings
-> Int64, confidence -> Float64 (map cleanly to INT64/FLOAT64). Verified
live: px run returns all 6 edges with every column. event_time stays
UInt64 (matches kubescape_logs, which px reads).

Signed-off-by: entlein <einentlein@gmail.com>
…canner buildPxL

The DX/streaming arm silently capped each per-table pull at Pixie's default
10000-row limit while the passthrough/ALL arm (pxl.CompilePassthrough /
QueryFor) already raises it to 1,000,000 via the broker's #px:set query flag.
Validated live on 6a33dac0: a single streaming http_events pull returned
exactly rows=10000 (the cap). Left unfixed this UNDER-counts the DX arm and
OVERSTATES the DX-vs-ALL volume reduction. Prepend the same #px:set directive
to the streaming scanner's PxL so both arms are uncapped and comparable.

Signed-off-by: entlein <einentlein@gmail.com>
…owup)

AE control endpoints had no auth. Add it using the SAME shared lib the vizier
broker/PEM use (px.dev/pixie/src/shared/services/utils): SetAuth verifies a
bearer JWT via jwtutils.ParseToken (signature + audience), middleware on
Handler() with /healthz exempt. dx already mints this exact service JWT
(GenerateJWTForService, PL_JWT_SIGNING_KEY) — it just attaches it. No new
secret/crypto. Flag-gated (CONTROL_REQUIRE_AUTH + PL_JWT_SIGNING_KEY), default
OFF so it merges before dx sends the bearer; flip on after dx is updated.

Also: reject invalid t_end (<=0) and query windows (start>=end, non-positive).
Test mints a real JWT via the shared lib; asserts 401 on missing/bad token,
pass on valid, /healthz open.
@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Updates adaptive export runtime behavior, adds a load-test rig and SOC workflow, and introduces cockpit and Terraform cloud configuration.

Changes

Adaptive Export Operator and Load-Test Rig

Layer / File(s) Summary
Build config and runner baseline
.arclint, .bazelignore, src/stirling/source_connectors/socket_tracer/testing/container_images/BUILD.bazel, .github/workflows/vizier_release.yaml, .github/workflows/cli_release.yaml, .github/workflows/cloud_release.yaml, .github/workflows/mirror_demos.yaml, .github/workflows/mirror_deps.yaml, .github/workflows/mirror_releases.yaml, .github/workflows/operator_release.yaml, .github/workflows/perf_common.yaml
Lint and Bazel ignores are added for the loadgen tool, the BUILD argument order is normalized, and several release/mirror/perf jobs switch to the oracle-vm-16cpu-64gb-x86-64 runner.
ClickHouse schema and trigger polling
src/vizier/services/adaptive_export/internal/clickhouse/schema.sql, src/vizier/services/adaptive_export/internal/trigger/clickhouse.go, src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go, src/vizier/services/adaptive_export/internal/trigger/BUILD.bazel, src/vizier/services/adaptive_export/internal/trigger/oracle_test.go
The forensic schema adds ingestion, attribution, watermark, reconcile, and dx attack-graph objects, and the ClickHouse trigger advances past saturated boundaries with normalized-time and oracle coverage.
HTTP control surface and tests
src/vizier/services/adaptive_export/internal/control/BUILD.bazel, src/vizier/services/adaptive_export/internal/control/server.go, src/vizier/services/adaptive_export/internal/control/server_test.go
The control server exposes export, query, attack-graph, and health endpoints with optional bearer-JWT auth and body limits, and the tests exercise auth, validation, and error handling.
Controller, passthrough, and streaming paths
src/vizier/services/adaptive_export/internal/controller/controller.go, src/vizier/services/adaptive_export/internal/controller/controller_test.go, src/vizier/services/adaptive_export/internal/passthrough/passthrough.go, src/vizier/services/adaptive_export/internal/chhttp/chhttp.go, src/vizier/services/adaptive_export/internal/sink/clickhouse.go, src/vizier/services/adaptive_export/internal/sink/fastencode.go, src/vizier/services/adaptive_export/internal/sink/integration_test.go, src/vizier/services/adaptive_export/internal/streaming/writer.go, src/vizier/services/adaptive_export/internal/pixieapi/pixieapi.go
Adaptive export rehydrates and rolls back attribution state, records passthrough reconciliation on skipped tables, streams without client timeout, retains flushed buffers on error, and rejects non-finite JSON floats.
Deployment TLS, auth, and client option
k8s/vizier/bootstrap/adaptive_export_deployment.yaml, src/api/go/pxapi/opts.go, src/vizier/services/adaptive_export/cmd/main.go
The deployment adds resources and TLS cert mounting, the main entrypoint conditionally enables auth and TLS, and pxapi gets a direct TLS skip-verify option.
Load generator toolchain
src/e2e_test/adaptive_export_loadtest/tools/loadgen/...
The loadgen module adds cleanloadgen, httpsink, its module definition, and a distroless multi-stage image.
Harness library, injection, config, and stats
src/e2e_test/adaptive_export_loadtest/harness/lib.sh, src/e2e_test/adaptive_export_loadtest/harness/inject.sh, src/e2e_test/adaptive_export_loadtest/harness/ae_config.sh, src/e2e_test/adaptive_export_loadtest/harness/stats.py, src/e2e_test/adaptive_export_loadtest/k8s/00-sinks.yaml, src/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yaml
The shared harness library, row injector, AE config helper, reproducibility stats script, and Kubernetes sink and generator manifests are added.
Experiment harnesses E1–E8 and orchestration
src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh, src/e2e_test/adaptive_export_loadtest/harness/exp_e5.sh, src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh, src/e2e_test/adaptive_export_loadtest/harness/exp_matrix.sh, src/e2e_test/adaptive_export_loadtest/harness/nfr.sh, src/e2e_test/adaptive_export_loadtest/harness/exp_row_reconcile.sh, src/e2e_test/adaptive_export_loadtest/harness/log4shell_fire.sh, src/e2e_test/adaptive_export_loadtest/harness/run.sh, .github/workflows/e2e_log4shell_soc.yaml
The control-plane, data-plane, matrix, NFR, reconciliation, log4shell, top-level orchestration scripts, and the SOC workflow are added for adaptive export testing.
Load-test documentation
src/e2e_test/adaptive_export_loadtest/CONTRACTS.md, src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md, src/e2e_test/adaptive_export_loadtest/README.md, src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md
The docs capture the test modes, contract register, findings and backlog, and the experiment catalog with reproducibility notes.
Cockpit manifests and cloud overlay
cockpit/artifact_config.yaml, cockpit/auth_deployment.yaml, cockpit/cloud_ingress.yaml, cockpit/domain_config.yaml, cockpit/kustomization.yaml, cockpit/oauth_config.yaml, cockpit/plugin_db_updater_job.yaml, cockpit/proxy_envoy.yaml, cockpit/script_bundles_config.yaml, cockpit/servers_transport.yaml, skaffold/skaffold_cloud.yaml
The cockpit overlay adds domain, OAuth, ingress, proxy, plugin updater, script bundle, and skaffold profile configuration.
Terraform Auth0 and cloud dependencies
.sops.yaml, terraform/credentials/cockpit/auth0_config.yaml, terraform/kubernetes/auth0/*.tf, terraform/kubernetes/cloud_deps/*.tf, terraform/kubernetes/scripts/create_random_bytes.sh
Encrypted Auth0 inputs, imported Auth0 resources, cloud-dependency providers, certificate and secret resources, and helper scripts are added.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • k8sstormcenter/pixie#49: This PR adds pxapi.WithDirectTLSSkipVerify(), which aligns with the direct/node-local TLS path used by the related PEM query work.
  • k8sstormcenter/pixie#53: Both PRs change adaptive-export watermark/event-time handling and the ae_reconcile instrumentation path.
  • k8sstormcenter/pixie#64: Both PRs update GitHub Actions runner labels in the release workflows.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and reflects the main change: AE control auth plus follow-up fixes from CodeRabbit #53.
Description check ✅ Passed The description is clearly related to the changeset and accurately summarizes the auth, validation, and follow-up fixes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ae-followup-auth

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Base automatically changed from ae-prod to main June 21, 2026 21:03
- controller: don't fan out Pixie rows when attribution Sink.Write fails
  (avoids orphaned rows; release in-flight slot, non-fatal). (controller.go:347)
- controller.Rehydrate: re-arm rev-1 pushPixieRows for restored windows so a
  restart doesn't silently miss post-restart Pixie data. (controller.go:255)
- passthrough.pull: per-table timeout context so a hung dependency can't stall
  the sweep / delay shutdown (covers serial+concurrent ticks). (passthrough.go:147)
- schema: TTL (30d) on ae_reconcile to cap append-only growth. (schema.sql:485)

Verified no-change-needed: adaptive_attribution ORDER BY (hostname, anomaly_hash)
is safe — anomaly_hash already encodes namespace+pod (anomaly/hash.go), so rows
never collapse across ns/pod. (schema.sql:430)
Already on ae-prod (no-op): filter timer leak, stats.py EXACT guard, sink async,
watermark paging, pixieapi WithDirectTLSSkipVerify, http.Server timeouts.
@entlein

entlein commented Jun 21, 2026

Copy link
Copy Markdown
Author

@dx-agent — cut you a release to regression-test this followup.

Release: release/vizier/v0.14.19-aeprod20 → image ghcr.io/k8sstormcenter/vizier-adaptive_export_image:0.14.19-aeprod20
Build: https://github.com/k8sstormcenter/pixie/actions/runs/27917599684 (oracle runner, ~32 min — image lands when green).
Tree: #53 (ae-prod) + this PR's control-auth + the CodeRabbit Major fixes.

What changed vs aeprod19 (what to exercise):

  • control-plane auth — bearer-JWT via the shared jwtutils lib. Flag-gated, DEFAULT OFF (CONTROL_REQUIRE_AUTH unset) → dx→AE works unchanged, no dx-side change needed for this test. (Auth e2e is a later step once dx attaches the bearer.)
  • controller: no fan-out when attribution Sink.Write fails (no orphaned Pixie rows); Rehydrate now re-arms rev-1 fan-out for restored windows.
  • passthrough: per-table timeout (hung dep can't stall the sweep).
  • schema: 30d TTL on ae_reconcile.

Ask: swap your rig's AE to 0.14.19-aeprod20 (keep dx as-is) and confirm the cast4/ExtremeC regression still green — reduction + no-loss + detection (log4shell/react2argo rule-in + dx_attack_graph writes), AE 2/2 capturing. The Rehydrate change is the one to watch (restart → does rev-1 fan-out resume?). Ping back with pass/fail and I'll flip auth on next.

Thanks!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 44

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/e2e_log4shell_soc.yaml:
- Around line 14-31: The e2e_log4shell_soc.yaml workflow lacks concurrency
guards, which allows multiple simultaneous runs to race and corrupt shared
self-hosted runner resources like k3s and /etc/rancher/k3s/. Add a concurrency
block at the top level of the workflow (alongside the on and jobs sections) that
specifies a concurrency group to ensure only one run executes at a time. The
concurrency group should cancel in-progress runs when a new run is triggered,
using a meaningful identifier like the workflow name or a fixed string to ensure
all runs of this workflow serialize properly on the oracle-vm-16cpu-64gb-x86-64
runner.
- Around line 58-65: The workflow is directly interpolating `inputs.soc_ref` and
`inputs.dx_image` into shell commands within the run step, creating a command
injection vulnerability. Move these inputs to an env section at the step level,
assign them to environment variables, and validate that they contain only
allowed characters (e.g., alphanumeric, hyphens, dots for versions). Replace the
inline `${{ inputs.soc_ref }}` and `${{ inputs.dx_image }}` references in the
git clone and kubectl commands with properly quoted environment variable
references like `"$SOC_REF"` and `"$DX_IMAGE"` respectively.

In @.github/workflows/vizier_release.yaml:
- Line 143: The runner label oracle-vm-16cpu-64gb-x86-64 in the runs-on field is
not recognized by actionlint, causing lint validation to fail. If this is an
intentional custom self-hosted runner label, register it in your actionlint
configuration file (typically .actionlintrc or actionlint config) to whitelist
this label and allow the workflow to pass validation. Alternatively, if this
label does not exist in your infrastructure, replace it with a valid standard
GitHub Actions runner label or your organization's properly configured
self-hosted runner label.

In `@src/e2e_test/adaptive_export_loadtest/CONTRACTS.md`:
- Line 96: The `## Legend` heading is missing required blank lines for markdown
linting compliance with rule MD022. Add a blank line before the `## Legend`
heading and a blank line after it to separate it from the surrounding content.
This ensures proper spacing around the heading section in the CONTRACTS.md file.

In `@src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md`:
- Around line 50-59: The markdown file has formatting violations in fenced code
blocks at lines 50-59 and 95-97. Add a blank line before and after each fenced
code block to comply with MD031, and add language identifiers to the opening
fence markers (e.g., change ``` to ```sql for SQL code blocks and ```text or
appropriate language for other blocks). This applies to all fenced code blocks
that currently lack language tags or proper spacing.

In `@src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md`:
- Around line 17-18: The documentation in EXPERIMENTS.md incorrectly states that
fixture timestamps are in unix nanoseconds, but the actual harness
implementation in lib.sh (lines 145-147) uses seconds for the event_time field.
Update the documentation at lines 17-18 to accurately reflect that timestamps
are in seconds rather than nanoseconds, and also correct the same misstatement
at lines 35-36 to ensure consistency with the actual harness behavior and
prevent users from undersizing ADAPTIVE_WINDOW_BEFORE_SEC.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh`:
- Around line 77-80: The first inj command in the E4 test case (line 79) uses ||
true which silently suppresses injection failures, masking when the first
boundary-collision row fails to insert. This weakens the test reproducibility
assertion. Apply the same error handling pattern used in the second inj command
(line 80) to the first inj command, replacing the || true with a proper error
handler that logs the INJECT_FAIL status and continues the experiment, ensuring
both fixture injections are properly validated before the test proceeds.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh`:
- Around line 74-80: The INJECT command invocations in the stall-detection
experiment are using the || true pattern to suppress failures, which causes
failed anomaly inserts to be treated as success and can produce false STALL
flags. Remove the || true suffix from both INJECT command calls (the first one
with the --same-time flag and the second one without it) to allow injection
failures to properly propagate and be detected, ensuring accurate stall
detection in the experiment.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_matrix.sh`:
- Around line 80-85: The measure function accepts a valid parameter but does not
include it when writing results to the RES file at line 84. Add the valid
parameter to the printf statement that writes to RES so it captures whether each
measurement is valid. Additionally, update the summary computation at lines
121-124 to filter the results in RES by the validity column before computing the
aggregate statistics, ensuring the reported means only include measurements
marked as valid.
- Around line 114-116: The matrix arms currently have
ADAPTIVE_PUSH_PIXIE_ROWS=false set on both the ALL arm run_arm call at line 114
and the DX arm run_arm call at line 116, which disables the data-plane write
path and prevents proper measurement of data-volume reduction. To fix this,
change ADAPTIVE_PUSH_PIXIE_ROWS=false to ADAPTIVE_PUSH_PIXIE_ROWS=true in at
least one of these run_arm calls (or both as needed) to enable the data-plane
write path so that data-volume reduction can be properly measured during the
adaptive export load test.
- Line 78: The r0001_recent function has a timestamp unit mismatch where it
multiplies by 1000000000 to convert to nanoseconds, but the event_time field is
stored in seconds according to the harness documentation. This causes the WHERE
clause comparison to fail or undercount valid events. Fix this by removing the
multiplication factor of 1000000000 from the timestamp calculation, so that the
comparison is between seconds and seconds instead of seconds and nanoseconds.
Change the expression from toUInt64((now()-130))*1000000000 to
toUInt64(now()-130).

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_row_reconcile.sh`:
- Around line 54-56: The reconciliation test setup on line 55 of
exp_row_reconcile.sh disables the write path by setting
ADAPTIVE_PUSH_PIXIE_ROWS=false, which prevents the CH protocol-row writes that
the test is meant to validate, causing false reconciliation failures. Remove the
ADAPTIVE_PUSH_PIXIE_ROWS=false flag from the kubectl set env command in the
adaptive-export deployment to ensure the write path remains enabled during the
reconciliation test.

In `@src/e2e_test/adaptive_export_loadtest/harness/inject.sh`:
- Around line 23-27: The row-shape header comment in the inject.sh file
incorrectly documents event_time as unix-NANOS on line 26, but the actual
contract documented in lines 32-37 and the script's usage shows it is
unix-SECONDS. Correct the comment to change "event_time (UInt64 unix-NANOS)" to
"event_time (UInt64 unix-SECONDS)" to accurately reflect the seconds unit that
the script actually uses and expects.
- Around line 96-108: Replace the predictable temporary filename
`/tmp/inject_resp.$$` with a securely generated temporary file using `mktemp`.
Create a new temporary file at the start of the code block using `mktemp` and
store it in a variable, then substitute all occurrences of `/tmp/inject_resp.$$`
with this variable throughout the curl command, the response file output
redirection, the cat command that displays the response, and the rm command that
cleans up the file. This eliminates the security risk of predictable temp file
names that could be subject to symlink attacks or collisions.

In `@src/e2e_test/adaptive_export_loadtest/harness/lib.sh`:
- Line 33: The bash script header marks KUBECONFIG as required using set -uo
pipefail, but the kubectl operations at lines 51-53 execute without verifying
that KUBECONFIG is actually set and points to a valid file. Add an explicit
guard before the kubectl operations to check that KUBECONFIG is set and
non-empty, ensuring that kubectl commands execute against the intended cluster
context and not the caller's default context. This guard should exit with an
error if KUBECONFIG is unset or empty to prevent unintended mutations.
- Around line 89-94: The chq function is missing timeout controls on the curl
command, which can cause the harness to hang indefinitely if the port-forward or
ClickHouse service becomes unresponsive. Add curl timeout options such as
--connect-timeout for the connection establishment phase and --max-time for the
overall request duration to the curl invocation in the chq function to ensure
the harness can proceed even if the backend stalls.

In `@src/e2e_test/adaptive_export_loadtest/harness/log4shell_fire.sh`:
- Around line 61-66: Add validation immediately after the kubectl commands that
set the BIP and BPORT variables to ensure both contain non-empty values before
proceeding to the curl loop. If either BIP or BPORT is empty, the script should
exit with a clear error message indicating backend service resolution failed.
This prevents malformed requests from being sent by the attacker deployment and
ensures the script fails fast with meaningful diagnostics instead of producing
misleading log4j vulnerability detection results.

In `@src/e2e_test/adaptive_export_loadtest/harness/nfr.sh`:
- Around line 49-56: The script detects no-loss proof mismatches by printing
MISMATCH or READ!=WROTE in the loop that iterates over http_events, dns_events,
and conn_stats, but never exits with a failure status, causing automated runs to
pass despite data loss. Add a flag variable initialized before the loop to track
whether any mismatches occur (checking both the wrote versus CH_rows comparison
and the read versus wrote comparison), then after the loop completes, check this
flag and exit the script with a non-zero status code if any mismatches were
detected.
- Around line 34-39: The setarm function on line 34 suppresses all output and
errors from the kubectl commands using `>/dev/null 2>&1`, which hides failures
from kubectl set env and kubectl rollout status. The run_phase function on line
38 calls setarm without checking its return status, allowing the test to proceed
even if the adaptive-export reconfiguration fails. Remove the output suppression
redirection from the setarm function to allow errors to be visible, and add a
return status check immediately after the setarm call in run_phase so that the
phase aborts with an appropriate error message if the kubectl reconfiguration
fails.

In `@src/e2e_test/adaptive_export_loadtest/harness/run.sh`:
- Around line 32-33: The EVID variable assignment on line 32 uses a hardcoded
user-specific path that is not portable across different environments. Replace
the hardcoded path with a portable default, such as using environment variables
or a system temporary directory. Additionally, the critical pipeline stages in
lines 40-56 lack explicit failure checks, allowing the script to continue even
if a stage fails and produces an invalid verdict set. Add explicit error
handling (such as set -e at the beginning of the script or individual error
checks) to ensure the script exits immediately when any critical stage fails.

In `@src/e2e_test/adaptive_export_loadtest/harness/stats.py`:
- Around line 57-76: The issue is that when there are no numeric PASS values for
a metric (the vals list is empty), the code continues without setting repro_ok
to False, allowing a false positive "EXACTLY REPRODUCIBLE" verdict. To fix this,
add repro_ok = False before the continue statement in the block where not vals
is True, so that any metric lacking valid PASS measurements will correctly mark
the reproducibility check as failed.

In `@src/e2e_test/adaptive_export_loadtest/k8s/00-sinks.yaml`:
- Around line 28-37: Add security hardening to the httpsink container
specification in the pod spec by setting automountServiceAccountToken to false
at the spec level to prevent service account token mounting, and within the
container specification add allowPrivilegeEscalation set to false, drop
unnecessary Linux capabilities, and configure a seccomp profile to restrict
system calls. Apply the same hardening pattern to the pg-sink pod specification
as well to ensure both long-lived sink pods follow the same security baseline
and reduce unnecessary privilege and credential exposure.

In `@src/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yaml`:
- Around line 18-38: The gen-pod.tmpl.yaml pod spec lacks essential security
hardening configurations that should be applied to reduce the attack surface.
Add a securityContext block within the pod spec to enforce non-root execution,
disable privilege escalation, and set a read-only root filesystem. Additionally,
add automountServiceAccountToken set to false at the pod spec level to prevent
unnecessary service account token mounting. These hardening measures should be
applied to the spec section containing the restartPolicy and containers
definitions to ensure the cleanloadgen workload runs with minimal privileges and
reduced security risks.

In `@src/e2e_test/adaptive_export_loadtest/README.md`:
- Around line 51-57: The markdown file has linting compliance issues with the
Layout heading and its associated code fence block. First, ensure there is a
blank line before the "## Layout" heading and a blank line after it to satisfy
MD022 heading spacing requirements. Second, add a language tag to the opening
fence (using triple backticks) to satisfy MD040 requirements. Third, ensure
there is a blank line before the opening fence and a blank line after the
closing fence to satisfy MD031 fenced block spacing requirements. These same
spacing rules also apply to the heading at line 71.

In
`@src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/cleanloadgen/main.go`:
- Around line 64-71: The envInt function currently accepts negative integer
values from environment variables, which can cause unintended behavior in
generation loops. After successfully parsing the integer value with strconv.Atoi
in the envInt function, add a validation check to ensure the parsed value is
non-negative. If the value is negative, call fatalf with an appropriate error
message indicating that the environment variable must be a non-negative integer.

In `@src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod`:
- Line 3: The Go version directive in the go.mod file is set to version 1.22,
which is no longer receiving active security updates. Update the go directive
from version 1.22 to version 1.26 (or 1.25 if necessary) to ensure the loadgen
binaries run on a currently supported Go runtime baseline and receive ongoing
security patches.

In `@src/vizier/services/adaptive_export/cmd/main.go`:
- Around line 799-807: The issue is that isOperatorManagedScript() function uses
a broad prefix check (ch-) to identify operator-managed scripts, which can
accidentally match user-authored scripts that start with ch-. This causes
unintended deletion of user scripts when INSTALL_PRESET_SCRIPTS=true. Replace
the prefix-based matching logic in isOperatorManagedScript() with an explicit
list of known operator-managed script names that the function should check
against. Similarly, apply the same fix to the duplicate logic mentioned at lines
839-847 to ensure both locations use the same explicit script name matching
approach.

In `@src/vizier/services/adaptive_export/internal/chhttp/chhttp.go`:
- Around line 36-38: The QueryStream method claims to support unbounded
incremental streaming but uses the same http.Client that has a timeout set via
DefaultTimeout. In Go, http.Client.Timeout applies to the entire
request-response cycle including body reads, which will cut off long-running
responses mid-stream. For QueryStream, create a separate http.Client instance
without a timeout constraint, or modify the client initialization to
conditionally apply the timeout only for non-streaming operations. Ensure
QueryStream uses the timeout-free client while other operations continue to use
the DefaultTimeout-constrained client.

In `@src/vizier/services/adaptive_export/internal/clickhouse/integration_test.go`:
- Around line 145-152: The test uses a single 30-second context timeout for both
the Apply and VerifyPixieSchema operations, which means if Apply consumes most
of the time, VerifyPixieSchema may fail with a deadline exceeded error even
though the schema is correct. Create separate context.WithTimeout calls for each
operation so that both Apply and VerifyPixieSchema each get their own
independent 30-second timeout deadline.

In `@src/vizier/services/adaptive_export/internal/clickhouse/schema.sql`:
- Around line 68-70: The kubescape_logs table has unsafe unit handling for the
event_time column in both the PARTITION BY and TTL expressions. The
toDateTime(event_time) calls assume event_time is in seconds, but if the actual
values are in milliseconds or nanoseconds, this will produce incorrect partition
keys and TTL horizons. Determine the actual unit of event_time (likely
milliseconds or nanoseconds), then update both the PARTITION BY
toYYYYMM(toDateTime(event_time)) expression on line 68 and the TTL
toDateTime(event_time) + INTERVAL 30 DAY DELETE expression on line 69 to apply
the appropriate unit conversion before passing to toDateTime(), such as dividing
by 1000 for milliseconds or by 1000000000 for nanoseconds.

In `@src/vizier/services/adaptive_export/internal/control/server_test.go`:
- Around line 155-169: The TestBadInputRejected test function is missing
explicit test cases for the new t_end and window validators. Add test cases that
verify the server returns http.StatusBadRequest for t_end <= 0 on the
/export/start endpoint, and for window[0] >= window[1] and non-positive window
values on the /query endpoint. Follow the existing pattern in
TestBadInputRejected by using the do() helper function with appropriate test
payloads for each validator and asserting the expected 400 status code with
descriptive failure messages.

In `@src/vizier/services/adaptive_export/internal/control/server.go`:
- Around line 120-143: The handleDXAttackGraph method is vulnerable to memory
DoS attacks because the decode function reads the entire unbounded request body
and then copies it into a second buffer. Wrap the request body with
http.MaxBytesReader to enforce a maximum size limit before passing the request
to the decode function. This will ensure oversized requests fail gracefully
instead of exhausting memory resources.

In `@src/vizier/services/adaptive_export/internal/controller/controller.go`:
- Around line 316-366: The code modifies c.active[hash] before calling
c.sink.Write(), but if the write fails, it returns without rolling back the
changes to c.active. This allows in-flight workers already processing the same
hash to continue using unpersisted state extensions. Save the prior state of
c.active[hash] before making any modifications (before the block that either
creates a new row or updates the existing row), and then add rollback logic in
the error handling block after the c.sink.Write() call fails to restore the
original state, ensuring that in-flight workers do not reference orphaned state.

In `@src/vizier/services/adaptive_export/internal/e2e/e2e_test.go`:
- Around line 146-167: The test currently validates only the payload shape
within the INSERT bodies, but does not verify that the INSERT statement targets
the correct table forensic_db.adaptive_attribution. Add an assertion that checks
the captured SQL statements (via insertedSQL) to ensure they contain the correct
table reference, preventing regressions where the wrong table could be inserted
into while still maintaining the correct payload structure. This check should be
added alongside or after the existing body content validation loop that checks
for matched payloads.

In `@src/vizier/services/adaptive_export/internal/passthrough/passthrough.go`:
- Around line 124-131: The compiled path silently skips invalid tables when
pxl.CompilePassthrough fails but does not record these failures using
l.rec(...), which changes the error recording semantics compared to legacy mode.
In the error handling block where pxl.CompilePassthrough(table, cfg.Window)
returns an error, add a call to l.rec(...) to record the failure before
continuing to skip the table. Apply this same fix to the similar error handling
code at lines 235-239 that also needs to record failures when tables cannot be
compiled.

In `@src/vizier/services/adaptive_export/internal/passthrough/reconcile_test.go`:
- Around line 34-37: The capRec struct's Record method appends to the rows slice
without synchronization, which causes race conditions when multiple goroutines
call Record concurrently. Add a sync.Mutex field to the capRec struct and use it
to protect the append operation in the Record method by locking before appending
to rows and unlocking after. Also update any code that reads the rows field to
acquire the same lock before accessing the slice to ensure consistent visibility
of all recorded rows.

In `@src/vizier/services/adaptive_export/internal/pixieapi/pixieapi.go`:
- Around line 112-160: Add a TODO comment to the Query method of the Adapter
struct that documents the need to implement a long-lived client with JWT-refresh
mechanism for high-throughput direct-mode scenarios. Place this TODO near the
existing comment about GC-time reclamation and connection-leak mitigation to
ensure future developers are aware of the technical debt and the conditions
under which it should be addressed.

In `@src/vizier/services/adaptive_export/internal/sink/clickhouse.go`:
- Around line 190-207: The unconditional Info log statement that logs "sink:
pixie write completed" with the WithFields call executes on every batch write in
the hot path, creating excessive log volume. Wrap the entire logging block (from
the summary := res.Summary assignment through the Info call) behind a
debug-level check using log.IsLevelEnabled(log.DebugLevel) or a temporary
feature flag so this diagnostic logging only occurs when explicitly enabled for
debugging rather than on every write.

In `@src/vizier/services/adaptive_export/internal/sink/fastencode_test.go`:
- Around line 36-58: The encodeViaJSON and parseNDJSON helper functions are
ignoring errors from enc.Encode and json.Unmarshal respectively, which can allow
tests to pass with invalid data. Modify both encodeViaJSON and parseNDJSON to
accept a *testing.T parameter as the first argument, then check the error
returns from enc.Encode and json.Unmarshal and call t.Fatalf to fail the test
immediately if either operation returns an error. Finally, update all call sites
throughout the test file where encodeViaJSON and parseNDJSON are invoked to pass
the test instance t as the first argument.

In `@src/vizier/services/adaptive_export/internal/sink/fastencode.go`:
- Around line 184-187: The appendFloat function is being called with float32 and
float64 values without first validating that they are finite numbers. Non-finite
values like NaN and Infinity will be emitted as invalid JSON, causing
whole-batch inserts to fail. Add a check before calling appendFloat in both the
float32 case (around line 184) and float64 case (around line 186) to verify the
value is finite using a function like math.IsNaN or math.IsInf. If a non-finite
value is detected, reject it or handle it appropriately instead of passing it to
appendFloat. Apply the same validation pattern to the similar float handling
code around lines 215-218.

In `@src/vizier/services/adaptive_export/internal/sink/integration_test.go`:
- Around line 158-164: The shared 60-second timeout context created before the
loop causes the test to be flaky because it applies to the entire table
iteration sequence rather than individual tables. Move the context.WithTimeout
call inside the for loop that iterates over chpkg.PixieTables() so that each
table gets its own independent 60-second timeout. This way, if one table's
WritePixieRows call takes time, it won't affect the timeout for subsequent
tables, ensuring failures are isolated to the specific table being tested.

In `@src/vizier/services/adaptive_export/internal/streaming/writer.go`:
- Around line 117-134: The flush() method clears buf unconditionally after
attempting WritePixieRows, which causes data loss when the write fails. Only
clear buf by setting it to buf[:0] when WritePixieRows succeeds (move this
operation into the else block), not in the error case. Additionally, in the
shutdown branch (lines 138-140), the timeout context is derived from an
already-canceled parent context, causing the final flush to fail fast. For the
shutdown case, create the timeout context using context.Background() instead of
the canceled parent context to give the final flush operation a proper chance to
complete.

In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go`:
- Around line 73-78: The PollLimit field documentation incorrectly states that 0
means unlimited behavior, but the New() function actually rewrites 0 to 10000.
Update the comment for the PollLimit field to accurately reflect the actual
runtime behavior that 0 is rewritten to the default value of 10000, ensuring
operators understand the correct semantics when tuning this setting.

In
`@src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go`:
- Around line 123-132: The function fingerprintNoFmt is intended to be a
reference implementation that avoids fmt.Fprintf overhead for benchmarking
purposes, but the code still contains a fmt.Fprintf call on the line formatting
r.EventTime with the "%d" format string. Replace this fmt.Fprintf call with a
direct string conversion using strconv (such as strconv.FormatInt) to convert
the EventTime value to a string, then write it directly to the strings.Builder
using WriteString method to truly eliminate the formatting overhead from the
benchmark baseline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0f1144a8-c2c2-4681-a318-0629757dcbdb

📥 Commits

Reviewing files that changed from the base of the PR and between 855db81 and 714c1fb.

📒 Files selected for processing (116)
  • .arclint
  • .bazelignore
  • .github/workflows/e2e_log4shell_soc.yaml
  • .github/workflows/vizier_release.yaml
  • k8s/vizier/bootstrap/adaptive_export_deployment.yaml
  • k8s/vizier/bootstrap/adaptive_export_secrets.yaml
  • k8s/vizier/bootstrap/kustomization.yaml
  • skaffold/skaffold_vizier.yaml
  • src/api/go/pxapi/opts.go
  • src/e2e_test/adaptive_export_loadtest/CONTRACTS.md
  • src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md
  • src/e2e_test/adaptive_export_loadtest/README.md
  • src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md
  • src/e2e_test/adaptive_export_loadtest/harness/ae_config.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_e5.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_matrix.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_row_reconcile.sh
  • src/e2e_test/adaptive_export_loadtest/harness/inject.sh
  • src/e2e_test/adaptive_export_loadtest/harness/lib.sh
  • src/e2e_test/adaptive_export_loadtest/harness/log4shell_fire.sh
  • src/e2e_test/adaptive_export_loadtest/harness/nfr.sh
  • src/e2e_test/adaptive_export_loadtest/harness/run.sh
  • src/e2e_test/adaptive_export_loadtest/harness/stats.py
  • src/e2e_test/adaptive_export_loadtest/k8s/00-sinks.yaml
  • src/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yaml
  • src/e2e_test/adaptive_export_loadtest/tools/loadgen/Dockerfile
  • src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/cleanloadgen/main.go
  • src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/httpsink/main.go
  • src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod
  • src/stirling/source_connectors/socket_tracer/testing/container_images/BUILD.bazel
  • src/vizier/services/adaptive_export/BUILD.bazel
  • src/vizier/services/adaptive_export/cmd/BUILD.bazel
  • src/vizier/services/adaptive_export/cmd/main.go
  • src/vizier/services/adaptive_export/internal/activeset/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/activeset/activeset.go
  • src/vizier/services/adaptive_export/internal/activeset/activeset_test.go
  • src/vizier/services/adaptive_export/internal/anomaly/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/anomaly/hash.go
  • src/vizier/services/adaptive_export/internal/anomaly/hash_bench_test.go
  • src/vizier/services/adaptive_export/internal/anomaly/hash_test.go
  • src/vizier/services/adaptive_export/internal/chhttp/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/chhttp/chhttp.go
  • src/vizier/services/adaptive_export/internal/chhttp/chhttp_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/clickhouse/apply.go
  • src/vizier/services/adaptive_export/internal/clickhouse/apply_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/columns_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/ddl.go
  • src/vizier/services/adaptive_export/internal/clickhouse/ddl_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/insert.go
  • src/vizier/services/adaptive_export/internal/clickhouse/insert_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/integration_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/schema.sql
  • src/vizier/services/adaptive_export/internal/config/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/config/definition.go
  • src/vizier/services/adaptive_export/internal/control/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/control/server.go
  • src/vizier/services/adaptive_export/internal/control/server_test.go
  • src/vizier/services/adaptive_export/internal/controller/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/controller/controller.go
  • src/vizier/services/adaptive_export/internal/controller/controller_test.go
  • src/vizier/services/adaptive_export/internal/e2e/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/e2e/e2e_test.go
  • src/vizier/services/adaptive_export/internal/e2e/loadtest_test.go
  • src/vizier/services/adaptive_export/internal/kubescape/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/kubescape/extract.go
  • src/vizier/services/adaptive_export/internal/kubescape/extract_test.go
  • src/vizier/services/adaptive_export/internal/passthrough/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/passthrough/compiled_test.go
  • src/vizier/services/adaptive_export/internal/passthrough/passthrough.go
  • src/vizier/services/adaptive_export/internal/passthrough/passthrough_test.go
  • src/vizier/services/adaptive_export/internal/passthrough/reconcile_test.go
  • src/vizier/services/adaptive_export/internal/pixie/pixie.go
  • src/vizier/services/adaptive_export/internal/pixieapi/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/pixieapi/pixieapi.go
  • src/vizier/services/adaptive_export/internal/pixieapi/pixieapi_test.go
  • src/vizier/services/adaptive_export/internal/pxl/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/pxl/compile.go
  • src/vizier/services/adaptive_export/internal/pxl/compile_test.go
  • src/vizier/services/adaptive_export/internal/pxl/pxl.go
  • src/vizier/services/adaptive_export/internal/pxl/queryfor.go
  • src/vizier/services/adaptive_export/internal/pxl/queryfor_bench_test.go
  • src/vizier/services/adaptive_export/internal/pxl/queryfor_test.go
  • src/vizier/services/adaptive_export/internal/pxl/tables.go
  • src/vizier/services/adaptive_export/internal/pxl/tables_test.go
  • src/vizier/services/adaptive_export/internal/reconcile/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/reconcile/reconcile.go
  • src/vizier/services/adaptive_export/internal/script/script.go
  • src/vizier/services/adaptive_export/internal/sink/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/sink/clickhouse.go
  • src/vizier/services/adaptive_export/internal/sink/clickhouse_test.go
  • src/vizier/services/adaptive_export/internal/sink/content_type_contract_test.go
  • src/vizier/services/adaptive_export/internal/sink/encode_bench_test.go
  • src/vizier/services/adaptive_export/internal/sink/fastencode.go
  • src/vizier/services/adaptive_export/internal/sink/fastencode_test.go
  • src/vizier/services/adaptive_export/internal/sink/integration_test.go
  • src/vizier/services/adaptive_export/internal/streaming/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/streaming/filter.go
  • src/vizier/services/adaptive_export/internal/streaming/filter_test.go
  • src/vizier/services/adaptive_export/internal/streaming/integration_test.go
  • src/vizier/services/adaptive_export/internal/streaming/notifier.go
  • src/vizier/services/adaptive_export/internal/streaming/notifier_test.go
  • src/vizier/services/adaptive_export/internal/streaming/scanner.go
  • src/vizier/services/adaptive_export/internal/streaming/scanner_test.go
  • src/vizier/services/adaptive_export/internal/streaming/supervisor.go
  • src/vizier/services/adaptive_export/internal/streaming/writer.go
  • src/vizier/services/adaptive_export/internal/trigger/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse_test.go
  • src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go
  • src/vizier/services/adaptive_export/internal/trigger/integration_test.go
  • src/vizier/services/adaptive_export/internal/trigger/watermark.go
  • src/vizier/services/adaptive_export/internal/trigger/watermark_test.go
💤 Files with no reviewable changes (2)
  • src/vizier/services/adaptive_export/internal/pxl/pxl.go
  • src/vizier/services/adaptive_export/internal/config/definition.go

Comment thread .github/workflows/vizier_release.yaml
Comment thread src/e2e_test/adaptive_export_loadtest/CONTRACTS.md
Comment thread src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md
Comment on lines +17 to +18
isolates each rep even with overlapping windows). Timestamps are explicit unix
nanos — fixtures NEVER use wall-clock `now()`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timestamp unit contract is inconsistent with harness behavior.

Line 17 and Line 35 document nanosecond semantics, but src/e2e_test/adaptive_export_loadtest/harness/lib.sh (Line 145–147) states fixture event_time is in seconds. Following this doc as written can undersize ADAPTIVE_WINDOW_BEFORE_SEC and miss band rows.

Suggested doc correction
- Timestamps are explicit unix nanos — fixtures NEVER use wall-clock `now()`.
+ Timestamps are explicit unix seconds — fixtures NEVER use wall-clock `now()`.

- 3. AE config: `ADAPTIVE_WINDOW_BEFORE_SEC ≥ (B1−B0)/1e9 + margin` so window start
+ 3. AE config: `ADAPTIVE_WINDOW_BEFORE_SEC ≥ (B1−B0) + margin` so window start

Also applies to: 35-36

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md` around lines
17 - 18, The documentation in EXPERIMENTS.md incorrectly states that fixture
timestamps are in unix nanoseconds, but the actual harness implementation in
lib.sh (lines 145-147) uses seconds for the event_time field. Update the
documentation at lines 17-18 to accurately reflect that timestamps are in
seconds rather than nanoseconds, and also correct the same misstatement at lines
35-36 to ensure consistency with the actual harness behavior and prevent users
from undersizing ADAPTIVE_WINDOW_BEFORE_SEC.

Comment thread src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh
Comment thread src/vizier/services/adaptive_export/internal/sink/fastencode.go
Comment thread src/vizier/services/adaptive_export/internal/sink/integration_test.go Outdated
Comment thread src/vizier/services/adaptive_export/internal/streaming/writer.go Outdated
Comment thread src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
Comment on lines +123 to +132
// fingerprintNoFmt is the Fprintf-free reference. Same output guarantee
// is NOT asserted here — this is a perf-comparison anchor only. If the
// numbers diverge by >2× from rowFingerprint, the fmt.Fprintf path is
// a real cost.
func fingerprintNoFmt(r kubescape.Row) string {
h := sha256.New()
var b strings.Builder
b.Grow(64 + len(r.RuleID) + len(r.Hostname) + len(r.K8sDetails) + len(r.ProcessDetails))
_, _ = fmt.Fprintf(&b, "%d", r.EventTime)
b.WriteByte(0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

fingerprintNoFmt still uses fmt.Fprintf, so the benchmark baseline is skewed.

Line 131 keeps formatting overhead in the “no fmt” path, which weakens the comparison.

Proposed fix
 import (
 	"crypto/sha256"
 	"encoding/hex"
-	"fmt"
+	"strconv"
 	"strings"
 	"testing"
@@
-	_, _ = fmt.Fprintf(&b, "%d", r.EventTime)
+	var nbuf [20]byte
+	b.Write(strconv.AppendUint(nbuf[:0], r.EventTime, 10))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go`
around lines 123 - 132, The function fingerprintNoFmt is intended to be a
reference implementation that avoids fmt.Fprintf overhead for benchmarking
purposes, but the code still contains a fmt.Fprintf call on the line formatting
r.EventTime with the "%d" format string. Replace this fmt.Fprintf call with a
direct string conversion using strconv (such as strconv.FormatInt) to convert
the EventTime value to a string, then write it directly to the strings.Builder
using WriteString method to truly eliminate the formatting overhead from the
benchmark baseline.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review continued from previous batch...

Comment on lines +14 to +31
on:
workflow_dispatch:
inputs:
dx_image:
description: dx-daemon image to test (default = .image-tags pin)
required: false
default: ""
soc_ref:
description: k8sstormcenter/soc branch
required: false
default: "218-clickhouse-schema"
permissions:
contents: read

jobs:
e2e:
runs-on: oracle-vm-16cpu-64gb-x86-64 # eBPF + 16cpu/64gb; ubuntu-latest cannot run Pixie
timeout-minutes: 90

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add workflow concurrency guards for the self-hosted rig.

This job provisions global host resources (k3s, /etc/rancher/k3s/...). Without concurrency, overlapping manual runs can race and corrupt each other.

Proposed fix
 on:
   workflow_dispatch:
@@
+concurrency:
+  group: e2e-log4shell-soc-${{ github.ref }}
+  cancel-in-progress: false
+
 jobs:
   e2e:
🧰 Tools
🪛 actionlint (1.7.12)

[error] 30-30: label "oracle-vm-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🪛 zizmor (1.25.2)

[info] 29-29: workflow or action definition without a name (anonymous-definition): this job

(anonymous-definition)


[warning] 14-24: insufficient job-level concurrency limits (concurrency-limits): workflow is missing concurrency setting

(concurrency-limits)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e_log4shell_soc.yaml around lines 14 - 31, The
e2e_log4shell_soc.yaml workflow lacks concurrency guards, which allows multiple
simultaneous runs to race and corrupt shared self-hosted runner resources like
k3s and /etc/rancher/k3s/. Add a concurrency block at the top level of the
workflow (alongside the on and jobs sections) that specifies a concurrency group
to ensure only one run executes at a time. The concurrency group should cancel
in-progress runs when a new run is triggered, using a meaningful identifier like
the workflow name or a fixed string to ensure all runs of this workflow
serialize properly on the oracle-vm-16cpu-64gb-x86-64 runner.

Source: Linters/SAST tools

Comment thread .github/workflows/e2e_log4shell_soc.yaml Outdated
Comment on lines +77 to +80
E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
filt="cp-e4-${R}"
inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true
inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

E4 currently masks a failed first fixture insert.

Line 79 uses || true, so E4 can pass even when the first boundary-collision row was never inserted. This weakens the reproducibility assertion for that experiment.

Proposed fix
     E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
       filt="cp-e4-${R}"
-      inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true
+      inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time \
+        || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
       inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
       ;;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
filt="cp-e4-${R}"
inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true
inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
filt="cp-e4-${R}"
inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time \
|| { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh` around lines 77
- 80, The first inj command in the E4 test case (line 79) uses || true which
silently suppresses injection failures, masking when the first
boundary-collision row fails to insert. This weakens the test reproducibility
assertion. Apply the same error handling pattern used in the second inj command
(line 80) to the first inj command, replacing the || true with a proper error
handler that logs the INJECT_FAIL status and continues the experiment, ensuring
both fixture injections are properly validated before the test proceeds.

@@ -0,0 +1,5 @@
module aeload

go 1.22

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the latest patched Go release in the 1.22 line, and which GO advisories are fixed there (including GO-2026-4870, GO-2026-4918, GO-2026-4980)?

💡 Result:

As of June 21, 2026, the latest release in the Go 1.22 line is Go 1.22.12 [1][2]. Regarding the specified Go security advisories: - GO-2026-4870: This advisory affects crypto/tls and is noted as affecting versions before go1.25.9, as well as versions from go1.26.0-0 before go1.26.2 [3]. It is not listed as affecting the 1.22 release line. - GO-2026-4918: This advisory concerns net/http and golang.org/x/net/http2 [4][5]. It affects net/http versions before go1.25.10 and versions from go1.26.0-0 before go1.26.3 [4]. It is not listed as affecting the 1.22 release line. - GO-2026-4980: This advisory concerns html/template and affects versions before go1.25.10, as well as versions from go1.26.0-0 before go1.26.3 [6]. It is not listed as affecting the 1.22 release line. Go 1.22 has reached a state where it is no longer receiving the same security updates as newer Go release lines (such as 1.25 and 1.26) [2][3][4][6]. Users are encouraged to upgrade to a supported release version [2].

Citations:


Upgrade to a currently supported Go release.

Go 1.22 is no longer receiving active security updates. The specified Go advisories (GO-2026-4870, GO-2026-4918, GO-2026-4980) do not affect the 1.22 release line but affect newer versions. Upgrade to the latest 1.25 or 1.26 release to ensure the loadgen binaries run on a supported runtime baseline.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod` at line 3, The Go
version directive in the go.mod file is set to version 1.22, which is no longer
receiving active security updates. Update the go directive from version 1.22 to
version 1.26 (or 1.25 if necessary) to ensure the loadgen binaries run on a
currently supported Go runtime baseline and receive ongoing security patches.

Source: Linters/SAST tools

…#67)

ae-prod's boundary handling accumulates the seen-fingerprint set on a no-progress
tick, but if >PollLimit rows share one normalized event_time the SQL
(>= watermark ORDER BY time LIMIT N, no secondary key) returns the same N boundary
rows every poll → rows beyond N at that timestamp are never emitted (infinite
boundary). Detect all-skipped-at-capacity and advance the watermark by 1ns to make
forward progress (fingerprint dedup already tolerates the 1ns overlap).

Cherry-picked the trigger fix + its test from the stale CodeRabbit-chat PR #67
(687851d); dropped that PR's unrelated gen-pod.tmpl.yaml churn. #67 itself is
NOT mergeable (77 commits behind ae-prod, re-adds deleted terraform).
@entlein

entlein commented Jun 22, 2026

Copy link
Copy Markdown
Author

there are indications that it causes PEM to crash, needs RCA

entlein and others added 2 commits June 23, 2026 18:28
Auth without TLS is half a control: tcpdump on dx→AE :9100 captured 720
cleartext `Authorization: Bearer` JWTs in 70s — the #68 token crosses the
CNI in plaintext. CONTROL_TLS=true now serves TLS with server.crt/key from
the service-tls-certs secret (broker/PEM already use it; dx skip-verifies).
Default-OFF for incremental rollout, symmetric to CONTROL_REQUIRE_AUTH.
Stacks on #68 (ae-followup-auth). dx client half: entlein/dx#88.

Co-authored-by: Entlein <eineintlein@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	src/vizier/services/adaptive_export/cmd/main.go
#	src/vizier/services/adaptive_export/internal/clickhouse/schema.sql
#	src/vizier/services/adaptive_export/internal/control/BUILD.bazel
#	src/vizier/services/adaptive_export/internal/control/server.go
#	src/vizier/services/adaptive_export/internal/control/server_test.go
#	src/vizier/services/adaptive_export/internal/controller/controller.go
#	src/vizier/services/adaptive_export/internal/passthrough/passthrough.go
#	src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
#	src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go
@entlein

entlein commented Jun 23, 2026

Copy link
Copy Markdown
Author

Merged main in — conflicts resolved (now MERGEABLE). main landed #53 (production AE), which this branch builds on (+auth +TLS +#53-followups +#67). 9 files conflicted; all resolved keeping this branch's followup work on top of #53 (main's side was the pre-followup #53 in every case):

file kept
cmd/main.go bearer-auth block + CONTROL_TLS/ListenAndServeTLS
clickhouse/schema.sql 30d TTL on the debug-log table
control/BUILD.bazel //src/shared/services/utils dep (auth)
control/server.go SetAuth + auth middleware + input validation
control/server_test.go TestControlAuth
controller/controller.go re-arm fan-out on restart + skip-fan-out on attribution-persist-fail (CodeRabbit)
passthrough/passthrough.go per-table context.WithTimeout (CodeRabbit)
trigger/clickhouse.go #67 PollLimit watermark boundary-escape
trigger/clickhouse_internal_test.go sync.Mutex race-fix on the test

Verified: 0 conflict markers, gofmt clean on all resolved .go, and signature coherence (control.New/SetAuth/jwtutils.ParseToken calls match defs). Bazel build is the CI gate (not run locally).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/vizier/services/adaptive_export/cmd/main.go (1)

583-588: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Fail-open when CONTROL_REQUIRE_AUTH=true but signing key is missing.

If an operator explicitly sets CONTROL_REQUIRE_AUTH=true but PL_JWT_SIGNING_KEY is empty/unmounted, the condition falls through to the else branch and the control surface comes up with auth disabled — only a Warn is emitted. A misconfiguration (forgot to mount the key) thus silently leaves /export/start, /export/stop, /dx/attack_graph, and /query open, contradicting the REQUIRE intent. This should fail-fast instead of fail-open.

🔒 Proposed fix to fail-fast on the require-but-misconfigured case
-		if key := os.Getenv("PL_JWT_SIGNING_KEY"); key != "" && os.Getenv("CONTROL_REQUIRE_AUTH") == "true" {
-			ctrlSrv.SetAuth(key, "vizier")
-			log.Info("control surface: bearer-JWT auth ENABLED (audience=vizier)")
-		} else {
-			log.Warn("control surface: auth DISABLED (set CONTROL_REQUIRE_AUTH=true + PL_JWT_SIGNING_KEY)")
-		}
+		if os.Getenv("CONTROL_REQUIRE_AUTH") == "true" {
+			key := os.Getenv("PL_JWT_SIGNING_KEY")
+			if key == "" {
+				log.Fatal("CONTROL_REQUIRE_AUTH=true but PL_JWT_SIGNING_KEY is empty — refusing to start an unauthenticated control surface")
+			}
+			ctrlSrv.SetAuth(key, "vizier")
+			log.Info("control surface: bearer-JWT auth ENABLED (audience=vizier)")
+		} else {
+			log.Warn("control surface: auth DISABLED (set CONTROL_REQUIRE_AUTH=true + PL_JWT_SIGNING_KEY)")
+		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/cmd/main.go` around lines 583 - 588, The
current logic allows the control surface to start with auth disabled when
CONTROL_REQUIRE_AUTH is explicitly set to "true" but PL_JWT_SIGNING_KEY is empty
or missing. This is a fail-open misconfiguration scenario. Refactor the
conditional logic to detect when CONTROL_REQUIRE_AUTH equals "true" but the
signing key is empty, and in that case call log.Fatal with a descriptive error
message to fail-fast instead of falling through to the else branch. The else
branch should only warn and disable auth when CONTROL_REQUIRE_AUTH is not
explicitly set to "true".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/vizier_release.yaml:
- Line 18: The runs-on field is missing the vm prefix in the oracle runner
label, causing actionlint to flag it as unknown. Update the runs-on field value
from oracle-16cpu-64gb-x86-64 to oracle-vm-16cpu-64gb-x86-64 by adding the vm
prefix to match the canonical form used consistently across other workflows and
documented in the infrastructure documentation.

---

Outside diff comments:
In `@src/vizier/services/adaptive_export/cmd/main.go`:
- Around line 583-588: The current logic allows the control surface to start
with auth disabled when CONTROL_REQUIRE_AUTH is explicitly set to "true" but
PL_JWT_SIGNING_KEY is empty or missing. This is a fail-open misconfiguration
scenario. Refactor the conditional logic to detect when CONTROL_REQUIRE_AUTH
equals "true" but the signing key is empty, and in that case call log.Fatal with
a descriptive error message to fail-fast instead of falling through to the else
branch. The else branch should only warn and disable auth when
CONTROL_REQUIRE_AUTH is not explicitly set to "true".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0211e817-9080-4f53-bbd5-5df5e5a74ef9

📥 Commits

Reviewing files that changed from the base of the PR and between 714c1fb and 0c65751.

📒 Files selected for processing (5)
  • .github/workflows/vizier_release.yaml
  • k8s/vizier/bootstrap/adaptive_export_deployment.yaml
  • src/vizier/services/adaptive_export/cmd/main.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/vizier/services/adaptive_export/cmd/main.go (1)

583-588: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Fail-open when CONTROL_REQUIRE_AUTH=true but signing key is missing.

If an operator explicitly sets CONTROL_REQUIRE_AUTH=true but PL_JWT_SIGNING_KEY is empty/unmounted, the condition falls through to the else branch and the control surface comes up with auth disabled — only a Warn is emitted. A misconfiguration (forgot to mount the key) thus silently leaves /export/start, /export/stop, /dx/attack_graph, and /query open, contradicting the REQUIRE intent. This should fail-fast instead of fail-open.

🔒 Proposed fix to fail-fast on the require-but-misconfigured case
-		if key := os.Getenv("PL_JWT_SIGNING_KEY"); key != "" && os.Getenv("CONTROL_REQUIRE_AUTH") == "true" {
-			ctrlSrv.SetAuth(key, "vizier")
-			log.Info("control surface: bearer-JWT auth ENABLED (audience=vizier)")
-		} else {
-			log.Warn("control surface: auth DISABLED (set CONTROL_REQUIRE_AUTH=true + PL_JWT_SIGNING_KEY)")
-		}
+		if os.Getenv("CONTROL_REQUIRE_AUTH") == "true" {
+			key := os.Getenv("PL_JWT_SIGNING_KEY")
+			if key == "" {
+				log.Fatal("CONTROL_REQUIRE_AUTH=true but PL_JWT_SIGNING_KEY is empty — refusing to start an unauthenticated control surface")
+			}
+			ctrlSrv.SetAuth(key, "vizier")
+			log.Info("control surface: bearer-JWT auth ENABLED (audience=vizier)")
+		} else {
+			log.Warn("control surface: auth DISABLED (set CONTROL_REQUIRE_AUTH=true + PL_JWT_SIGNING_KEY)")
+		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/cmd/main.go` around lines 583 - 588, The
current logic allows the control surface to start with auth disabled when
CONTROL_REQUIRE_AUTH is explicitly set to "true" but PL_JWT_SIGNING_KEY is empty
or missing. This is a fail-open misconfiguration scenario. Refactor the
conditional logic to detect when CONTROL_REQUIRE_AUTH equals "true" but the
signing key is empty, and in that case call log.Fatal with a descriptive error
message to fail-fast instead of falling through to the else branch. The else
branch should only warn and disable auth when CONTROL_REQUIRE_AUTH is not
explicitly set to "true".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/vizier_release.yaml:
- Line 18: The runs-on field is missing the vm prefix in the oracle runner
label, causing actionlint to flag it as unknown. Update the runs-on field value
from oracle-16cpu-64gb-x86-64 to oracle-vm-16cpu-64gb-x86-64 by adding the vm
prefix to match the canonical form used consistently across other workflows and
documented in the infrastructure documentation.

---

Outside diff comments:
In `@src/vizier/services/adaptive_export/cmd/main.go`:
- Around line 583-588: The current logic allows the control surface to start
with auth disabled when CONTROL_REQUIRE_AUTH is explicitly set to "true" but
PL_JWT_SIGNING_KEY is empty or missing. This is a fail-open misconfiguration
scenario. Refactor the conditional logic to detect when CONTROL_REQUIRE_AUTH
equals "true" but the signing key is empty, and in that case call log.Fatal with
a descriptive error message to fail-fast instead of falling through to the else
branch. The else branch should only warn and disable auth when
CONTROL_REQUIRE_AUTH is not explicitly set to "true".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0211e817-9080-4f53-bbd5-5df5e5a74ef9

📥 Commits

Reviewing files that changed from the base of the PR and between 714c1fb and 0c65751.

📒 Files selected for processing (5)
  • .github/workflows/vizier_release.yaml
  • k8s/vizier/bootstrap/adaptive_export_deployment.yaml
  • src/vizier/services/adaptive_export/cmd/main.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go
🛑 Comments failed to post (1)
.github/workflows/vizier_release.yaml (1)

18-18: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Runner label appears to be missing the vm prefix.

Line 18 specifies oracle-16cpu-64gb-x86-64, but the change summary indicates the intent was oracle-vm-16cpu-64gb-x86-64. The current label is flagged by actionlint as unknown, and the vm-prefixed variant is used consistently in .github/workflows/e2e_log4shell_soc.yaml and documented in the infrastructure README as the canonical form.

🔧 Proposed fix: add `vm` prefix
-    runs-on: oracle-16cpu-64gb-x86-64
+    runs-on: oracle-vm-16cpu-64gb-x86-64
🧰 Tools
🪛 actionlint (1.7.12)

[error] 18-18: label "oracle-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/vizier_release.yaml at line 18, The runs-on field is
missing the vm prefix in the oracle runner label, causing actionlint to flag it
as unknown. Update the runs-on field value from oracle-16cpu-64gb-x86-64 to
oracle-vm-16cpu-64gb-x86-64 by adding the vm prefix to match the canonical form
used consistently across other workflows and documented in the infrastructure
documentation.

entlein added 2 commits June 24, 2026 07:46
… → 73.7% cov)

The trigger's incremental kubescape-events pump has 3 moving parts
that must agree: watermark advancement, boundary fingerprint dedup,
and PollLimit-saturated draining (PR #67 fix). Existing tests cover
each in isolation. This new file pins them TOGETHER against the
simplest possible reference ("drain everything in event_time order,
dedupe by fingerprint, advance the cursor to max(event_time)"). If
the iterative trigger and the reference disagree on the set of
emitted rows for ANY poll sequence, one of the three parts is wrong.

Four oracle tests:

  TestOracle_TriggerEmitsNaiveSet_StaggeredCorpus — 50 rows scattered
  across distinct event_times, PollLimit=10 forces ≥5 polls; trigger
  must emit exactly the 50 rows the naive reference computes.

  TestOracle_PollLimitSaturation_AtCapacity — regression guard for PR
  #67 (dfdc465): when EXACTLY PollLimit rows share a boundary
  event_time, every one of them must emit, and the cursor must clear
  the boundary for the next-event_time row that follows.

  TestOracle_PollLimitOverflow_DocumentsLossBound — pins PR #67's
  intentional trade-off: when >PollLimit rows share a boundary
  event_time, the first PollLimit emit, then the 1ns escape advances
  the cursor past the surplus. Lock-in test: if a future fix recovers
  the surplus (good!) this test fails loudly so both can update in
  lock-step instead of regressing silently.

  TestOracle_BoundaryDedup_NoDuplicates — when CH returns the same
  boundary row in two consecutive polls (real production case: a new
  row lands at the same event_time after a previous poll's cursor
  advanced past it), the seenAtBoundary map must filter the duplicate.

Coverage: trigger 71.6% → 73.7% (statement-level). The oracle's value
isn't in statement count — it's in property-level coverage on
invariants the per-feature tests can't enforce together.

Build.bazel: adds oracle_test.go to the existing trigger_test target.
…cle-vm-16cpu)

Merge from main re-introduced the deprecated 'oracle-16cpu-64gb-x86-64'
label on the Build Release job. That label doesn't resolve in the
fork's runner pool, so aeprod22's release workflow sat queued
indefinitely. update-gh-artifacts-manifest at L143 already had the
'-vm-' variant; align Build Release at L18 to match. Same fix the
fork applied earlier in 21d536e for the standalone vizier_release
workflow — main keeps regressing it.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/vizier/services/adaptive_export/internal/trigger/oracle_test.go`:
- Around line 351-365: The overflow test has two issues: the timeout branch in
the receive loop only breaks the select, so the outer loop can spin forever, and
the error returned by tr.Subscribe(ctx) is being ignored. Update the Subscribe
setup in oracle_test.go to check the error and fail fast with t.Fatalf, and
change the deadline case in the collected-events loop to explicitly fail the
test or exit the loop so the timeout path cannot hang.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 50bd8567-0709-4cfd-b548-c297ff735a59

📥 Commits

Reviewing files that changed from the base of the PR and between 0c65751 and 0fd9c3f.

📒 Files selected for processing (3)
  • .github/workflows/vizier_release.yaml
  • src/vizier/services/adaptive_export/internal/trigger/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/trigger/oracle_test.go

Comment thread src/vizier/services/adaptive_export/internal/trigger/oracle_test.go
entlein added 5 commits June 24, 2026 10:21
…i.bzl, fork cockpit)

The merge of main into ae-followup-auth (0c65751) silently took the
upstream version on a swath of files where the fork had previously
deviated. This commit reverts the merge's regressions back to the
fork-correct state (origin/ae-prod) AND completes the runner-label
sweep so every release/mirror/perf workflow uses the same -vm-16cpu
label the fork's runner pool actually has.

1. Runner labels — main re-introduced 'oracle-16cpu-64gb-x86-64' and
   'oracle-8cpu-32gb-x86-64' on 7 workflows. Fork's pool only resolves
   'oracle-vm-16cpu-64gb-x86-64'; both stale labels sit queued forever.
   Fixed: cli_release.yaml (L18+L212), cloud_release.yaml (L18),
   mirror_demos.yaml (L12), mirror_deps.yaml (L12),
   mirror_releases.yaml (L13), operator_release.yaml (L18+L143),
   perf_common.yaml (L37+L60). vizier_release.yaml L18 was fixed
   already in 0fd9c3f.

2. bazel/ui.bzl — main reverted PR #64's webpack-build fixes that
   broke release/cloud/v0.0.10 with 'export: `18': not a valid
   identifier'. Restored: 'set -x' for action-shell tracing, PATH
   that puts /opt/px_dev/tools/node/bin FIRST, 'hash -r', the
   STABLE_BUILD_TAG|BUILD_TIMESTAMP allowlist sed (vs the wildcard
   that word-splits FORMATTED_DATE), and use_default_shell_env=True
   so --incompatible_strict_action_env doesn't strip yarn from PATH.

3. 28 fork-cloud config files — main's PR pixie-io#2391 (cert-manager
   migration) deleted private/cockpit/*,
   terraform/kubernetes/auth0/*, terraform/kubernetes/cloud_deps/*,
   .sops.yaml, private/skaffold_cloud.yaml. These are still load-
   bearing for the AOCC pixie-cloud deployment; the fork hasn't
   migrated to cert-manager-compatible secrets yet (PR pixie-io#2391's
   monitor.go fallback path is in place, so adoption is the
   follow-up, not a blocker). Restored all 28 from origin/ae-prod.

Genuine main pickups that were CORRECT to keep (no fix needed): the
src/utils/shared/k8s/{apply,delete}.go import-order +
sets.New[string] generics migration,
src/operator/controllers/monitor.go's cert-manager secret fallback,
and the src/carnot/BUILD.bazel + src/carnot/exec/BUILD.bazel
additions.
11 TRUE + 2 PARTIAL CodeRabbit comments verified against current code;
all valid. This commit lands every one of them as a discrete change,
keeps the existing tests green, adds new tests where the contract
itself changed.

🔴 Real bugs:

  controller/controller.go: handle() now SNAPSHOTS c.active[hash]
    before mutation and ROLLS BACK on sink.Write failure. Without
    this, a failed persist left c.active extended; an already-running
    pushPixieRows would re-snapshot and fan out data based on an
    attribution row that never landed in CH. Updated
    TestController_SinkErrorNonFatal to pin the new rollback contract
    (active==0 after sink error) and added a writeAttempts() observer
    on fakeSink so the test doesn't race.

  sink/fastencode.go: appendJSONValue now rejects NaN/+Inf/-Inf
    floats with errFastEncodeUnsupported (triggers the encoding/json
    fallback). Previously strconv.AppendFloat emitted invalid JSON
    tokens that made CH reject the whole batch.

  streaming/writer.go: flush() keeps the row buffer on write failure
    (so the next attempt retries instead of silently dropping), and
    the shutdown branch uses context.Background() for the final flush
    so it isn't fast-failed by the already-cancelled parent ctx.

  control/server.go: decode() now wraps the body in
    http.MaxBytesReader(w, body, 4 MiB) so an oversized JSON payload
    on the public(-ish) control endpoint can't OOM the operator. The
    JWT auth still gates access — this hardens what a holding-an-
    acceptable-JWT attacker can do.

🟠 Lower-impact:

  cmd/main.go: isOperatorManagedScript matches the EXACT builtin
    names (no "ch-" prefix), so a user-authored script named
    "ch-something-custom" can't get deleted under
    INSTALL_PRESET_SCRIPTS=true.

  chhttp/chhttp.go: QueryStream now uses a separate http.Client with
    no Timeout. Go's http.Client.Timeout covers body reads, so
    reusing the 30s default would silently truncate a multi-MB
    active-set rehydrate. Stream callers must bound via ctx.Deadline.

  passthrough/passthrough.go: tickConcurrent's precompile-skip branch
    now calls l.rec(...) so the compiled and legacy paths produce
    identical reconcile-row counts. Previously the compiled path
    silently dropped a table → invisible divergence.

  trigger/oracle_test.go: COLLECT: labelled break stops the busy-spin
    on deadline expiry. Bare "break" only exits the select, leaving
    the for-loop to busy-spin until the count condition is reached.

🟡 Cosmetic / cleanup:

  control/server_test.go: TestBadInputRejected now pins t_end<=0 on
    /export/start AND inverted/zero window on /query — 4 new
    assertions for validators that existed but were untested.

  pixieapi/pixieapi.go: TODO comment marking the bounded pxapi.Client
    leak for follow-up when direct-mode throughput crosses ~1 q/s.

  sink/clickhouse.go: "sink: pixie write completed" demoted Info
    → Debug. One log per Pixie batch in fan-out paths was avoidable
    log-volume pressure; the silent-drop guard below still fires
    loud on the actual failure mode.

  sink/integration_test.go: per-table 15s ctx (was a shared 60s for
    the entire loop) so a slow early table can't starve later ones.

  trigger/clickhouse.go: PollLimit docstring now matches the
    0→10000 rewrite in New(); "unlimited" is no longer a documented
    option.

Local gates: build exit 0, go test 14/14 packages pass,
tools/linters/lint_like_ci.sh OKAY (0 Errors, 0 Warnings).
…lename-linter

The fork's filename-linter (inherited from upstream pixie-io) rejects
any PR diff that touches a path containing 'private'. The restored
fork-cloud config from the main-merge fix (cb81ecd) added 12
private/cockpit/* + 1 terraform/credentials/cockpit/* files, which
upstream-main had deleted via PR pixie-io#2391 — the linter then correctly
saw them as 'new private/* additions' and failed.

These specific paths are LEGITIMATE fork-cloud deployment config that
pre-dates the upstream linter; the original 'no private/* leaks'
intent stays in place for everything else via two negative-glob
exceptions.
…ccepts the fork-cloud config upstream

Honest fix for the merge-regression: the fork's filename-linter rejects
any PR diff touching a path containing 'private' (an upstream-pixie-io
guard against secrets leaking into OSS PRs). The fork is a CONTRIBUTOR
that upstreams, so the guard applies; the prior 'restore from ae-prod'
fix legitimately re-added private/cockpit/* (which upstream-main pixie-io#2391
deleted via cert-manager migration) but tripped the linter.

Reverts the lazy carve-out fix (c579e48 reverted 7f43c51) and does
the structural rename instead.

Moves:
  private/cockpit/                  -> cockpit/
  private/skaffold_cloud.yaml       -> skaffold/skaffold_cloud.yaml
                                       (collapsed with the upstream-
                                       resurrected skaffold/skaffold_cloud.yaml
                                       — the fork's private/ version was
                                       the maintained superset)

Reference updates:
  cockpit/kustomization.yaml  ../../k8s/cloud/...  ->  ../k8s/cloud/...
                                       (depth-1 after the rename)
  skaffold/skaffold_cloud.yaml:84  - private/cockpit  ->  - cockpit

Linter intent preserved: any future PR that touches a real 'private/*'
path still fails — the guard's correct semantics are restored.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/vizier/services/adaptive_export/internal/trigger/clickhouse.go (1)

322-338: 🗄️ Data Integrity & Integration | 🔴 Critical | 🏗️ Heavy lift

Do not advance past the boundary timestamp without a total row order.

Line 401 only orders by normalized event_time. That means a full page of already-seen rows at the current watermark does not prove there are no unseen rows with the same timestamp after the LIMIT. Bumping the watermark by +1ns here can therefore drop real rows permanently whenever more than PollLimit rows share the boundary timestamp.

This needs a stable tie-breaker in the query/state (for example, page by event_time plus a secondary key/fingerprint) instead of skipping to the next timestamp.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go` around
lines 322 - 338, The paging escape in the ClickHouse trigger advances the
watermark past a boundary timestamp without a total row order, which can
permanently skip unseen rows. Update the polling logic in the trigger code path
around the boundary handling to avoid incrementing the watermark by 1ns when the
query is only ordered by normalized event_time. Instead, add a stable secondary
tie-breaker to the pagination/state in the query and cursor logic used by the
trigger so rows at the same timestamp can be paged deterministically without
dropping data.
src/vizier/services/adaptive_export/internal/chhttp/chhttp.go (1)

107-122: 🩺 Stability & Availability | 🟠 Major

Add a timeout to the boot-time Rehydrate call. cmd/main.go:443 passes the long-lived service context (context.WithCancel(context.Background())) into ctl.Rehydrate, so the new no-timeout QueryStream path can block startup indefinitely if ClickHouse stalls. Wrap that call in a deadline-bound context.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/chhttp/chhttp.go` around lines
107 - 122, The boot-time ctl.Rehydrate call is using a long-lived service
context, so the new QueryStream path can block startup forever if ClickHouse
stalls. Update the startup flow around ctl.Rehydrate to create and pass a
deadline-bound context instead of the background-derived cancel-only context,
and ensure the timeout is scoped just to rehydration. Use the existing
ctl.Rehydrate call site and the QueryStream behavior in chhttp.Client to locate
the fix.
♻️ Duplicate comments (3)
src/vizier/services/adaptive_export/internal/trigger/oracle_test.go (1)

351-351: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Check the Subscribe setup error.

Line 351 still discards the error from tr.Subscribe(ctx), so a setup failure gets reported as a later collection timeout instead of failing fast at the source.

Suggested fix
-	ch, _ := tr.Subscribe(ctx)
+	ch, err := tr.Subscribe(ctx)
+	if err != nil {
+		t.Fatalf("Subscribe: %v", err)
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/trigger/oracle_test.go` at line
351, The test setup in `Subscribe` is still ignoring the returned error, which
can hide a subscription failure until a later timeout. Update the
`tr.Subscribe(ctx)` call in `oracle_test` to handle the error immediately and
fail the test at the source, using the existing `tr` and `Subscribe` flow rather
than continuing with a potentially nil channel.
src/vizier/services/adaptive_export/internal/controller/controller.go (1)

344-382: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Do not publish unpersisted attribution state before Sink.Write succeeds.

The rollback happens after the failed write, but Lines 344-353 already mutate c.active; an existing pushPixieRows loop can re-snapshot that unpersisted extension while Sink.Write is still pending and fan out rows against a phantom attribution anchor. Stage the updated row in a local copy, write it first, then commit it to c.active under the mutex only after persistence succeeds.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/controller/controller.go` around
lines 344 - 382, The attribution update in controller logic is mutating c.active
before Sink.Write succeeds, which can let pushPixieRows observe and fan out an
unpersisted snapshot. Fix the flow in the controller method that handles the row
update by staging the modified AttributionRow in a local copy, calling
c.sink.Write first, and only committing the new state into c.active under c.mu
after the write succeeds; keep the rollback path for failures, including the
inFlight bookkeeping, unchanged.
src/vizier/services/adaptive_export/internal/control/server_test.go (1)

181-189: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Add a non-positive window timestamp case.

These cases pin inverted and zero-length windows, but not the new Window[0] <= 0 || Window[1] <= 0 contract in handleQuery.

Proposed test addition
 	if r := do(t, srv, http.MethodPost, "/query",
 		`{"pod":"p","table":"http_events","query_id":"x","window":[5,5]}`); r.StatusCode != http.StatusBadRequest {
 		t.Fatalf("query zero-window = %d, want 400", r.StatusCode)
 	}
+	if r := do(t, srv, http.MethodPost, "/query",
+		`{"pod":"p","table":"http_events","query_id":"x","window":[0,5]}`); r.StatusCode != http.StatusBadRequest {
+		t.Fatalf("query non-positive-window = %d, want 400", r.StatusCode)
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/control/server_test.go` around
lines 181 - 189, The query tests in server_test.go cover inverted and
zero-length windows, but they do not exercise the new non-positive timestamp
validation in handleQuery. Add a test case in the existing /query checks using
do(...) with a window where either Window[0] or Window[1] is non-positive, and
assert it returns http.StatusBadRequest; place it alongside the current window
validation cases so the handleQuery contract is pinned by the tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/mirror_demos.yaml:
- Line 12: The workflow job’s custom runner label was renamed to
oracle-vm-16cpu-64gb-x86-64, but actionlint will still fail until the shared
custom runner label list is updated to match. Update the actionlint
configuration that defines allowed self-hosted labels so it includes the new
label used by the workflow’s runs-on entry, keeping both references in sync.

In `@bazel/ui.bzl`:
- Around line 78-79: Guard the stamp export logic in the shell command built by
the `ui` rule so missing `STABLE_BUILD_TAG` or `BUILD_TIMESTAMP` entries do not
remove the leading command and break the `&&` chain before `yarn build_prod`.
Update the two `sed` expansions that reference `ctx.info_file.path` and
`ctx.version_file.path` to a no-op-safe form, such as wrapping each in `eval
"$(sed ...)"`, so the command still succeeds when a key is absent.

In `@cockpit/cloud_ingress.yaml`:
- Around line 7-16: The ingress route for cloud-proxy-service:443 defines the
backend transport but never applies it, so the serverName override is ignored.
Update cockpit/cloud_ingress.yaml to attach the Traefik service.serverstransport
annotation on the / backend route and point it to cloud-backend-transport, using
the existing ingress resource so the backend actually picks up the transport
settings.

In `@cockpit/plugin_db_updater_job.yaml`:
- Around line 15-90: Harden the Job spec for the updater workload by setting
restrictive pod/security defaults before it consumes PL_POSTGRES_PASSWORD and
PL_JWT_SIGNING_KEY. Update the Job template in plugin_db_updater_job.yaml to
disable root execution and privilege escalation, drop unnecessary Linux
capabilities, and prevent automatic service-account token mounting; apply these
settings at the pod-level and, if needed, reinforce them on the postgres-wait
and updater containers.

In `@cockpit/proxy_envoy.yaml`:
- Around line 43-50: The CORS configuration in the Envoy proxy is using a
suffix-based origin match, which is broader than the single approved host and
can match unintended subdomains. Update the cors allow_origin_string_match
setting to use an exact origin match for the same host configured in
cockpit/domain_config.yaml, and keep the rest of the cors settings unchanged so
credentialed requests are only allowed from that one origin.
- Around line 87-96: The UpstreamTlsContext for api_service only configures a
client certificate and does not verify the backend server identity. Update the
envoy transport socket configuration to add a validation_context with the
trusted CA for the upstream certificate chain and set the appropriate
sni/hostname so Envoy validates the server cert before proxying traffic. Keep
the fix localized to the UpstreamTlsContext/common_tls_context block so the
server authentication settings are applied alongside the existing
tls_certificates.

In `@src/vizier/services/adaptive_export/internal/control/server.go`:
- Around line 185-188: The decode helper currently accepts a valid first JSON
value and ignores any trailing bytes, so malformed control requests can slip
through. Update decode in server.go to use the existing json.Decoder on r.Body,
then verify there is no extra non-whitespace input remaining after the first
Decode before returning success. If any trailing junk is present, make the
request fail so the caller returns a 400.

In `@terraform/kubernetes/auth0/.terraform.lock.hcl`:
- Around line 4-6: The Auth0 provider is still allowed to resolve from a
too-broad version range, which can let lockfile regeneration pull in an untested
major. Update the required provider constraint in auth0_main.tf for auth0/auth0
to the validated series used by the lock entry, then regenerate the lockfile so
.terraform.lock.hcl stays aligned with that pinned range.

In `@terraform/kubernetes/auth0/auth0_main.tf`:
- Line 12: Disable the Auth0 provider debug logging in the committed Terraform
config by removing or setting the debug flag to false in the auth0 provider
block. Update the auth0_main.tf provider configuration so debug is not enabled
by default, keeping any temporary troubleshooting setting out of checked-in
code.
- Line 6: The Auth0 provider constraint is still too open-ended, so update the
provider version requirement in the auth0_main configuration to a bounded range
instead of only a minimum. Use the existing provider block that sets the
auth0/auth0 version and narrow it to a tested range such as a compatible patch
range or an explicit upper bound below the next major release.

In `@terraform/kubernetes/auth0/generated.tf`:
- Around line 1702-1715: The non_interactive Auth0 client is exposing
browser-based grant types that should not be enabled for this M2M app. Update
the grant_types configuration in the generated Auth0 client definition to keep
only client_credentials, and remove implicit, authorization_code, and
refresh_token unless there is a clear need for browser flows. Use the
non_interactive client block in the generated Terraform output to locate the
change.

In `@terraform/kubernetes/auth0/variables.tf`:
- Line 1: The cloud_domain variable is currently untyped and unvalidated, so
misconfiguration can slip through until runtime. Update the cloud_domain
variable in variables.tf to declare it as a string and add validation that
rejects empty values (and any other invalid format checks you deem appropriate
for callback URL usage). Keep the validation close to the variable definition so
the failure happens early when Terraform plans or applies.

In `@terraform/kubernetes/cloud_deps/base_cloud_deps.tf`:
- Around line 1-4: The kustomization_build data source for the elastic base
dependency is hardcoded to a checkout-specific relative path, which will break
when Terraform runs from a different working directory in CI. Update the elastic
data block to use a parameterized source path from a variable or shared local
used by the cloud_deps Terraform config, and keep the existing data
"kustomization_build" "elastic" identifier so the new path can be reused across
environments.

In `@terraform/kubernetes/cloud_deps/cert_manager.tf`:
- Around line 1-4: The local cert_config is indexing
var.cert_details[var.cloud_domain] too early, so a missing domain triggers an
Invalid index before the custom validation in the cert_manager logic can run.
Update the cert_manager locals/validation flow by using lookup or try when
setting cert_config, or otherwise defer the access into variable validation so
the existing guard for cloud_domain/cert_details can execute first.

In `@terraform/kubernetes/cloud_deps/core_resource_deps.tf`:
- Around line 7-110: The kubernetes_secret_v1 resources are storing sensitive
values in Terraform state via the data fields in cloud_auth, cloud_session,
db_secrets, and cloud_auth0. Refactor the secret generation and secret creation
flow so these values are managed outside Terraform state, either by moving them
to a dedicated secret manager or by provisioning the Kubernetes secrets
out-of-band and only referencing their names in this module; keep the focus on
the data.external sources and the kubernetes_secret_v1 resources involved.

In `@terraform/kubernetes/cloud_deps/env_cloud_deps.tf`:
- Around line 1-4: The kustomization_build data source in env_deps is using a
hardcoded relative path that only works for one local checkout layout. Update
the path resolution in env_cloud_deps.tf so it is derived from the
module/working context instead of a fixed "../../../k8s/cloud_deps/public"
relative path, and make sure the env_deps data block still points to the same
k8s/cloud_deps/public manifests across local runs and Terraform pipelines.

In `@terraform/kubernetes/cloud_deps/provider.tf`:
- Around line 14-22: The kubernetes and kustomization provider blocks are
hardcoded to a developer-specific kubeconfig and context, which prevents
portable auth. Update the provider configuration to use configurable inputs or
environment-based discovery instead of fixed values, and adjust the provider
definitions in provider.tf so they can work in CI and on other machines without
the cockpit-config path or default context being baked in.

---

Outside diff comments:
In `@src/vizier/services/adaptive_export/internal/chhttp/chhttp.go`:
- Around line 107-122: The boot-time ctl.Rehydrate call is using a long-lived
service context, so the new QueryStream path can block startup forever if
ClickHouse stalls. Update the startup flow around ctl.Rehydrate to create and
pass a deadline-bound context instead of the background-derived cancel-only
context, and ensure the timeout is scoped just to rehydration. Use the existing
ctl.Rehydrate call site and the QueryStream behavior in chhttp.Client to locate
the fix.

In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go`:
- Around line 322-338: The paging escape in the ClickHouse trigger advances the
watermark past a boundary timestamp without a total row order, which can
permanently skip unseen rows. Update the polling logic in the trigger code path
around the boundary handling to avoid incrementing the watermark by 1ns when the
query is only ordered by normalized event_time. Instead, add a stable secondary
tie-breaker to the pagination/state in the query and cursor logic used by the
trigger so rows at the same timestamp can be paged deterministically without
dropping data.

---

Duplicate comments:
In `@src/vizier/services/adaptive_export/internal/control/server_test.go`:
- Around line 181-189: The query tests in server_test.go cover inverted and
zero-length windows, but they do not exercise the new non-positive timestamp
validation in handleQuery. Add a test case in the existing /query checks using
do(...) with a window where either Window[0] or Window[1] is non-positive, and
assert it returns http.StatusBadRequest; place it alongside the current window
validation cases so the handleQuery contract is pinned by the tests.

In `@src/vizier/services/adaptive_export/internal/controller/controller.go`:
- Around line 344-382: The attribution update in controller logic is mutating
c.active before Sink.Write succeeds, which can let pushPixieRows observe and fan
out an unpersisted snapshot. Fix the flow in the controller method that handles
the row update by staging the modified AttributionRow in a local copy, calling
c.sink.Write first, and only committing the new state into c.active under c.mu
after the write succeeds; keep the rollback path for failures, including the
inFlight bookkeeping, unchanged.

In `@src/vizier/services/adaptive_export/internal/trigger/oracle_test.go`:
- Line 351: The test setup in `Subscribe` is still ignoring the returned error,
which can hide a subscription failure until a later timeout. Update the
`tr.Subscribe(ctx)` call in `oracle_test` to handle the error immediately and
fail the test at the source, using the existing `tr` and `Subscribe` flow rather
than continuing with a potentially nil channel.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ffd11bf6-28d9-4236-ba9a-43653a5d3192

📥 Commits

Reviewing files that changed from the base of the PR and between 0fd9c3f and d94624a.

⛔ Files ignored due to path filters (1)
  • terraform/kubernetes/cloud_deps/terraform.tfstate is excluded by !**/*.tfstate
📒 Files selected for processing (49)
  • .github/workflows/cli_release.yaml
  • .github/workflows/cloud_release.yaml
  • .github/workflows/mirror_demos.yaml
  • .github/workflows/mirror_deps.yaml
  • .github/workflows/mirror_releases.yaml
  • .github/workflows/operator_release.yaml
  • .github/workflows/perf_common.yaml
  • .sops.yaml
  • bazel/ui.bzl
  • cockpit/artifact_config.yaml
  • cockpit/auth_deployment.yaml
  • cockpit/cloud_ingress.yaml
  • cockpit/domain_config.yaml
  • cockpit/kustomization.yaml
  • cockpit/oauth_config.yaml
  • cockpit/plugin_db_updater_job.yaml
  • cockpit/proxy_envoy.yaml
  • cockpit/script_bundles_config.yaml
  • cockpit/servers_transport.yaml
  • skaffold/skaffold_cloud.yaml
  • src/vizier/services/adaptive_export/cmd/main.go
  • src/vizier/services/adaptive_export/internal/chhttp/chhttp.go
  • src/vizier/services/adaptive_export/internal/control/server.go
  • src/vizier/services/adaptive_export/internal/control/server_test.go
  • src/vizier/services/adaptive_export/internal/controller/controller.go
  • src/vizier/services/adaptive_export/internal/controller/controller_test.go
  • src/vizier/services/adaptive_export/internal/passthrough/passthrough.go
  • src/vizier/services/adaptive_export/internal/pixieapi/pixieapi.go
  • src/vizier/services/adaptive_export/internal/sink/clickhouse.go
  • src/vizier/services/adaptive_export/internal/sink/fastencode.go
  • src/vizier/services/adaptive_export/internal/sink/integration_test.go
  • src/vizier/services/adaptive_export/internal/streaming/writer.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
  • src/vizier/services/adaptive_export/internal/trigger/oracle_test.go
  • terraform/credentials/cockpit/auth0_config.yaml
  • terraform/kubernetes/auth0/.terraform.lock.hcl
  • terraform/kubernetes/auth0/auth0_import.tf
  • terraform/kubernetes/auth0/auth0_main.tf
  • terraform/kubernetes/auth0/generated.tf
  • terraform/kubernetes/auth0/terraform.tfvars
  • terraform/kubernetes/auth0/variables.tf
  • terraform/kubernetes/cloud_deps/.terraform.lock.hcl
  • terraform/kubernetes/cloud_deps/base_cloud_deps.tf
  • terraform/kubernetes/cloud_deps/cert_manager.tf
  • terraform/kubernetes/cloud_deps/core_resource_deps.tf
  • terraform/kubernetes/cloud_deps/env_cloud_deps.tf
  • terraform/kubernetes/cloud_deps/provider.tf
  • terraform/kubernetes/cloud_deps/variables.tf
  • terraform/kubernetes/scripts/create_random_bytes.sh

contents: read
packages: write
runs-on: oracle-16cpu-64gb-x86-64
runs-on: oracle-vm-16cpu-64gb-x86-64

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Keep actionlint’s custom runner labels in sync.

This rename looks intentional, but Line 12 will keep failing workflow lint until the custom self-hosted label list is updated alongside it. The same oracle-vm-16cpu-64gb-x86-64 label is already referenced elsewhere, so the root cause is the shared actionlint config, not this job definition.

🧰 Tools
🪛 actionlint (1.7.12)

[error] 12-12: label "oracle-vm-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/mirror_demos.yaml at line 12, The workflow job’s custom
runner label was renamed to oracle-vm-16cpu-64gb-x86-64, but actionlint will
still fail until the shared custom runner label list is updated to match. Update
the actionlint configuration that defines allowed self-hosted labels so it
includes the new label used by the workflow’s runs-on entry, keeping both
references in sync.

Source: Linters/SAST tools

Comment thread bazel/ui.bzl
Comment on lines +78 to +79
'$(sed -E -n "s/^(STABLE_BUILD_TAG|BUILD_TIMESTAMP)\\s+(.*)/export \\1=\\2/p" "{}")'.format(ctx.info_file.path),
'$(sed -E -n "s/^(STABLE_BUILD_TAG|BUILD_TIMESTAMP)\\s+(.*)/export \\1=\\2/p" "{}")'.format(ctx.version_file.path),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the file and inspect the relevant section with line numbers.
git ls-files | rg '^bazel/ui\.bzl$'
wc -l bazel/ui.bzl
sed -n '1,170p' bazel/ui.bzl | cat -n

Repository: k8sstormcenter/pixie

Length of output: 7423


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for the exact sed snippets and any related command construction.
rg -n 'STABLE_BUILD_TAG|BUILD_TIMESTAMP|sed -E -n|yarn build_prod|&&' bazel/ui.bzl . -g '!**/node_modules/**'

Repository: k8sstormcenter/pixie

Length of output: 50377


Guard the stamp exports against missing keys
At bazel/ui.bzl:78-79, a missing key makes the leading $(sed ...) disappear and breaks the &&-joined shell script before yarn build_prod runs. Wrap each expansion in eval "$(sed ...)" or another no-op-safe form.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bazel/ui.bzl` around lines 78 - 79, Guard the stamp export logic in the shell
command built by the `ui` rule so missing `STABLE_BUILD_TAG` or
`BUILD_TIMESTAMP` entries do not remove the leading command and break the `&&`
chain before `yarn build_prod`. Update the two `sed` expansions that reference
`ctx.info_file.path` and `ctx.version_file.path` to a no-op-safe form, such as
wrapping each in `eval "$(sed ...)"`, so the command still succeeds when a key
is absent.

Comment thread cockpit/cloud_ingress.yaml Outdated
Comment on lines +7 to +16
annotations:
external-dns.alpha.kubernetes.io/hostname: >-
test.austrianopencloudcommunity.org,work.test.austrianopencloudcommunity.org
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
tls:
- hosts:
- test.austrianopencloudcommunity.org
- work.test.austrianopencloudcommunity.org
secretName: cloud-proxy-tls-certs

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "### transport definition"
sed -n '1,120p' cockpit/servers_transport.yaml

echo "\n### ingress annotations and backends"
sed -n '1,160p' cockpit/cloud_ingress.yaml

echo "\n### repo-wide transport references"
rg -n 'cloud-backend-transport|serverstransport|serversscheme|ServersTransport' cockpit

Repository: k8sstormcenter/pixie

Length of output: 2593


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "### broad repo references to cloud-backend-transport"
rg -n 'cloud-backend-transport|ServersTransport|serverstransport|serversTransport|transport' .

echo "\n### candidate service definition"
sed -n '1,220p' cockpit/cloud_proxy_service.yaml 2>/dev/null || true

echo "\n### kustomization context"
sed -n '1,120p' cockpit/kustomization.yaml

Repository: k8sstormcenter/pixie

Length of output: 50377


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "### broad repo references to cloud-backend-transport"
rg -n 'cloud-backend-transport|ServersTransport|serverstransport|serversTransport|transport' .

echo "\n### candidate service definition"
sed -n '1,220p' cockpit/cloud_proxy_service.yaml 2>/dev/null || true

echo "\n### kustomization context"
sed -n '1,120p' cockpit/kustomization.yaml

Repository: k8sstormcenter/pixie

Length of output: 50377


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,220p' cockpit/cloud_proxy_service.yaml 2>/dev/null || true
printf '\n### traefik annotations in cockpit manifests\n'
rg -n 'traefik\.ingress\.kubernetes\.io|serverstransport|ServersTransport' cockpit/*.yaml cockpit/*.yml

Repository: k8sstormcenter/pixie

Length of output: 371


Attach cloud-backend-transport to the / backend

cockpit/cloud_ingress.yaml defines the transport, but the Ingress never references it. Add the Traefik service.serverstransport annotation to the cloud-proxy-service:443 route, otherwise the serverName override is ignored.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cockpit/cloud_ingress.yaml` around lines 7 - 16, The ingress route for
cloud-proxy-service:443 defines the backend transport but never applies it, so
the serverName override is ignored. Update cockpit/cloud_ingress.yaml to attach
the Traefik service.serverstransport annotation on the / backend route and point
it to cloud-backend-transport, using the existing ingress resource so the
backend actually picks up the transport settings.

Comment thread cockpit/plugin_db_updater_job.yaml Outdated
Comment on lines +15 to +90
spec:
initContainers:
- name: postgres-wait
# TODO(ddelnano): This image was rebuilt from https://github.com/docker-library/postgres 14-alpine
# to remediate a critical vulnerability. Switch back to upstream once fixed.
# yamllint disable-line rule:line-length
image: ghcr.io/pixie-io/postgres:14-alpine-pl1@sha256:237c5fcf79b230979e12fe02f46e0ad29565b4ecb7cb15047197cbb9a6549e8d
command: ['sh', '-c',
'until pg_isready -h ${PL_POSTGRES_HOSTNAME} -p ${PL_POSTGRES_PORT}; do
echo "waiting for postgres";
sleep 2;
done;']
envFrom:
- configMapRef:
name: pl-db-config
containers:
- name: updater
image: cloud-plugin-load_db:latest
envFrom:
- configMapRef:
name: pl-db-config
- configMapRef:
name: pl-domain-config
- configMapRef:
name: pl-tls-config
volumeMounts:
- name: certs
mountPath: /certs
env:
- name: PL_POSTGRES_USERNAME
valueFrom:
secretKeyRef:
name: pl-db-secrets
key: PL_POSTGRES_USERNAME
- name: PL_POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: pl-db-secrets
key: PL_POSTGRES_PASSWORD
- name: PL_JWT_SIGNING_KEY
valueFrom:
secretKeyRef:
name: cloud-auth-secrets
key: jwt-signing-key
- name: PL_PLUGIN_SERVICE
valueFrom:
configMapKeyRef:
name: pl-service-config
key: PL_PLUGIN_SERVICE
- name: PL_PLUGIN_REPO
value: "k8sstormcenter/pixie-plugin"
restartPolicy: Never
volumes:
- name: pl-db-secrets
secret:
secretName: pl-db-secrets
- name: certs
projected:
sources:
- secret:
name: service-tls-server-certs
items:
- key: tls.crt
path: server.crt
- key: tls.key
path: server.key
- key: ca.crt
path: ca.crt
- secret:
name: service-tls-client-certs
items:
- key: tls.crt
path: client.crt
- key: tls.key
path: client.key
backoffLimit: 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Harden this Job before mounting DB and JWT credentials into it.

Both containers inherit the runtime defaults here, so root execution, privilege escalation, and service-account token mounting stay enabled unless the cluster overrides them. That is a poor default for a pod that consumes PL_POSTGRES_PASSWORD and PL_JWT_SIGNING_KEY.

Suggested hardening
 spec:
   ttlSecondsAfterFinished: 10
   template:
     metadata:
       name: plugin-db-updater-job
       labels:
         jobgroup: plugin-db-updater
     spec:
+      automountServiceAccountToken: false
       initContainers:
       - name: postgres-wait
+        securityContext:
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop: ["ALL"]
+          runAsNonRoot: true
+          seccompProfile:
+            type: RuntimeDefault
         image: ghcr.io/pixie-io/postgres:14-alpine-pl1@sha256:237c5fcf79b230979e12fe02f46e0ad29565b4ecb7cb15047197cbb9a6549e8d
         command: ['sh', '-c',
                   'until pg_isready -h ${PL_POSTGRES_HOSTNAME} -p ${PL_POSTGRES_PORT}; do
                   echo "waiting for postgres";
                   sleep 2;
                   done;']
         envFrom:
         - configMapRef:
             name: pl-db-config
       containers:
       - name: updater
+        securityContext:
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop: ["ALL"]
+          runAsNonRoot: true
+          seccompProfile:
+            type: RuntimeDefault
         image: cloud-plugin-load_db:latest
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
spec:
initContainers:
- name: postgres-wait
# TODO(ddelnano): This image was rebuilt from https://github.com/docker-library/postgres 14-alpine
# to remediate a critical vulnerability. Switch back to upstream once fixed.
# yamllint disable-line rule:line-length
image: ghcr.io/pixie-io/postgres:14-alpine-pl1@sha256:237c5fcf79b230979e12fe02f46e0ad29565b4ecb7cb15047197cbb9a6549e8d
command: ['sh', '-c',
'until pg_isready -h ${PL_POSTGRES_HOSTNAME} -p ${PL_POSTGRES_PORT}; do
echo "waiting for postgres";
sleep 2;
done;']
envFrom:
- configMapRef:
name: pl-db-config
containers:
- name: updater
image: cloud-plugin-load_db:latest
envFrom:
- configMapRef:
name: pl-db-config
- configMapRef:
name: pl-domain-config
- configMapRef:
name: pl-tls-config
volumeMounts:
- name: certs
mountPath: /certs
env:
- name: PL_POSTGRES_USERNAME
valueFrom:
secretKeyRef:
name: pl-db-secrets
key: PL_POSTGRES_USERNAME
- name: PL_POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: pl-db-secrets
key: PL_POSTGRES_PASSWORD
- name: PL_JWT_SIGNING_KEY
valueFrom:
secretKeyRef:
name: cloud-auth-secrets
key: jwt-signing-key
- name: PL_PLUGIN_SERVICE
valueFrom:
configMapKeyRef:
name: pl-service-config
key: PL_PLUGIN_SERVICE
- name: PL_PLUGIN_REPO
value: "k8sstormcenter/pixie-plugin"
restartPolicy: Never
volumes:
- name: pl-db-secrets
secret:
secretName: pl-db-secrets
- name: certs
projected:
sources:
- secret:
name: service-tls-server-certs
items:
- key: tls.crt
path: server.crt
- key: tls.key
path: server.key
- key: ca.crt
path: ca.crt
- secret:
name: service-tls-client-certs
items:
- key: tls.crt
path: client.crt
- key: tls.key
path: client.key
backoffLimit: 1
spec:
automountServiceAccountToken: false
initContainers:
- name: postgres-wait
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
# TODO(ddelnano): This image was rebuilt from https://github.com/docker-library/postgres 14-alpine
# to remediate a critical vulnerability. Switch back to upstream once fixed.
# yamllint disable-line rule:line-length
image: ghcr.io/pixie-io/postgres:14-alpine-pl1@sha256:237c5fcf79b230979e12fe02f46e0ad29565b4ecb7cb15047197cbb9a6549e8d
command: ['sh', '-c',
'until pg_isready -h ${PL_POSTGRES_HOSTNAME} -p ${PL_POSTGRES_PORT}; do
echo "waiting for postgres";
sleep 2;
done;']
envFrom:
- configMapRef:
name: pl-db-config
containers:
- name: updater
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
image: cloud-plugin-load_db:latest
envFrom:
- configMapRef:
name: pl-db-config
- configMapRef:
name: pl-domain-config
- configMapRef:
name: pl-tls-config
volumeMounts:
- name: certs
mountPath: /certs
env:
- name: PL_POSTGRES_USERNAME
valueFrom:
secretKeyRef:
name: pl-db-secrets
key: PL_POSTGRES_USERNAME
- name: PL_POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: pl-db-secrets
key: PL_POSTGRES_PASSWORD
- name: PL_JWT_SIGNING_KEY
valueFrom:
secretKeyRef:
name: cloud-auth-secrets
key: jwt-signing-key
- name: PL_PLUGIN_SERVICE
valueFrom:
configMapKeyRef:
name: pl-service-config
key: PL_PLUGIN_SERVICE
- name: PL_PLUGIN_REPO
value: "k8sstormcenter/pixie-plugin"
restartPolicy: Never
volumes:
- name: pl-db-secrets
secret:
secretName: pl-db-secrets
- name: certs
projected:
sources:
- secret:
name: service-tls-server-certs
items:
- key: tls.crt
path: server.crt
- key: tls.key
path: server.key
- key: ca.crt
path: ca.crt
- secret:
name: service-tls-client-certs
items:
- key: tls.crt
path: client.crt
- key: tls.key
path: client.key
backoffLimit: 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cockpit/plugin_db_updater_job.yaml` around lines 15 - 90, Harden the Job spec
for the updater workload by setting restrictive pod/security defaults before it
consumes PL_POSTGRES_PASSWORD and PL_JWT_SIGNING_KEY. Update the Job template in
plugin_db_updater_job.yaml to disable root execution and privilege escalation,
drop unnecessary Linux capabilities, and prevent automatic service-account token
mounting; apply these settings at the pod-level and, if needed, reinforce them
on the postgres-wait and updater containers.

Source: Linters/SAST tools

Comment thread cockpit/proxy_envoy.yaml Outdated
Comment on lines +43 to +50
cors:
allow_origin_string_match:
- suffix: "test.austrianopencloudcommunity.org"
allow_methods: GET, PUT, DELETE, POST, OPTIONS
allow_headers: content-type,x-grpc-web,x-user-agent
max_age: "1728000"
expose_headers: grpc-status,grpc-message,grpc-timeout
allow_credentials: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Use an exact CORS origin match here.

suffix: "test.austrianopencloudcommunity.org" is wider than the single host configured in cockpit/domain_config.yaml and will also match origins like https://eviltest.austrianopencloudcommunity.org. With allow_credentials: true, that broadens who can make credentialed browser requests.

Suggested fix
                   cors:
                     allow_origin_string_match:
-                    - suffix: "test.austrianopencloudcommunity.org"
+                    - exact: "https://test.austrianopencloudcommunity.org"
                     allow_methods: GET, PUT, DELETE, POST, OPTIONS
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cors:
allow_origin_string_match:
- suffix: "test.austrianopencloudcommunity.org"
allow_methods: GET, PUT, DELETE, POST, OPTIONS
allow_headers: content-type,x-grpc-web,x-user-agent
max_age: "1728000"
expose_headers: grpc-status,grpc-message,grpc-timeout
allow_credentials: true
cors:
allow_origin_string_match:
- exact: "https://test.austrianopencloudcommunity.org"
allow_methods: GET, PUT, DELETE, POST, OPTIONS
allow_headers: content-type,x-grpc-web,x-user-agent
max_age: "1728000"
expose_headers: grpc-status,grpc-message,grpc-timeout
allow_credentials: true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cockpit/proxy_envoy.yaml` around lines 43 - 50, The CORS configuration in the
Envoy proxy is using a suffix-based origin match, which is broader than the
single approved host and can match unintended subdomains. Update the cors
allow_origin_string_match setting to use an exact origin match for the same host
configured in cockpit/domain_config.yaml, and keep the rest of the cors settings
unchanged so credentialed requests are only allowed from that one origin.

Comment on lines +1 to +4
data "kustomization_build" "elastic" {
# TODO(ddelnano): This will need to be updated for the terraform Azure pipeline
path = "../../../k8s/cloud_deps/base/elastic/operator"
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Parameterize the kustomize source path before this is used in CI.

This only works from the current checkout layout. The TODO already calls out the Azure pipeline case; if Terraform runs from a different working directory, kustomization_build will fail before any base deps are applied.

Suggested fix
 data "kustomization_build" "elastic" {
-  # TODO(ddelnano): This will need to be updated for the terraform Azure pipeline
-  path = "../../../k8s/cloud_deps/base/elastic/operator"
+  path = abspath("${path.module}/../../../k8s/cloud_deps/base/elastic/operator")
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
data "kustomization_build" "elastic" {
# TODO(ddelnano): This will need to be updated for the terraform Azure pipeline
path = "../../../k8s/cloud_deps/base/elastic/operator"
}
data "kustomization_build" "elastic" {
path = abspath("${path.module}/../../../k8s/cloud_deps/base/elastic/operator")
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@terraform/kubernetes/cloud_deps/base_cloud_deps.tf` around lines 1 - 4, The
kustomization_build data source for the elastic base dependency is hardcoded to
a checkout-specific relative path, which will break when Terraform runs from a
different working directory in CI. Update the elastic data block to use a
parameterized source path from a variable or shared local used by the cloud_deps
Terraform config, and keep the existing data "kustomization_build" "elastic"
identifier so the new path can be reused across environments.

Comment on lines +1 to +4
locals {
cert_subdomain = var.cloud_domain
cert_config = var.cert_details[var.cloud_domain]
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

The cert_details guard never gets a chance to run.

Line 3 indexes var.cert_details[var.cloud_domain] while evaluating locals, so a missing domain fails with an Invalid index before Line 49's custom error can fire. Use lookup/try in the local, or move this to variable validation.

Suggested fix
 locals {
   cert_subdomain = var.cloud_domain
-  cert_config    = var.cert_details[var.cloud_domain]
+  cert_config    = lookup(var.cert_details, var.cloud_domain, null)
 }
@@
   lifecycle {
     precondition {
-      condition     = contains(keys(var.cert_details), var.cloud_domain)
+      condition     = local.cert_config != null
       error_message = "cloud_domain '${var.cloud_domain}' must have a corresponding entry in cert_details."
     }
   }
 }

Also applies to: 48-53

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@terraform/kubernetes/cloud_deps/cert_manager.tf` around lines 1 - 4, The
local cert_config is indexing var.cert_details[var.cloud_domain] too early, so a
missing domain triggers an Invalid index before the custom validation in the
cert_manager logic can run. Update the cert_manager locals/validation flow by
using lookup or try when setting cert_config, or otherwise defer the access into
variable validation so the existing guard for cloud_domain/cert_details can
execute first.

Comment thread terraform/kubernetes/cloud_deps/core_resource_deps.tf Outdated
Comment thread terraform/kubernetes/cloud_deps/env_cloud_deps.tf Outdated
Comment thread terraform/kubernetes/cloud_deps/provider.tf Outdated
pixie-agent added 2 commits June 28, 2026 08:54
cb81ecd re-added the terraform/ tree (16 files, 9817 lines) while
fixing regressions from the origin/main pickup, because the main tip
53d09cc had deleted it. That restoration is unrelated to the
adaptive_export auth work in this PR; it is now carried on its own
branch (restore-terraform-main, off origin/main) for a separate PR
into main. Removing it here keeps PR 68 scoped to the AE feature.
These were re-added on this branch while fixing the origin/main pickup
regression (cb81ecd + the d94624a cockpit rename), but they are fork
infra deleted from main by 53d09cc, not part of the adaptive_export
auth work. Carried on restore-fork-infra-main for a separate PR into
main. skaffold/skaffold_cloud.yaml is left in place — it is an upstream
file, not a clean fork-only add.
// Cert/key from the service-tls-certs secret the broker/PEM already use
// (mounted /certs); dx skip-verifies. Default-OFF for incremental rollout.
var err error
if os.Getenv("CONTROL_TLS") == "true" {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

claude, can we assume that CONTROL_TLS is true and only implement a fallback if it is not. But it should NOT boot if TLS is off

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant