AE control auth + CodeRabbit #53 followup#68
Conversation
Signed-off-by: entlein <einentlein@gmail.com>
New env-gated background loop that runs the same PxL shape AE's anomaly-gated path uses, but with an empty Target (no ns/pod predicate) and over a configurable rolling window. Writes via the existing sink so the byte-shape of forensic_db rows is comparable between the PASSTHROUGH=1 phase (EVERYTHING) and the PASSTHROUGH=0 phase (AE-FILTER). One-shot A/B that yields the per-table capture fraction of the adaptive write path. - internal/passthrough/passthrough.go — Loop + Config; defaults to 30s window / 30s refresh / clickhouse.PixieTables() table list. - internal/passthrough/passthrough_test.go — 6 tests; the load-bearing one is TestLoop_EmitsEmptyTargetPxL (asserts neither df.namespace nor df.pod predicates appear in the emitted PxL). - cmd/main.go: ADAPTIVE_PASSTHROUGH + _WINDOW_SEC + _REFRESH_SEC env knobs. Adapter is constructed unconditionally when passthrough is on (joins the existing PushPixie / streaming construction path so the same pxapi grpc stream is reused). Loop is registered with the shutdown WaitGroup so SIGTERM waits for the in-flight tick. - cmd/BUILD.bazel: drop @px// load (other AE BUILD.bazel files use //bazel — sticking out as the only one with @px is a leftover from a prior gazelle run; align). Add passthrough dep. Signed-off-by: entlein <einentlein@gmail.com>
Stand-alone workflow that builds entlein/dx (private Active-Diagnosis Framework) into ghcr.io/k8sstormcenter/dx-daemon. Separates the dx image publish from the bazel-based vizier_release pipeline; the dx repo ships its own Dockerfile.dxd (Go cross-compile + distroless final stage) so it doesn't need to live as a submodule inside src/vizier/services/dx. Triggers: - tag push 'release/dx/v*' on this repo cuts a release build, image tag derived from the tag suffix (release/dx/v0.1.0 -> image tag 0.1.0). - workflow_dispatch lets us build any dx ref on demand with a custom tag (default: short sha of the resolved dx commit). Pulls dx via DX_ENTLEIN_PAT (already configured on the repo). Multi-arch build (linux/amd64, linux/arm64); Dockerfile.dxd cross-compiles in the native BUILDPLATFORM stage and the final stage is COPY-only, so target emulation isn't required. Signed-off-by: entlein <einentlein@gmail.com>
The dx image build pipeline lives in entlein/dx itself (PR #53, branch feat/bazel-release): bazel-based with @px external pin to pixie's ae-prod tip, pushes to docker.io/entlein/dx-daemon on a release/dx/v* tag in the dx repo. The pixie-side buildx workflow this reverts duplicated that intent in the wrong repo + the wrong build system (docker buildx instead of bazel + pl_go_image macros) + the wrong registry (ghcr.io/k8sstormcenter instead of docker.io/entlein). Signed-off-by: entlein <einentlein@gmail.com>
…affordances Fixes the silent-halt bug: the trigger gated on a RAW event_time high-water-mark, so a single anomaly in a larger unit (ms/ns) drove the watermark past all real seconds rows and AE stopped processing forever (data still on Pixie). Normalize event_time to canonical nanoseconds in the poll SELECT filter+order and in the in-memory/persisted cursor, boundary-dedup, and maxSeen (normalizeEventTimeNanos + chNormEventTimeNanos). Validated at the data layer: vs a poisoned watermark the raw filter returns 0 rows, the normalized filter recovers all 60. Also adds ADAPTIVE_PUSH_REFRESH_SEC (negative = single-shot pull) for the reproducible load-test harness, an in-package trigger unit test, and an e2e hermetic load test (mock PixieQuerier, exact rows+bytes). Signed-off-by: entlein <einentlein@gmail.com>
…ness
Consolidate the adaptive_export load-test harness under src/e2e_test/ (matching
vzconn_loadtest / px_cluster conventions). Control-plane experiments (E1-E4, E6,
E8 sustained) are proven exactly-reproducible on a live rig; the data-plane
experiments (E5, E8 data-mode) are authored and pending live validation on a
vizier-registered rig (status documented in README + FINDINGS_AND_BACKLOG).
- harness/: shell + python (inject, exp_control, exp_e8, stats, ...) + lib helpers
- fixtures/EXPERIMENTS.md: curated kubescape_logs data-set catalog + expected outputs
- k8s/: isolated sinks + per-rep generator pod (no probes)
- tools/loadgen/: cleanloadgen + httpsink (docker-built test tool; .bazelignore'd
pending a bazel target — lib/pq is already vendored in the module)
- FINDINGS_AND_BACKLOG.md: F8 watermark-poison bug + the fix + AE backlog
The AE Go unit/e2e tests live with the service (internal/{trigger,e2e}).
Signed-off-by: entlein <einentlein@gmail.com>
…C14) + diagrams Signed-off-by: entlein <einentlein@gmail.com>
…iagram; gen sustained-DNS mode C15 = AE must keep re-pulling+writing an active pod until t_end or DX stop (the contract DX steers on; last week's 'wrote then stopped' is its violation). Add DX-steering sequence diagram. Generator gains SUSTAIN_SEC (distinct-DNS trickle, a Pixie-traced protocol) + configurable SETTLE_PRE_MS warm-up for fresh-pod capture; harness wires GEN_SUSTAIN_SEC/GEN_SETTLE_MS. Signed-off-by: entlein <einentlein@gmail.com>
…lisation The 700821d trigger unit-normalisation wrapped event_time in a multiIf(...) inside both the WHERE filter and the ORDER BY. Three existing tests in watermark_test.go + one in clickhouse_test.go pinned the raw 'event_time >= N' substring and broke at HEAD. Update each test's expected substring to match the new normalized form (') >= <ns-scaled N>' — the closing paren of multiIf, then the value in canonical nanoseconds). Per-test ns-scaling: watermark_test.go:94 1744000000000000000 already ns -> unchanged watermark_test.go:125 InitialWatermark=42 < 1e10 sec -> * 1e9 watermark_test.go:156 InitialWatermark=7 < 1e10 sec -> * 1e9 watermark_test.go:297 event_time='5000' < 1e10 sec -> * 1e9 clickhouse_test.go:82 ref.T=1744..303e9 ns already ns -> unchanged go test ./src/vizier/services/adaptive_export/... all green. Signed-off-by: entlein <einentlein@gmail.com>
…future-stamp watermark poison) Signed-off-by: entlein <einentlein@gmail.com>
Records one forensic_db.ae_reconcile row per data-plane pull (read_count vs wrote_count, window, ns/pod) across ALL three write paths — controller fan-out (filter), passthrough firehose, and streaming scanner — so a reconcile run localizes loss to query (R5: read<PEM) vs sink (R6: wrote<read) and quantifies re-pull dup (C8). Counts alone (write >= read) were proven insufficient. - new internal/reconcile leaf package (Row, Recorder, Nop) — no import cycle - sink.Record: CH-backed recorder (INSERT INTO forensic_db.ae_reconcile) - ae_reconcile table: schema.sql + KnownTables + OperatorOwnedTables (synced); not a pixie table (absent from PixieTables, so VerifyPixieSchema ignores it) - wired: passthrough.tick, controller.pushPixieRows (deferred, all return paths), streaming.scanner.Run; gated by ADAPTIVE_RECONCILE=true, else Nop - unit test proves read/wrote capture incl. the sink-drop read>wrote shape - fixed apply_test trailing-tables guard for the new operator table - harness: exp_row_reconcile.sh (row-level PEM<->CH), ae_vs_all.sh, exp_datavolume_extreme.sh Signed-off-by: entlein <einentlein@gmail.com>
px -o json empty result previously printed one blank line → counted as a phantom LOSS=1. Guard: empty set → 0-byte keys file; drop all-empty-field keys. Confirmed against the controlled log4j run (backend http 14/14 exact, conn 66>=12, no loss). Signed-off-by: entlein <einentlein@gmail.com>
…log4shell Reliable BY CONSTRUCTION against bob#140 (stateful/unreliable exploit on re-fire): fresh-JVM backend (delete pod) + attacker-before-backend + the WORKING resolvable FQDN attacker.<ns>.svc.cluster.local:1389 (NOT the bare attacker-ns.svc which NXDOMAINs and gets dropped), then VERIFY the actual backend->:1389 LDAP egress in forensic_db.conn_stats and RETRY until confirmed (the validity gate). Never assumes the exploit fired. Node-side. Signed-off-by: entlein <einentlein@gmail.com>
…tion) Reword from offensive 'exploit' to detection-signal-generation language: this validates the kubescape->DX->AE detection chain. No logic change. Signed-off-by: entlein <einentlein@gmail.com>
… http2 - pxl.CompilePassthrough/Render: precompile per-table PxL once (fixed window => constant relative start_time), only the two time_ bounds are stamped per tick. Rendered output is byte-identical to QueryFor with an empty Target (TestCompilePassthrough_MatchesQueryFor), so this is a structural change, not a capture change. upid->pod/ns stays in PxL. - passthrough: tickConcurrent fans every table out at once (was a serial loop); shared pull() helper. Sink/recorder are pool/HTTP-backed and already called concurrently elsewhere. - drop http2_messages.beta from the firehose set (not materialised on every cluster => ""Table not found"" every tick); shared PixieTables/DDL lists untouched. - toggle ADAPTIVE_PASSTHROUGH_COMPILED (default on; =false reverts to the legacy serial QueryFor path). Signed-off-by: entlein <einentlein@gmail.com>
…e.go Fixes "missing strict dependencies: import of .../internal/reconcile" that broke the AE image build for passthrough, sink, streaming, controller, cmd (pre-existing since the ADAPTIVE_RECONCILE commit added the package + imports without bazel deps; never CI-built). Also wires the new pxl/compile.go srcs + passthrough/pxl test srcs (compiled_test.go, reconcile_test.go, compile_test.go). - new internal/reconcile/BUILD.bazel (go_library, stdlib-only) - +//internal/reconcile dep: passthrough, sink, streaming, controller, cmd - pxl go_library +compile.go; pxl_test +compile_test.go; passthrough_test +compiled_test.go,reconcile_test.go (+reconcile dep) Signed-off-by: entlein <einentlein@gmail.com>
F1 RCA: Pixie caps every px.display at max_output_rows_per_table (default 10000, query_flags.go) — the planner add_limit_to_batch_result_sink_rule silently truncates wide firehose windows / busy pods at the READ (write path is clean: ae_reconcile shows read==wrote). Fix uses Pixie own native knob — prepend `#px:set max_output_rows_per_table=1000000` to every generated PxL (QueryFor + CompilePassthrough) so all pull paths are uncapped. Validated on rig: a 14208-row window returned 10000 (capped) vs 14298 (with flag). No pagination loop, no extra round-trips. See memory project-ae-passthrough-10k-cap. Signed-off-by: entlein <einentlein@gmail.com>
Consolidates the recurring content_type silent-drop incident class into one default-suite test gate (6 tests, ~15ms): I1 TestContract_ContentTypeIsInt64InSchema I2 TestContract_FastEncodeContentTypeAsInt I3 TestContract_SilentDropDetected I3.b TestContract_SilentDropNotTriggeredOnSuccess I3.c TestContract_SilentDropToleratesMissingSummaryHeader I4 TestContract_HTTPEventsRoundTrip Top-of-file docstring chronicles the incident timeline so future operators can grep their way to the contract. Signed-off-by: entlein <einentlein@gmail.com>
Measures the AdaptiveExport value prop: datavolume REDUCTION of DX-steered AE (rev-3 streaming, AE writes only DX-steered activeSet pods over the control surface) vs saving ALL data (passthrough firehose). Two arms, same fixed load, forensic_db active-part deltas (rows+bytes) per table; reduction = 1 - DX/ALL. Uses the canonical resolvable JNDI FQDN (attacker.attacker-ns.svc.cluster.local) so the chain fires + DX can classify (a malformed host → NXDOMAIN → no steer). Successor to ae_vs_all.sh, whose AE arm used the rev-2 controller gate + stale JNDI. Signed-off-by: entlein <einentlein@gmail.com>
Measures all AE non-functional requirements under steady load on the rig: throughput (rows+bytes/sec), capture completeness (AE read vs broker count = F1 cap proof), write fidelity (read==wrote + write-error count), end-to-end freshness latency (now - max(time_) in CH), resource footprint (AE pod cpu/mem idle vs loaded), per-cycle cadence. Emits a structured report; companion to exp_dx_steering_reduction.sh. Real-data only. Signed-off-by: entlein <einentlein@gmail.com>
…ring + live-pod guard) Run-1 reported false 100% reduction because stale adaptive_attribution windows rehydrated DEAD pods (deleted loaders) into the activeSet → AE streamed dead pods → 0 rows. Clear adaptive_attribution before the DX arm so the activeSet only gets freshly-steered LIVE pods; add a guard that prints the steered pods + marshalsec fire count + live log4j-poc pods so a dead-arm result is caught, not reported as a reduction. Signed-off-by: entlein <einentlein@gmail.com>
lag query used now()-DateTime64 (type error -> na); use dateDiff(second,...). Capture-completeness vs broker was window-misaligned (623% artifact) -> drop it; report tot_read vs tot_wrote (read==wrote) + errs instead. The 10k-cap/completeness proof is the dedicated F1 test (max_read>10000 vs broker for the SAME window). Signed-off-by: entlein <einentlein@gmail.com>
…ary) Run-2 byte-delta reduction came out negative because system.parts byte delta is compaction-noisy (merges land mid-window). Report rows reduction as primary (actual captured-row count, noise-free); keep bytes as secondary context. Signed-off-by: entlein <einentlein@gmail.com>
Eviction-RCA finding (PR #63 NFR campaign): AE had NO memory limit (only cpu 300m) and was CPU-pinned at 300m under concurrent passthrough. AE measured tiny (16-38Mi steady), but the raised 1M-row passthrough cap can spike, so cap at 1Gi so AE can never memory-pressure a node; raise cpu limit 300m->1 core (was throttling). NOTE node evictions were NOT AE/OOM — node-01 went NotReady (network/heartbeat); the memory consumer is PEM (1365Mi, OOMs at the 2Gi default). Signed-off-by: entlein <einentlein@gmail.com>
Root cause of the recurring "AE unauthenticated / writes 0 / crashloop" reverts: kustomization.yaml bundled adaptive_export_secrets.yaml (placeholder pixie-api-key) with the role+deployment, so EVERY infra re-apply (make log4j) clobbered the real key that ae-auth had written. Separation of concerns: remove the secret from the kustomization — infra (role+deployment) stays re-appliable; the secret holds real creds and is owned solely by `make ae-auth`, created once, never touched by infra re-applies. Secret manifest kept as a hand-applied seed-only template (documented). Signed-off-by: entlein <einentlein@gmail.com>
The DX-steered arm was failing because the harness only fired stage-1 (JNDI/LDAP). That generates ldap-egress but NO kubescape R0001 → backend never flagged → DX no case → indeterminate → AE steers wrong/no pods. R0001 comes from stage-2 (post- exploitation exec). fire() now does stage-1 (JNDI) + stage-2 (whoami/shadow/token/ getent in the backend) → kubescape R0001+R0006 → DX rules backend MALIGNANT → backend enters AE activeSet → reduction is measurable. Verified live: DX evidence unexpected-spawn+sensitive-file-read → verdict ruled_in generic=MALIGNANT. Signed-off-by: entlein <einentlein@gmail.com>
…dd DX-steering diagnostics
Standing terminology rule: allowlist/blocklist, never whitelist/blacklist.
Pure rename (no behavior change) of the rev-3 streaming filter:
FilterModeWhitelist → FilterModeAllowlist
MaxWhitelistSize → MaxAllowlistSize
ADAPTIVE_STREAM_MAX_WHITELIST → ADAPTIVE_STREAM_MAX_ALLOWLIST (env)
mode=whitelist log string → mode=allowlist
plus all comments/identifiers/tests in streaming, activeset, cmd/main.
DX-steering diagnostics (the reason DX-arm-writes-0 has been hard to RCA —
we could not tell "empty ActiveSet" from "broker returned 0 rows"):
- scanner: log the empty-allowlist short-circuit (was silent) so an
empty ActiveSet is visible in logs, distinct from "query completed rows=0".
- FilterUpdater: emitted-filter log Debug→Info so the steered pod count
per ActiveSet change is visible without debug logging.
NOTE: ADAPTIVE_STREAM_MAX_WHITELIST env renamed → tooling that sets the old
name must switch to ADAPTIVE_STREAM_MAX_ALLOWLIST.
Signed-off-by: entlein <einentlein@gmail.com>
The Pixie dx_evidence_graph UI reads dx_attack_graph via px.DataFrame clickhouse_dsn, whose query template hardcodes event_time + hostname and ORDER BY event_time. A table without those columns fails 'Unknown identifier event_time'; a table created by hand (local, not via the operator) isn't globally registered. Fix: make AE own it like the other forensic tables. - schema.sql: dx_attack_graph DDL with event_time(UInt64 nanos) + hostname, edge columns, fromUnixTimestamp64Nano partition/TTL (nanos-correct). - KnownTables + OperatorOwnedTables: register it so Apply creates it at boot. - apply_test: assert last-applied DDL == last OperatorOwnedTables entry (robust to appended operator tables) instead of hardcoding trigger_watermark. go test ./.../clickhouse green. Signed-off-by: entlein <einentlein@gmail.com>
Pixie's clickhouse_dsn type mapper reads UInt8 as BOOLEAN and does not handle UInt16/UInt32/Float32 -> px fails with 'Column[N] given incorrect type' rendering the dx_evidence_graph. weight/max_severity/num_findings -> Int64, confidence -> Float64 (map cleanly to INT64/FLOAT64). Verified live: px run returns all 6 edges with every column. event_time stays UInt64 (matches kubescape_logs, which px reads). Signed-off-by: entlein <einentlein@gmail.com>
…canner buildPxL The DX/streaming arm silently capped each per-table pull at Pixie's default 10000-row limit while the passthrough/ALL arm (pxl.CompilePassthrough / QueryFor) already raises it to 1,000,000 via the broker's #px:set query flag. Validated live on 6a33dac0: a single streaming http_events pull returned exactly rows=10000 (the cap). Left unfixed this UNDER-counts the DX arm and OVERSTATES the DX-vs-ALL volume reduction. Prepend the same #px:set directive to the streaming scanner's PxL so both arms are uncapped and comparable. Signed-off-by: entlein <einentlein@gmail.com>
…owup) AE control endpoints had no auth. Add it using the SAME shared lib the vizier broker/PEM use (px.dev/pixie/src/shared/services/utils): SetAuth verifies a bearer JWT via jwtutils.ParseToken (signature + audience), middleware on Handler() with /healthz exempt. dx already mints this exact service JWT (GenerateJWTForService, PL_JWT_SIGNING_KEY) — it just attaches it. No new secret/crypto. Flag-gated (CONTROL_REQUIRE_AUTH + PL_JWT_SIGNING_KEY), default OFF so it merges before dx sends the bearer; flip on after dx is updated. Also: reject invalid t_end (<=0) and query windows (start>=end, non-positive). Test mints a real JWT via the shared lib; asserts 401 on missing/bad token, pass on valid, /healthz open.
📝 WalkthroughWalkthroughUpdates adaptive export runtime behavior, adds a load-test rig and SOC workflow, and introduces cockpit and Terraform cloud configuration. ChangesAdaptive Export Operator and Load-Test Rig
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
- controller: don't fan out Pixie rows when attribution Sink.Write fails (avoids orphaned rows; release in-flight slot, non-fatal). (controller.go:347) - controller.Rehydrate: re-arm rev-1 pushPixieRows for restored windows so a restart doesn't silently miss post-restart Pixie data. (controller.go:255) - passthrough.pull: per-table timeout context so a hung dependency can't stall the sweep / delay shutdown (covers serial+concurrent ticks). (passthrough.go:147) - schema: TTL (30d) on ae_reconcile to cap append-only growth. (schema.sql:485) Verified no-change-needed: adaptive_attribution ORDER BY (hostname, anomaly_hash) is safe — anomaly_hash already encodes namespace+pod (anomaly/hash.go), so rows never collapse across ns/pod. (schema.sql:430) Already on ae-prod (no-op): filter timer leak, stats.py EXACT guard, sink async, watermark paging, pixieapi WithDirectTLSSkipVerify, http.Server timeouts.
|
@dx-agent — cut you a release to regression-test this followup. Release: What changed vs aeprod19 (what to exercise):
Ask: swap your rig's AE to Thanks! |
There was a problem hiding this comment.
Actionable comments posted: 44
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/e2e_log4shell_soc.yaml:
- Around line 14-31: The e2e_log4shell_soc.yaml workflow lacks concurrency
guards, which allows multiple simultaneous runs to race and corrupt shared
self-hosted runner resources like k3s and /etc/rancher/k3s/. Add a concurrency
block at the top level of the workflow (alongside the on and jobs sections) that
specifies a concurrency group to ensure only one run executes at a time. The
concurrency group should cancel in-progress runs when a new run is triggered,
using a meaningful identifier like the workflow name or a fixed string to ensure
all runs of this workflow serialize properly on the oracle-vm-16cpu-64gb-x86-64
runner.
- Around line 58-65: The workflow is directly interpolating `inputs.soc_ref` and
`inputs.dx_image` into shell commands within the run step, creating a command
injection vulnerability. Move these inputs to an env section at the step level,
assign them to environment variables, and validate that they contain only
allowed characters (e.g., alphanumeric, hyphens, dots for versions). Replace the
inline `${{ inputs.soc_ref }}` and `${{ inputs.dx_image }}` references in the
git clone and kubectl commands with properly quoted environment variable
references like `"$SOC_REF"` and `"$DX_IMAGE"` respectively.
In @.github/workflows/vizier_release.yaml:
- Line 143: The runner label oracle-vm-16cpu-64gb-x86-64 in the runs-on field is
not recognized by actionlint, causing lint validation to fail. If this is an
intentional custom self-hosted runner label, register it in your actionlint
configuration file (typically .actionlintrc or actionlint config) to whitelist
this label and allow the workflow to pass validation. Alternatively, if this
label does not exist in your infrastructure, replace it with a valid standard
GitHub Actions runner label or your organization's properly configured
self-hosted runner label.
In `@src/e2e_test/adaptive_export_loadtest/CONTRACTS.md`:
- Line 96: The `## Legend` heading is missing required blank lines for markdown
linting compliance with rule MD022. Add a blank line before the `## Legend`
heading and a blank line after it to separate it from the surrounding content.
This ensures proper spacing around the heading section in the CONTRACTS.md file.
In `@src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md`:
- Around line 50-59: The markdown file has formatting violations in fenced code
blocks at lines 50-59 and 95-97. Add a blank line before and after each fenced
code block to comply with MD031, and add language identifiers to the opening
fence markers (e.g., change ``` to ```sql for SQL code blocks and ```text or
appropriate language for other blocks). This applies to all fenced code blocks
that currently lack language tags or proper spacing.
In `@src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md`:
- Around line 17-18: The documentation in EXPERIMENTS.md incorrectly states that
fixture timestamps are in unix nanoseconds, but the actual harness
implementation in lib.sh (lines 145-147) uses seconds for the event_time field.
Update the documentation at lines 17-18 to accurately reflect that timestamps
are in seconds rather than nanoseconds, and also correct the same misstatement
at lines 35-36 to ensure consistency with the actual harness behavior and
prevent users from undersizing ADAPTIVE_WINDOW_BEFORE_SEC.
In `@src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh`:
- Around line 77-80: The first inj command in the E4 test case (line 79) uses ||
true which silently suppresses injection failures, masking when the first
boundary-collision row fails to insert. This weakens the test reproducibility
assertion. Apply the same error handling pattern used in the second inj command
(line 80) to the first inj command, replacing the || true with a proper error
handler that logs the INJECT_FAIL status and continues the experiment, ensuring
both fixture injections are properly validated before the test proceeds.
In `@src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh`:
- Around line 74-80: The INJECT command invocations in the stall-detection
experiment are using the || true pattern to suppress failures, which causes
failed anomaly inserts to be treated as success and can produce false STALL
flags. Remove the || true suffix from both INJECT command calls (the first one
with the --same-time flag and the second one without it) to allow injection
failures to properly propagate and be detected, ensuring accurate stall
detection in the experiment.
In `@src/e2e_test/adaptive_export_loadtest/harness/exp_matrix.sh`:
- Around line 80-85: The measure function accepts a valid parameter but does not
include it when writing results to the RES file at line 84. Add the valid
parameter to the printf statement that writes to RES so it captures whether each
measurement is valid. Additionally, update the summary computation at lines
121-124 to filter the results in RES by the validity column before computing the
aggregate statistics, ensuring the reported means only include measurements
marked as valid.
- Around line 114-116: The matrix arms currently have
ADAPTIVE_PUSH_PIXIE_ROWS=false set on both the ALL arm run_arm call at line 114
and the DX arm run_arm call at line 116, which disables the data-plane write
path and prevents proper measurement of data-volume reduction. To fix this,
change ADAPTIVE_PUSH_PIXIE_ROWS=false to ADAPTIVE_PUSH_PIXIE_ROWS=true in at
least one of these run_arm calls (or both as needed) to enable the data-plane
write path so that data-volume reduction can be properly measured during the
adaptive export load test.
- Line 78: The r0001_recent function has a timestamp unit mismatch where it
multiplies by 1000000000 to convert to nanoseconds, but the event_time field is
stored in seconds according to the harness documentation. This causes the WHERE
clause comparison to fail or undercount valid events. Fix this by removing the
multiplication factor of 1000000000 from the timestamp calculation, so that the
comparison is between seconds and seconds instead of seconds and nanoseconds.
Change the expression from toUInt64((now()-130))*1000000000 to
toUInt64(now()-130).
In `@src/e2e_test/adaptive_export_loadtest/harness/exp_row_reconcile.sh`:
- Around line 54-56: The reconciliation test setup on line 55 of
exp_row_reconcile.sh disables the write path by setting
ADAPTIVE_PUSH_PIXIE_ROWS=false, which prevents the CH protocol-row writes that
the test is meant to validate, causing false reconciliation failures. Remove the
ADAPTIVE_PUSH_PIXIE_ROWS=false flag from the kubectl set env command in the
adaptive-export deployment to ensure the write path remains enabled during the
reconciliation test.
In `@src/e2e_test/adaptive_export_loadtest/harness/inject.sh`:
- Around line 23-27: The row-shape header comment in the inject.sh file
incorrectly documents event_time as unix-NANOS on line 26, but the actual
contract documented in lines 32-37 and the script's usage shows it is
unix-SECONDS. Correct the comment to change "event_time (UInt64 unix-NANOS)" to
"event_time (UInt64 unix-SECONDS)" to accurately reflect the seconds unit that
the script actually uses and expects.
- Around line 96-108: Replace the predictable temporary filename
`/tmp/inject_resp.$$` with a securely generated temporary file using `mktemp`.
Create a new temporary file at the start of the code block using `mktemp` and
store it in a variable, then substitute all occurrences of `/tmp/inject_resp.$$`
with this variable throughout the curl command, the response file output
redirection, the cat command that displays the response, and the rm command that
cleans up the file. This eliminates the security risk of predictable temp file
names that could be subject to symlink attacks or collisions.
In `@src/e2e_test/adaptive_export_loadtest/harness/lib.sh`:
- Line 33: The bash script header marks KUBECONFIG as required using set -uo
pipefail, but the kubectl operations at lines 51-53 execute without verifying
that KUBECONFIG is actually set and points to a valid file. Add an explicit
guard before the kubectl operations to check that KUBECONFIG is set and
non-empty, ensuring that kubectl commands execute against the intended cluster
context and not the caller's default context. This guard should exit with an
error if KUBECONFIG is unset or empty to prevent unintended mutations.
- Around line 89-94: The chq function is missing timeout controls on the curl
command, which can cause the harness to hang indefinitely if the port-forward or
ClickHouse service becomes unresponsive. Add curl timeout options such as
--connect-timeout for the connection establishment phase and --max-time for the
overall request duration to the curl invocation in the chq function to ensure
the harness can proceed even if the backend stalls.
In `@src/e2e_test/adaptive_export_loadtest/harness/log4shell_fire.sh`:
- Around line 61-66: Add validation immediately after the kubectl commands that
set the BIP and BPORT variables to ensure both contain non-empty values before
proceeding to the curl loop. If either BIP or BPORT is empty, the script should
exit with a clear error message indicating backend service resolution failed.
This prevents malformed requests from being sent by the attacker deployment and
ensures the script fails fast with meaningful diagnostics instead of producing
misleading log4j vulnerability detection results.
In `@src/e2e_test/adaptive_export_loadtest/harness/nfr.sh`:
- Around line 49-56: The script detects no-loss proof mismatches by printing
MISMATCH or READ!=WROTE in the loop that iterates over http_events, dns_events,
and conn_stats, but never exits with a failure status, causing automated runs to
pass despite data loss. Add a flag variable initialized before the loop to track
whether any mismatches occur (checking both the wrote versus CH_rows comparison
and the read versus wrote comparison), then after the loop completes, check this
flag and exit the script with a non-zero status code if any mismatches were
detected.
- Around line 34-39: The setarm function on line 34 suppresses all output and
errors from the kubectl commands using `>/dev/null 2>&1`, which hides failures
from kubectl set env and kubectl rollout status. The run_phase function on line
38 calls setarm without checking its return status, allowing the test to proceed
even if the adaptive-export reconfiguration fails. Remove the output suppression
redirection from the setarm function to allow errors to be visible, and add a
return status check immediately after the setarm call in run_phase so that the
phase aborts with an appropriate error message if the kubectl reconfiguration
fails.
In `@src/e2e_test/adaptive_export_loadtest/harness/run.sh`:
- Around line 32-33: The EVID variable assignment on line 32 uses a hardcoded
user-specific path that is not portable across different environments. Replace
the hardcoded path with a portable default, such as using environment variables
or a system temporary directory. Additionally, the critical pipeline stages in
lines 40-56 lack explicit failure checks, allowing the script to continue even
if a stage fails and produces an invalid verdict set. Add explicit error
handling (such as set -e at the beginning of the script or individual error
checks) to ensure the script exits immediately when any critical stage fails.
In `@src/e2e_test/adaptive_export_loadtest/harness/stats.py`:
- Around line 57-76: The issue is that when there are no numeric PASS values for
a metric (the vals list is empty), the code continues without setting repro_ok
to False, allowing a false positive "EXACTLY REPRODUCIBLE" verdict. To fix this,
add repro_ok = False before the continue statement in the block where not vals
is True, so that any metric lacking valid PASS measurements will correctly mark
the reproducibility check as failed.
In `@src/e2e_test/adaptive_export_loadtest/k8s/00-sinks.yaml`:
- Around line 28-37: Add security hardening to the httpsink container
specification in the pod spec by setting automountServiceAccountToken to false
at the spec level to prevent service account token mounting, and within the
container specification add allowPrivilegeEscalation set to false, drop
unnecessary Linux capabilities, and configure a seccomp profile to restrict
system calls. Apply the same hardening pattern to the pg-sink pod specification
as well to ensure both long-lived sink pods follow the same security baseline
and reduce unnecessary privilege and credential exposure.
In `@src/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yaml`:
- Around line 18-38: The gen-pod.tmpl.yaml pod spec lacks essential security
hardening configurations that should be applied to reduce the attack surface.
Add a securityContext block within the pod spec to enforce non-root execution,
disable privilege escalation, and set a read-only root filesystem. Additionally,
add automountServiceAccountToken set to false at the pod spec level to prevent
unnecessary service account token mounting. These hardening measures should be
applied to the spec section containing the restartPolicy and containers
definitions to ensure the cleanloadgen workload runs with minimal privileges and
reduced security risks.
In `@src/e2e_test/adaptive_export_loadtest/README.md`:
- Around line 51-57: The markdown file has linting compliance issues with the
Layout heading and its associated code fence block. First, ensure there is a
blank line before the "## Layout" heading and a blank line after it to satisfy
MD022 heading spacing requirements. Second, add a language tag to the opening
fence (using triple backticks) to satisfy MD040 requirements. Third, ensure
there is a blank line before the opening fence and a blank line after the
closing fence to satisfy MD031 fenced block spacing requirements. These same
spacing rules also apply to the heading at line 71.
In
`@src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/cleanloadgen/main.go`:
- Around line 64-71: The envInt function currently accepts negative integer
values from environment variables, which can cause unintended behavior in
generation loops. After successfully parsing the integer value with strconv.Atoi
in the envInt function, add a validation check to ensure the parsed value is
non-negative. If the value is negative, call fatalf with an appropriate error
message indicating that the environment variable must be a non-negative integer.
In `@src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod`:
- Line 3: The Go version directive in the go.mod file is set to version 1.22,
which is no longer receiving active security updates. Update the go directive
from version 1.22 to version 1.26 (or 1.25 if necessary) to ensure the loadgen
binaries run on a currently supported Go runtime baseline and receive ongoing
security patches.
In `@src/vizier/services/adaptive_export/cmd/main.go`:
- Around line 799-807: The issue is that isOperatorManagedScript() function uses
a broad prefix check (ch-) to identify operator-managed scripts, which can
accidentally match user-authored scripts that start with ch-. This causes
unintended deletion of user scripts when INSTALL_PRESET_SCRIPTS=true. Replace
the prefix-based matching logic in isOperatorManagedScript() with an explicit
list of known operator-managed script names that the function should check
against. Similarly, apply the same fix to the duplicate logic mentioned at lines
839-847 to ensure both locations use the same explicit script name matching
approach.
In `@src/vizier/services/adaptive_export/internal/chhttp/chhttp.go`:
- Around line 36-38: The QueryStream method claims to support unbounded
incremental streaming but uses the same http.Client that has a timeout set via
DefaultTimeout. In Go, http.Client.Timeout applies to the entire
request-response cycle including body reads, which will cut off long-running
responses mid-stream. For QueryStream, create a separate http.Client instance
without a timeout constraint, or modify the client initialization to
conditionally apply the timeout only for non-streaming operations. Ensure
QueryStream uses the timeout-free client while other operations continue to use
the DefaultTimeout-constrained client.
In `@src/vizier/services/adaptive_export/internal/clickhouse/integration_test.go`:
- Around line 145-152: The test uses a single 30-second context timeout for both
the Apply and VerifyPixieSchema operations, which means if Apply consumes most
of the time, VerifyPixieSchema may fail with a deadline exceeded error even
though the schema is correct. Create separate context.WithTimeout calls for each
operation so that both Apply and VerifyPixieSchema each get their own
independent 30-second timeout deadline.
In `@src/vizier/services/adaptive_export/internal/clickhouse/schema.sql`:
- Around line 68-70: The kubescape_logs table has unsafe unit handling for the
event_time column in both the PARTITION BY and TTL expressions. The
toDateTime(event_time) calls assume event_time is in seconds, but if the actual
values are in milliseconds or nanoseconds, this will produce incorrect partition
keys and TTL horizons. Determine the actual unit of event_time (likely
milliseconds or nanoseconds), then update both the PARTITION BY
toYYYYMM(toDateTime(event_time)) expression on line 68 and the TTL
toDateTime(event_time) + INTERVAL 30 DAY DELETE expression on line 69 to apply
the appropriate unit conversion before passing to toDateTime(), such as dividing
by 1000 for milliseconds or by 1000000000 for nanoseconds.
In `@src/vizier/services/adaptive_export/internal/control/server_test.go`:
- Around line 155-169: The TestBadInputRejected test function is missing
explicit test cases for the new t_end and window validators. Add test cases that
verify the server returns http.StatusBadRequest for t_end <= 0 on the
/export/start endpoint, and for window[0] >= window[1] and non-positive window
values on the /query endpoint. Follow the existing pattern in
TestBadInputRejected by using the do() helper function with appropriate test
payloads for each validator and asserting the expected 400 status code with
descriptive failure messages.
In `@src/vizier/services/adaptive_export/internal/control/server.go`:
- Around line 120-143: The handleDXAttackGraph method is vulnerable to memory
DoS attacks because the decode function reads the entire unbounded request body
and then copies it into a second buffer. Wrap the request body with
http.MaxBytesReader to enforce a maximum size limit before passing the request
to the decode function. This will ensure oversized requests fail gracefully
instead of exhausting memory resources.
In `@src/vizier/services/adaptive_export/internal/controller/controller.go`:
- Around line 316-366: The code modifies c.active[hash] before calling
c.sink.Write(), but if the write fails, it returns without rolling back the
changes to c.active. This allows in-flight workers already processing the same
hash to continue using unpersisted state extensions. Save the prior state of
c.active[hash] before making any modifications (before the block that either
creates a new row or updates the existing row), and then add rollback logic in
the error handling block after the c.sink.Write() call fails to restore the
original state, ensuring that in-flight workers do not reference orphaned state.
In `@src/vizier/services/adaptive_export/internal/e2e/e2e_test.go`:
- Around line 146-167: The test currently validates only the payload shape
within the INSERT bodies, but does not verify that the INSERT statement targets
the correct table forensic_db.adaptive_attribution. Add an assertion that checks
the captured SQL statements (via insertedSQL) to ensure they contain the correct
table reference, preventing regressions where the wrong table could be inserted
into while still maintaining the correct payload structure. This check should be
added alongside or after the existing body content validation loop that checks
for matched payloads.
In `@src/vizier/services/adaptive_export/internal/passthrough/passthrough.go`:
- Around line 124-131: The compiled path silently skips invalid tables when
pxl.CompilePassthrough fails but does not record these failures using
l.rec(...), which changes the error recording semantics compared to legacy mode.
In the error handling block where pxl.CompilePassthrough(table, cfg.Window)
returns an error, add a call to l.rec(...) to record the failure before
continuing to skip the table. Apply this same fix to the similar error handling
code at lines 235-239 that also needs to record failures when tables cannot be
compiled.
In `@src/vizier/services/adaptive_export/internal/passthrough/reconcile_test.go`:
- Around line 34-37: The capRec struct's Record method appends to the rows slice
without synchronization, which causes race conditions when multiple goroutines
call Record concurrently. Add a sync.Mutex field to the capRec struct and use it
to protect the append operation in the Record method by locking before appending
to rows and unlocking after. Also update any code that reads the rows field to
acquire the same lock before accessing the slice to ensure consistent visibility
of all recorded rows.
In `@src/vizier/services/adaptive_export/internal/pixieapi/pixieapi.go`:
- Around line 112-160: Add a TODO comment to the Query method of the Adapter
struct that documents the need to implement a long-lived client with JWT-refresh
mechanism for high-throughput direct-mode scenarios. Place this TODO near the
existing comment about GC-time reclamation and connection-leak mitigation to
ensure future developers are aware of the technical debt and the conditions
under which it should be addressed.
In `@src/vizier/services/adaptive_export/internal/sink/clickhouse.go`:
- Around line 190-207: The unconditional Info log statement that logs "sink:
pixie write completed" with the WithFields call executes on every batch write in
the hot path, creating excessive log volume. Wrap the entire logging block (from
the summary := res.Summary assignment through the Info call) behind a
debug-level check using log.IsLevelEnabled(log.DebugLevel) or a temporary
feature flag so this diagnostic logging only occurs when explicitly enabled for
debugging rather than on every write.
In `@src/vizier/services/adaptive_export/internal/sink/fastencode_test.go`:
- Around line 36-58: The encodeViaJSON and parseNDJSON helper functions are
ignoring errors from enc.Encode and json.Unmarshal respectively, which can allow
tests to pass with invalid data. Modify both encodeViaJSON and parseNDJSON to
accept a *testing.T parameter as the first argument, then check the error
returns from enc.Encode and json.Unmarshal and call t.Fatalf to fail the test
immediately if either operation returns an error. Finally, update all call sites
throughout the test file where encodeViaJSON and parseNDJSON are invoked to pass
the test instance t as the first argument.
In `@src/vizier/services/adaptive_export/internal/sink/fastencode.go`:
- Around line 184-187: The appendFloat function is being called with float32 and
float64 values without first validating that they are finite numbers. Non-finite
values like NaN and Infinity will be emitted as invalid JSON, causing
whole-batch inserts to fail. Add a check before calling appendFloat in both the
float32 case (around line 184) and float64 case (around line 186) to verify the
value is finite using a function like math.IsNaN or math.IsInf. If a non-finite
value is detected, reject it or handle it appropriately instead of passing it to
appendFloat. Apply the same validation pattern to the similar float handling
code around lines 215-218.
In `@src/vizier/services/adaptive_export/internal/sink/integration_test.go`:
- Around line 158-164: The shared 60-second timeout context created before the
loop causes the test to be flaky because it applies to the entire table
iteration sequence rather than individual tables. Move the context.WithTimeout
call inside the for loop that iterates over chpkg.PixieTables() so that each
table gets its own independent 60-second timeout. This way, if one table's
WritePixieRows call takes time, it won't affect the timeout for subsequent
tables, ensuring failures are isolated to the specific table being tested.
In `@src/vizier/services/adaptive_export/internal/streaming/writer.go`:
- Around line 117-134: The flush() method clears buf unconditionally after
attempting WritePixieRows, which causes data loss when the write fails. Only
clear buf by setting it to buf[:0] when WritePixieRows succeeds (move this
operation into the else block), not in the error case. Additionally, in the
shutdown branch (lines 138-140), the timeout context is derived from an
already-canceled parent context, causing the final flush to fail fast. For the
shutdown case, create the timeout context using context.Background() instead of
the canceled parent context to give the final flush operation a proper chance to
complete.
In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go`:
- Around line 73-78: The PollLimit field documentation incorrectly states that 0
means unlimited behavior, but the New() function actually rewrites 0 to 10000.
Update the comment for the PollLimit field to accurately reflect the actual
runtime behavior that 0 is rewritten to the default value of 10000, ensuring
operators understand the correct semantics when tuning this setting.
In
`@src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go`:
- Around line 123-132: The function fingerprintNoFmt is intended to be a
reference implementation that avoids fmt.Fprintf overhead for benchmarking
purposes, but the code still contains a fmt.Fprintf call on the line formatting
r.EventTime with the "%d" format string. Replace this fmt.Fprintf call with a
direct string conversion using strconv (such as strconv.FormatInt) to convert
the EventTime value to a string, then write it directly to the strings.Builder
using WriteString method to truly eliminate the formatting overhead from the
benchmark baseline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 0f1144a8-c2c2-4681-a318-0629757dcbdb
📒 Files selected for processing (116)
.arclint.bazelignore.github/workflows/e2e_log4shell_soc.yaml.github/workflows/vizier_release.yamlk8s/vizier/bootstrap/adaptive_export_deployment.yamlk8s/vizier/bootstrap/adaptive_export_secrets.yamlk8s/vizier/bootstrap/kustomization.yamlskaffold/skaffold_vizier.yamlsrc/api/go/pxapi/opts.gosrc/e2e_test/adaptive_export_loadtest/CONTRACTS.mdsrc/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.mdsrc/e2e_test/adaptive_export_loadtest/README.mdsrc/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.mdsrc/e2e_test/adaptive_export_loadtest/harness/ae_config.shsrc/e2e_test/adaptive_export_loadtest/harness/exp_control.shsrc/e2e_test/adaptive_export_loadtest/harness/exp_e5.shsrc/e2e_test/adaptive_export_loadtest/harness/exp_e8.shsrc/e2e_test/adaptive_export_loadtest/harness/exp_matrix.shsrc/e2e_test/adaptive_export_loadtest/harness/exp_row_reconcile.shsrc/e2e_test/adaptive_export_loadtest/harness/inject.shsrc/e2e_test/adaptive_export_loadtest/harness/lib.shsrc/e2e_test/adaptive_export_loadtest/harness/log4shell_fire.shsrc/e2e_test/adaptive_export_loadtest/harness/nfr.shsrc/e2e_test/adaptive_export_loadtest/harness/run.shsrc/e2e_test/adaptive_export_loadtest/harness/stats.pysrc/e2e_test/adaptive_export_loadtest/k8s/00-sinks.yamlsrc/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yamlsrc/e2e_test/adaptive_export_loadtest/tools/loadgen/Dockerfilesrc/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/cleanloadgen/main.gosrc/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/httpsink/main.gosrc/e2e_test/adaptive_export_loadtest/tools/loadgen/go.modsrc/stirling/source_connectors/socket_tracer/testing/container_images/BUILD.bazelsrc/vizier/services/adaptive_export/BUILD.bazelsrc/vizier/services/adaptive_export/cmd/BUILD.bazelsrc/vizier/services/adaptive_export/cmd/main.gosrc/vizier/services/adaptive_export/internal/activeset/BUILD.bazelsrc/vizier/services/adaptive_export/internal/activeset/activeset.gosrc/vizier/services/adaptive_export/internal/activeset/activeset_test.gosrc/vizier/services/adaptive_export/internal/anomaly/BUILD.bazelsrc/vizier/services/adaptive_export/internal/anomaly/hash.gosrc/vizier/services/adaptive_export/internal/anomaly/hash_bench_test.gosrc/vizier/services/adaptive_export/internal/anomaly/hash_test.gosrc/vizier/services/adaptive_export/internal/chhttp/BUILD.bazelsrc/vizier/services/adaptive_export/internal/chhttp/chhttp.gosrc/vizier/services/adaptive_export/internal/chhttp/chhttp_test.gosrc/vizier/services/adaptive_export/internal/clickhouse/BUILD.bazelsrc/vizier/services/adaptive_export/internal/clickhouse/apply.gosrc/vizier/services/adaptive_export/internal/clickhouse/apply_test.gosrc/vizier/services/adaptive_export/internal/clickhouse/columns_test.gosrc/vizier/services/adaptive_export/internal/clickhouse/ddl.gosrc/vizier/services/adaptive_export/internal/clickhouse/ddl_test.gosrc/vizier/services/adaptive_export/internal/clickhouse/insert.gosrc/vizier/services/adaptive_export/internal/clickhouse/insert_test.gosrc/vizier/services/adaptive_export/internal/clickhouse/integration_test.gosrc/vizier/services/adaptive_export/internal/clickhouse/schema.sqlsrc/vizier/services/adaptive_export/internal/config/BUILD.bazelsrc/vizier/services/adaptive_export/internal/config/definition.gosrc/vizier/services/adaptive_export/internal/control/BUILD.bazelsrc/vizier/services/adaptive_export/internal/control/server.gosrc/vizier/services/adaptive_export/internal/control/server_test.gosrc/vizier/services/adaptive_export/internal/controller/BUILD.bazelsrc/vizier/services/adaptive_export/internal/controller/controller.gosrc/vizier/services/adaptive_export/internal/controller/controller_test.gosrc/vizier/services/adaptive_export/internal/e2e/BUILD.bazelsrc/vizier/services/adaptive_export/internal/e2e/e2e_test.gosrc/vizier/services/adaptive_export/internal/e2e/loadtest_test.gosrc/vizier/services/adaptive_export/internal/kubescape/BUILD.bazelsrc/vizier/services/adaptive_export/internal/kubescape/extract.gosrc/vizier/services/adaptive_export/internal/kubescape/extract_test.gosrc/vizier/services/adaptive_export/internal/passthrough/BUILD.bazelsrc/vizier/services/adaptive_export/internal/passthrough/compiled_test.gosrc/vizier/services/adaptive_export/internal/passthrough/passthrough.gosrc/vizier/services/adaptive_export/internal/passthrough/passthrough_test.gosrc/vizier/services/adaptive_export/internal/passthrough/reconcile_test.gosrc/vizier/services/adaptive_export/internal/pixie/pixie.gosrc/vizier/services/adaptive_export/internal/pixieapi/BUILD.bazelsrc/vizier/services/adaptive_export/internal/pixieapi/pixieapi.gosrc/vizier/services/adaptive_export/internal/pixieapi/pixieapi_test.gosrc/vizier/services/adaptive_export/internal/pxl/BUILD.bazelsrc/vizier/services/adaptive_export/internal/pxl/compile.gosrc/vizier/services/adaptive_export/internal/pxl/compile_test.gosrc/vizier/services/adaptive_export/internal/pxl/pxl.gosrc/vizier/services/adaptive_export/internal/pxl/queryfor.gosrc/vizier/services/adaptive_export/internal/pxl/queryfor_bench_test.gosrc/vizier/services/adaptive_export/internal/pxl/queryfor_test.gosrc/vizier/services/adaptive_export/internal/pxl/tables.gosrc/vizier/services/adaptive_export/internal/pxl/tables_test.gosrc/vizier/services/adaptive_export/internal/reconcile/BUILD.bazelsrc/vizier/services/adaptive_export/internal/reconcile/reconcile.gosrc/vizier/services/adaptive_export/internal/script/script.gosrc/vizier/services/adaptive_export/internal/sink/BUILD.bazelsrc/vizier/services/adaptive_export/internal/sink/clickhouse.gosrc/vizier/services/adaptive_export/internal/sink/clickhouse_test.gosrc/vizier/services/adaptive_export/internal/sink/content_type_contract_test.gosrc/vizier/services/adaptive_export/internal/sink/encode_bench_test.gosrc/vizier/services/adaptive_export/internal/sink/fastencode.gosrc/vizier/services/adaptive_export/internal/sink/fastencode_test.gosrc/vizier/services/adaptive_export/internal/sink/integration_test.gosrc/vizier/services/adaptive_export/internal/streaming/BUILD.bazelsrc/vizier/services/adaptive_export/internal/streaming/filter.gosrc/vizier/services/adaptive_export/internal/streaming/filter_test.gosrc/vizier/services/adaptive_export/internal/streaming/integration_test.gosrc/vizier/services/adaptive_export/internal/streaming/notifier.gosrc/vizier/services/adaptive_export/internal/streaming/notifier_test.gosrc/vizier/services/adaptive_export/internal/streaming/scanner.gosrc/vizier/services/adaptive_export/internal/streaming/scanner_test.gosrc/vizier/services/adaptive_export/internal/streaming/supervisor.gosrc/vizier/services/adaptive_export/internal/streaming/writer.gosrc/vizier/services/adaptive_export/internal/trigger/BUILD.bazelsrc/vizier/services/adaptive_export/internal/trigger/clickhouse.gosrc/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.gosrc/vizier/services/adaptive_export/internal/trigger/clickhouse_test.gosrc/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.gosrc/vizier/services/adaptive_export/internal/trigger/integration_test.gosrc/vizier/services/adaptive_export/internal/trigger/watermark.gosrc/vizier/services/adaptive_export/internal/trigger/watermark_test.go
💤 Files with no reviewable changes (2)
- src/vizier/services/adaptive_export/internal/pxl/pxl.go
- src/vizier/services/adaptive_export/internal/config/definition.go
| isolates each rep even with overlapping windows). Timestamps are explicit unix | ||
| nanos — fixtures NEVER use wall-clock `now()`. |
There was a problem hiding this comment.
Timestamp unit contract is inconsistent with harness behavior.
Line 17 and Line 35 document nanosecond semantics, but src/e2e_test/adaptive_export_loadtest/harness/lib.sh (Line 145–147) states fixture event_time is in seconds. Following this doc as written can undersize ADAPTIVE_WINDOW_BEFORE_SEC and miss band rows.
Suggested doc correction
- Timestamps are explicit unix nanos — fixtures NEVER use wall-clock `now()`.
+ Timestamps are explicit unix seconds — fixtures NEVER use wall-clock `now()`.
- 3. AE config: `ADAPTIVE_WINDOW_BEFORE_SEC ≥ (B1−B0)/1e9 + margin` so window start
+ 3. AE config: `ADAPTIVE_WINDOW_BEFORE_SEC ≥ (B1−B0) + margin` so window startAlso applies to: 35-36
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md` around lines
17 - 18, The documentation in EXPERIMENTS.md incorrectly states that fixture
timestamps are in unix nanoseconds, but the actual harness implementation in
lib.sh (lines 145-147) uses seconds for the event_time field. Update the
documentation at lines 17-18 to accurately reflect that timestamps are in
seconds rather than nanoseconds, and also correct the same misstatement at lines
35-36 to ensure consistency with the actual harness behavior and prevent users
from undersizing ADAPTIVE_WINDOW_BEFORE_SEC.
| // fingerprintNoFmt is the Fprintf-free reference. Same output guarantee | ||
| // is NOT asserted here — this is a perf-comparison anchor only. If the | ||
| // numbers diverge by >2× from rowFingerprint, the fmt.Fprintf path is | ||
| // a real cost. | ||
| func fingerprintNoFmt(r kubescape.Row) string { | ||
| h := sha256.New() | ||
| var b strings.Builder | ||
| b.Grow(64 + len(r.RuleID) + len(r.Hostname) + len(r.K8sDetails) + len(r.ProcessDetails)) | ||
| _, _ = fmt.Fprintf(&b, "%d", r.EventTime) | ||
| b.WriteByte(0) |
There was a problem hiding this comment.
fingerprintNoFmt still uses fmt.Fprintf, so the benchmark baseline is skewed.
Line 131 keeps formatting overhead in the “no fmt” path, which weakens the comparison.
Proposed fix
import (
"crypto/sha256"
"encoding/hex"
- "fmt"
+ "strconv"
"strings"
"testing"
@@
- _, _ = fmt.Fprintf(&b, "%d", r.EventTime)
+ var nbuf [20]byte
+ b.Write(strconv.AppendUint(nbuf[:0], r.EventTime, 10))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go`
around lines 123 - 132, The function fingerprintNoFmt is intended to be a
reference implementation that avoids fmt.Fprintf overhead for benchmarking
purposes, but the code still contains a fmt.Fprintf call on the line formatting
r.EventTime with the "%d" format string. Replace this fmt.Fprintf call with a
direct string conversion using strconv (such as strconv.FormatInt) to convert
the EventTime value to a string, then write it directly to the strings.Builder
using WriteString method to truly eliminate the formatting overhead from the
benchmark baseline.
| on: | ||
| workflow_dispatch: | ||
| inputs: | ||
| dx_image: | ||
| description: dx-daemon image to test (default = .image-tags pin) | ||
| required: false | ||
| default: "" | ||
| soc_ref: | ||
| description: k8sstormcenter/soc branch | ||
| required: false | ||
| default: "218-clickhouse-schema" | ||
| permissions: | ||
| contents: read | ||
|
|
||
| jobs: | ||
| e2e: | ||
| runs-on: oracle-vm-16cpu-64gb-x86-64 # eBPF + 16cpu/64gb; ubuntu-latest cannot run Pixie | ||
| timeout-minutes: 90 |
There was a problem hiding this comment.
Add workflow concurrency guards for the self-hosted rig.
This job provisions global host resources (k3s, /etc/rancher/k3s/...). Without concurrency, overlapping manual runs can race and corrupt each other.
Proposed fix
on:
workflow_dispatch:
@@
+concurrency:
+ group: e2e-log4shell-soc-${{ github.ref }}
+ cancel-in-progress: false
+
jobs:
e2e:🧰 Tools
🪛 actionlint (1.7.12)
[error] 30-30: label "oracle-vm-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🪛 zizmor (1.25.2)
[info] 29-29: workflow or action definition without a name (anonymous-definition): this job
(anonymous-definition)
[warning] 14-24: insufficient job-level concurrency limits (concurrency-limits): workflow is missing concurrency setting
(concurrency-limits)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/e2e_log4shell_soc.yaml around lines 14 - 31, The
e2e_log4shell_soc.yaml workflow lacks concurrency guards, which allows multiple
simultaneous runs to race and corrupt shared self-hosted runner resources like
k3s and /etc/rancher/k3s/. Add a concurrency block at the top level of the
workflow (alongside the on and jobs sections) that specifies a concurrency group
to ensure only one run executes at a time. The concurrency group should cancel
in-progress runs when a new run is triggered, using a meaningful identifier like
the workflow name or a fixed string to ensure all runs of this workflow
serialize properly on the oracle-vm-16cpu-64gb-x86-64 runner.
Source: Linters/SAST tools
| E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash | ||
| filt="cp-e4-${R}" | ||
| inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true | ||
| inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; } |
There was a problem hiding this comment.
E4 currently masks a failed first fixture insert.
Line 79 uses || true, so E4 can pass even when the first boundary-collision row was never inserted. This weakens the reproducibility assertion for that experiment.
Proposed fix
E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
filt="cp-e4-${R}"
- inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true
+ inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time \
+ || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
;;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash | |
| filt="cp-e4-${R}" | |
| inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true | |
| inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; } | |
| E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash | |
| filt="cp-e4-${R}" | |
| inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time \ | |
| || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; } | |
| inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh` around lines 77
- 80, The first inj command in the E4 test case (line 79) uses || true which
silently suppresses injection failures, masking when the first
boundary-collision row fails to insert. This weakens the test reproducibility
assertion. Apply the same error handling pattern used in the second inj command
(line 80) to the first inj command, replacing the || true with a proper error
handler that logs the INJECT_FAIL status and continues the experiment, ensuring
both fixture injections are properly validated before the test proceeds.
| @@ -0,0 +1,5 @@ | |||
| module aeload | |||
|
|
|||
| go 1.22 | |||
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
What is the latest patched Go release in the 1.22 line, and which GO advisories are fixed there (including GO-2026-4870, GO-2026-4918, GO-2026-4980)?
💡 Result:
As of June 21, 2026, the latest release in the Go 1.22 line is Go 1.22.12 [1][2]. Regarding the specified Go security advisories: - GO-2026-4870: This advisory affects crypto/tls and is noted as affecting versions before go1.25.9, as well as versions from go1.26.0-0 before go1.26.2 [3]. It is not listed as affecting the 1.22 release line. - GO-2026-4918: This advisory concerns net/http and golang.org/x/net/http2 [4][5]. It affects net/http versions before go1.25.10 and versions from go1.26.0-0 before go1.26.3 [4]. It is not listed as affecting the 1.22 release line. - GO-2026-4980: This advisory concerns html/template and affects versions before go1.25.10, as well as versions from go1.26.0-0 before go1.26.3 [6]. It is not listed as affecting the 1.22 release line. Go 1.22 has reached a state where it is no longer receiving the same security updates as newer Go release lines (such as 1.25 and 1.26) [2][3][4][6]. Users are encouraged to upgrade to a supported release version [2].
Citations:
- 1: https://go.dev/doc/devel/release
- 2: https://endoflife.date/go
- 3: https://pkg.go.dev/vuln/GO-2026-4870
- 4: https://pkg.go.dev/vuln/GO-2026-4918
- 5: https://nvd.nist.gov/vuln/detail/CVE-2026-33814
- 6: https://pkg.go.dev/vuln/GO-2026-4980
Upgrade to a currently supported Go release.
Go 1.22 is no longer receiving active security updates. The specified Go advisories (GO-2026-4870, GO-2026-4918, GO-2026-4980) do not affect the 1.22 release line but affect newer versions. Upgrade to the latest 1.25 or 1.26 release to ensure the loadgen binaries run on a supported runtime baseline.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod` at line 3, The Go
version directive in the go.mod file is set to version 1.22, which is no longer
receiving active security updates. Update the go directive from version 1.22 to
version 1.26 (or 1.25 if necessary) to ensure the loadgen binaries run on a
currently supported Go runtime baseline and receive ongoing security patches.
Source: Linters/SAST tools
…#67) ae-prod's boundary handling accumulates the seen-fingerprint set on a no-progress tick, but if >PollLimit rows share one normalized event_time the SQL (>= watermark ORDER BY time LIMIT N, no secondary key) returns the same N boundary rows every poll → rows beyond N at that timestamp are never emitted (infinite boundary). Detect all-skipped-at-capacity and advance the watermark by 1ns to make forward progress (fingerprint dedup already tolerates the 1ns overlap). Cherry-picked the trigger fix + its test from the stale CodeRabbit-chat PR #67 (687851d); dropped that PR's unrelated gen-pod.tmpl.yaml churn. #67 itself is NOT mergeable (77 commits behind ae-prod, re-adds deleted terraform).
|
there are indications that it causes PEM to crash, needs RCA |
Auth without TLS is half a control: tcpdump on dx→AE :9100 captured 720 cleartext `Authorization: Bearer` JWTs in 70s — the #68 token crosses the CNI in plaintext. CONTROL_TLS=true now serves TLS with server.crt/key from the service-tls-certs secret (broker/PEM already use it; dx skip-verifies). Default-OFF for incremental rollout, symmetric to CONTROL_REQUIRE_AUTH. Stacks on #68 (ae-followup-auth). dx client half: entlein/dx#88. Co-authored-by: Entlein <eineintlein@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # src/vizier/services/adaptive_export/cmd/main.go # src/vizier/services/adaptive_export/internal/clickhouse/schema.sql # src/vizier/services/adaptive_export/internal/control/BUILD.bazel # src/vizier/services/adaptive_export/internal/control/server.go # src/vizier/services/adaptive_export/internal/control/server_test.go # src/vizier/services/adaptive_export/internal/controller/controller.go # src/vizier/services/adaptive_export/internal/passthrough/passthrough.go # src/vizier/services/adaptive_export/internal/trigger/clickhouse.go # src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go
|
Merged
Verified: 0 conflict markers, |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/vizier/services/adaptive_export/cmd/main.go (1)
583-588: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winFail-open when
CONTROL_REQUIRE_AUTH=truebut signing key is missing.If an operator explicitly sets
CONTROL_REQUIRE_AUTH=truebutPL_JWT_SIGNING_KEYis empty/unmounted, the condition falls through to theelsebranch and the control surface comes up with auth disabled — only aWarnis emitted. A misconfiguration (forgot to mount the key) thus silently leaves/export/start,/export/stop,/dx/attack_graph, and/queryopen, contradicting theREQUIREintent. This should fail-fast instead of fail-open.🔒 Proposed fix to fail-fast on the require-but-misconfigured case
- if key := os.Getenv("PL_JWT_SIGNING_KEY"); key != "" && os.Getenv("CONTROL_REQUIRE_AUTH") == "true" { - ctrlSrv.SetAuth(key, "vizier") - log.Info("control surface: bearer-JWT auth ENABLED (audience=vizier)") - } else { - log.Warn("control surface: auth DISABLED (set CONTROL_REQUIRE_AUTH=true + PL_JWT_SIGNING_KEY)") - } + if os.Getenv("CONTROL_REQUIRE_AUTH") == "true" { + key := os.Getenv("PL_JWT_SIGNING_KEY") + if key == "" { + log.Fatal("CONTROL_REQUIRE_AUTH=true but PL_JWT_SIGNING_KEY is empty — refusing to start an unauthenticated control surface") + } + ctrlSrv.SetAuth(key, "vizier") + log.Info("control surface: bearer-JWT auth ENABLED (audience=vizier)") + } else { + log.Warn("control surface: auth DISABLED (set CONTROL_REQUIRE_AUTH=true + PL_JWT_SIGNING_KEY)") + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/cmd/main.go` around lines 583 - 588, The current logic allows the control surface to start with auth disabled when CONTROL_REQUIRE_AUTH is explicitly set to "true" but PL_JWT_SIGNING_KEY is empty or missing. This is a fail-open misconfiguration scenario. Refactor the conditional logic to detect when CONTROL_REQUIRE_AUTH equals "true" but the signing key is empty, and in that case call log.Fatal with a descriptive error message to fail-fast instead of falling through to the else branch. The else branch should only warn and disable auth when CONTROL_REQUIRE_AUTH is not explicitly set to "true".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/vizier_release.yaml:
- Line 18: The runs-on field is missing the vm prefix in the oracle runner
label, causing actionlint to flag it as unknown. Update the runs-on field value
from oracle-16cpu-64gb-x86-64 to oracle-vm-16cpu-64gb-x86-64 by adding the vm
prefix to match the canonical form used consistently across other workflows and
documented in the infrastructure documentation.
---
Outside diff comments:
In `@src/vizier/services/adaptive_export/cmd/main.go`:
- Around line 583-588: The current logic allows the control surface to start
with auth disabled when CONTROL_REQUIRE_AUTH is explicitly set to "true" but
PL_JWT_SIGNING_KEY is empty or missing. This is a fail-open misconfiguration
scenario. Refactor the conditional logic to detect when CONTROL_REQUIRE_AUTH
equals "true" but the signing key is empty, and in that case call log.Fatal with
a descriptive error message to fail-fast instead of falling through to the else
branch. The else branch should only warn and disable auth when
CONTROL_REQUIRE_AUTH is not explicitly set to "true".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 0211e817-9080-4f53-bbd5-5df5e5a74ef9
📒 Files selected for processing (5)
.github/workflows/vizier_release.yamlk8s/vizier/bootstrap/adaptive_export_deployment.yamlsrc/vizier/services/adaptive_export/cmd/main.gosrc/vizier/services/adaptive_export/internal/trigger/clickhouse.gosrc/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/vizier/services/adaptive_export/cmd/main.go (1)
583-588: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winFail-open when
CONTROL_REQUIRE_AUTH=truebut signing key is missing.If an operator explicitly sets
CONTROL_REQUIRE_AUTH=truebutPL_JWT_SIGNING_KEYis empty/unmounted, the condition falls through to theelsebranch and the control surface comes up with auth disabled — only aWarnis emitted. A misconfiguration (forgot to mount the key) thus silently leaves/export/start,/export/stop,/dx/attack_graph, and/queryopen, contradicting theREQUIREintent. This should fail-fast instead of fail-open.🔒 Proposed fix to fail-fast on the require-but-misconfigured case
- if key := os.Getenv("PL_JWT_SIGNING_KEY"); key != "" && os.Getenv("CONTROL_REQUIRE_AUTH") == "true" { - ctrlSrv.SetAuth(key, "vizier") - log.Info("control surface: bearer-JWT auth ENABLED (audience=vizier)") - } else { - log.Warn("control surface: auth DISABLED (set CONTROL_REQUIRE_AUTH=true + PL_JWT_SIGNING_KEY)") - } + if os.Getenv("CONTROL_REQUIRE_AUTH") == "true" { + key := os.Getenv("PL_JWT_SIGNING_KEY") + if key == "" { + log.Fatal("CONTROL_REQUIRE_AUTH=true but PL_JWT_SIGNING_KEY is empty — refusing to start an unauthenticated control surface") + } + ctrlSrv.SetAuth(key, "vizier") + log.Info("control surface: bearer-JWT auth ENABLED (audience=vizier)") + } else { + log.Warn("control surface: auth DISABLED (set CONTROL_REQUIRE_AUTH=true + PL_JWT_SIGNING_KEY)") + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/cmd/main.go` around lines 583 - 588, The current logic allows the control surface to start with auth disabled when CONTROL_REQUIRE_AUTH is explicitly set to "true" but PL_JWT_SIGNING_KEY is empty or missing. This is a fail-open misconfiguration scenario. Refactor the conditional logic to detect when CONTROL_REQUIRE_AUTH equals "true" but the signing key is empty, and in that case call log.Fatal with a descriptive error message to fail-fast instead of falling through to the else branch. The else branch should only warn and disable auth when CONTROL_REQUIRE_AUTH is not explicitly set to "true".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/vizier_release.yaml:
- Line 18: The runs-on field is missing the vm prefix in the oracle runner
label, causing actionlint to flag it as unknown. Update the runs-on field value
from oracle-16cpu-64gb-x86-64 to oracle-vm-16cpu-64gb-x86-64 by adding the vm
prefix to match the canonical form used consistently across other workflows and
documented in the infrastructure documentation.
---
Outside diff comments:
In `@src/vizier/services/adaptive_export/cmd/main.go`:
- Around line 583-588: The current logic allows the control surface to start
with auth disabled when CONTROL_REQUIRE_AUTH is explicitly set to "true" but
PL_JWT_SIGNING_KEY is empty or missing. This is a fail-open misconfiguration
scenario. Refactor the conditional logic to detect when CONTROL_REQUIRE_AUTH
equals "true" but the signing key is empty, and in that case call log.Fatal with
a descriptive error message to fail-fast instead of falling through to the else
branch. The else branch should only warn and disable auth when
CONTROL_REQUIRE_AUTH is not explicitly set to "true".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 0211e817-9080-4f53-bbd5-5df5e5a74ef9
📒 Files selected for processing (5)
.github/workflows/vizier_release.yamlk8s/vizier/bootstrap/adaptive_export_deployment.yamlsrc/vizier/services/adaptive_export/cmd/main.gosrc/vizier/services/adaptive_export/internal/trigger/clickhouse.gosrc/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go
🛑 Comments failed to post (1)
.github/workflows/vizier_release.yaml (1)
18-18: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Runner label appears to be missing the
vmprefix.Line 18 specifies
oracle-16cpu-64gb-x86-64, but the change summary indicates the intent wasoracle-vm-16cpu-64gb-x86-64. The current label is flagged by actionlint as unknown, and thevm-prefixed variant is used consistently in.github/workflows/e2e_log4shell_soc.yamland documented in the infrastructure README as the canonical form.🔧 Proposed fix: add `vm` prefix
- runs-on: oracle-16cpu-64gb-x86-64 + runs-on: oracle-vm-16cpu-64gb-x86-64🧰 Tools
🪛 actionlint (1.7.12)
[error] 18-18: label "oracle-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/vizier_release.yaml at line 18, The runs-on field is missing the vm prefix in the oracle runner label, causing actionlint to flag it as unknown. Update the runs-on field value from oracle-16cpu-64gb-x86-64 to oracle-vm-16cpu-64gb-x86-64 by adding the vm prefix to match the canonical form used consistently across other workflows and documented in the infrastructure documentation.
… → 73.7% cov) The trigger's incremental kubescape-events pump has 3 moving parts that must agree: watermark advancement, boundary fingerprint dedup, and PollLimit-saturated draining (PR #67 fix). Existing tests cover each in isolation. This new file pins them TOGETHER against the simplest possible reference ("drain everything in event_time order, dedupe by fingerprint, advance the cursor to max(event_time)"). If the iterative trigger and the reference disagree on the set of emitted rows for ANY poll sequence, one of the three parts is wrong. Four oracle tests: TestOracle_TriggerEmitsNaiveSet_StaggeredCorpus — 50 rows scattered across distinct event_times, PollLimit=10 forces ≥5 polls; trigger must emit exactly the 50 rows the naive reference computes. TestOracle_PollLimitSaturation_AtCapacity — regression guard for PR #67 (dfdc465): when EXACTLY PollLimit rows share a boundary event_time, every one of them must emit, and the cursor must clear the boundary for the next-event_time row that follows. TestOracle_PollLimitOverflow_DocumentsLossBound — pins PR #67's intentional trade-off: when >PollLimit rows share a boundary event_time, the first PollLimit emit, then the 1ns escape advances the cursor past the surplus. Lock-in test: if a future fix recovers the surplus (good!) this test fails loudly so both can update in lock-step instead of regressing silently. TestOracle_BoundaryDedup_NoDuplicates — when CH returns the same boundary row in two consecutive polls (real production case: a new row lands at the same event_time after a previous poll's cursor advanced past it), the seenAtBoundary map must filter the duplicate. Coverage: trigger 71.6% → 73.7% (statement-level). The oracle's value isn't in statement count — it's in property-level coverage on invariants the per-feature tests can't enforce together. Build.bazel: adds oracle_test.go to the existing trigger_test target.
…cle-vm-16cpu) Merge from main re-introduced the deprecated 'oracle-16cpu-64gb-x86-64' label on the Build Release job. That label doesn't resolve in the fork's runner pool, so aeprod22's release workflow sat queued indefinitely. update-gh-artifacts-manifest at L143 already had the '-vm-' variant; align Build Release at L18 to match. Same fix the fork applied earlier in 21d536e for the standalone vizier_release workflow — main keeps regressing it.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/vizier/services/adaptive_export/internal/trigger/oracle_test.go`:
- Around line 351-365: The overflow test has two issues: the timeout branch in
the receive loop only breaks the select, so the outer loop can spin forever, and
the error returned by tr.Subscribe(ctx) is being ignored. Update the Subscribe
setup in oracle_test.go to check the error and fail fast with t.Fatalf, and
change the deadline case in the collected-events loop to explicitly fail the
test or exit the loop so the timeout path cannot hang.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 50bd8567-0709-4cfd-b548-c297ff735a59
📒 Files selected for processing (3)
.github/workflows/vizier_release.yamlsrc/vizier/services/adaptive_export/internal/trigger/BUILD.bazelsrc/vizier/services/adaptive_export/internal/trigger/oracle_test.go
…i.bzl, fork cockpit) The merge of main into ae-followup-auth (0c65751) silently took the upstream version on a swath of files where the fork had previously deviated. This commit reverts the merge's regressions back to the fork-correct state (origin/ae-prod) AND completes the runner-label sweep so every release/mirror/perf workflow uses the same -vm-16cpu label the fork's runner pool actually has. 1. Runner labels — main re-introduced 'oracle-16cpu-64gb-x86-64' and 'oracle-8cpu-32gb-x86-64' on 7 workflows. Fork's pool only resolves 'oracle-vm-16cpu-64gb-x86-64'; both stale labels sit queued forever. Fixed: cli_release.yaml (L18+L212), cloud_release.yaml (L18), mirror_demos.yaml (L12), mirror_deps.yaml (L12), mirror_releases.yaml (L13), operator_release.yaml (L18+L143), perf_common.yaml (L37+L60). vizier_release.yaml L18 was fixed already in 0fd9c3f. 2. bazel/ui.bzl — main reverted PR #64's webpack-build fixes that broke release/cloud/v0.0.10 with 'export: `18': not a valid identifier'. Restored: 'set -x' for action-shell tracing, PATH that puts /opt/px_dev/tools/node/bin FIRST, 'hash -r', the STABLE_BUILD_TAG|BUILD_TIMESTAMP allowlist sed (vs the wildcard that word-splits FORMATTED_DATE), and use_default_shell_env=True so --incompatible_strict_action_env doesn't strip yarn from PATH. 3. 28 fork-cloud config files — main's PR pixie-io#2391 (cert-manager migration) deleted private/cockpit/*, terraform/kubernetes/auth0/*, terraform/kubernetes/cloud_deps/*, .sops.yaml, private/skaffold_cloud.yaml. These are still load- bearing for the AOCC pixie-cloud deployment; the fork hasn't migrated to cert-manager-compatible secrets yet (PR pixie-io#2391's monitor.go fallback path is in place, so adoption is the follow-up, not a blocker). Restored all 28 from origin/ae-prod. Genuine main pickups that were CORRECT to keep (no fix needed): the src/utils/shared/k8s/{apply,delete}.go import-order + sets.New[string] generics migration, src/operator/controllers/monitor.go's cert-manager secret fallback, and the src/carnot/BUILD.bazel + src/carnot/exec/BUILD.bazel additions.
11 TRUE + 2 PARTIAL CodeRabbit comments verified against current code;
all valid. This commit lands every one of them as a discrete change,
keeps the existing tests green, adds new tests where the contract
itself changed.
🔴 Real bugs:
controller/controller.go: handle() now SNAPSHOTS c.active[hash]
before mutation and ROLLS BACK on sink.Write failure. Without
this, a failed persist left c.active extended; an already-running
pushPixieRows would re-snapshot and fan out data based on an
attribution row that never landed in CH. Updated
TestController_SinkErrorNonFatal to pin the new rollback contract
(active==0 after sink error) and added a writeAttempts() observer
on fakeSink so the test doesn't race.
sink/fastencode.go: appendJSONValue now rejects NaN/+Inf/-Inf
floats with errFastEncodeUnsupported (triggers the encoding/json
fallback). Previously strconv.AppendFloat emitted invalid JSON
tokens that made CH reject the whole batch.
streaming/writer.go: flush() keeps the row buffer on write failure
(so the next attempt retries instead of silently dropping), and
the shutdown branch uses context.Background() for the final flush
so it isn't fast-failed by the already-cancelled parent ctx.
control/server.go: decode() now wraps the body in
http.MaxBytesReader(w, body, 4 MiB) so an oversized JSON payload
on the public(-ish) control endpoint can't OOM the operator. The
JWT auth still gates access — this hardens what a holding-an-
acceptable-JWT attacker can do.
🟠 Lower-impact:
cmd/main.go: isOperatorManagedScript matches the EXACT builtin
names (no "ch-" prefix), so a user-authored script named
"ch-something-custom" can't get deleted under
INSTALL_PRESET_SCRIPTS=true.
chhttp/chhttp.go: QueryStream now uses a separate http.Client with
no Timeout. Go's http.Client.Timeout covers body reads, so
reusing the 30s default would silently truncate a multi-MB
active-set rehydrate. Stream callers must bound via ctx.Deadline.
passthrough/passthrough.go: tickConcurrent's precompile-skip branch
now calls l.rec(...) so the compiled and legacy paths produce
identical reconcile-row counts. Previously the compiled path
silently dropped a table → invisible divergence.
trigger/oracle_test.go: COLLECT: labelled break stops the busy-spin
on deadline expiry. Bare "break" only exits the select, leaving
the for-loop to busy-spin until the count condition is reached.
🟡 Cosmetic / cleanup:
control/server_test.go: TestBadInputRejected now pins t_end<=0 on
/export/start AND inverted/zero window on /query — 4 new
assertions for validators that existed but were untested.
pixieapi/pixieapi.go: TODO comment marking the bounded pxapi.Client
leak for follow-up when direct-mode throughput crosses ~1 q/s.
sink/clickhouse.go: "sink: pixie write completed" demoted Info
→ Debug. One log per Pixie batch in fan-out paths was avoidable
log-volume pressure; the silent-drop guard below still fires
loud on the actual failure mode.
sink/integration_test.go: per-table 15s ctx (was a shared 60s for
the entire loop) so a slow early table can't starve later ones.
trigger/clickhouse.go: PollLimit docstring now matches the
0→10000 rewrite in New(); "unlimited" is no longer a documented
option.
Local gates: build exit 0, go test 14/14 packages pass,
tools/linters/lint_like_ci.sh OKAY (0 Errors, 0 Warnings).
…lename-linter The fork's filename-linter (inherited from upstream pixie-io) rejects any PR diff that touches a path containing 'private'. The restored fork-cloud config from the main-merge fix (cb81ecd) added 12 private/cockpit/* + 1 terraform/credentials/cockpit/* files, which upstream-main had deleted via PR pixie-io#2391 — the linter then correctly saw them as 'new private/* additions' and failed. These specific paths are LEGITIMATE fork-cloud deployment config that pre-dates the upstream linter; the original 'no private/* leaks' intent stays in place for everything else via two negative-glob exceptions.
… from filename-linter" This reverts commit 7f43c51.
…ccepts the fork-cloud config upstream Honest fix for the merge-regression: the fork's filename-linter rejects any PR diff touching a path containing 'private' (an upstream-pixie-io guard against secrets leaking into OSS PRs). The fork is a CONTRIBUTOR that upstreams, so the guard applies; the prior 'restore from ae-prod' fix legitimately re-added private/cockpit/* (which upstream-main pixie-io#2391 deleted via cert-manager migration) but tripped the linter. Reverts the lazy carve-out fix (c579e48 reverted 7f43c51) and does the structural rename instead. Moves: private/cockpit/ -> cockpit/ private/skaffold_cloud.yaml -> skaffold/skaffold_cloud.yaml (collapsed with the upstream- resurrected skaffold/skaffold_cloud.yaml — the fork's private/ version was the maintained superset) Reference updates: cockpit/kustomization.yaml ../../k8s/cloud/... -> ../k8s/cloud/... (depth-1 after the rename) skaffold/skaffold_cloud.yaml:84 - private/cockpit -> - cockpit Linter intent preserved: any future PR that touches a real 'private/*' path still fails — the guard's correct semantics are restored.
There was a problem hiding this comment.
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/vizier/services/adaptive_export/internal/trigger/clickhouse.go (1)
322-338: 🗄️ Data Integrity & Integration | 🔴 Critical | 🏗️ Heavy liftDo not advance past the boundary timestamp without a total row order.
Line 401 only orders by normalized
event_time. That means a full page of already-seen rows at the current watermark does not prove there are no unseen rows with the same timestamp after theLIMIT. Bumping the watermark by+1nshere can therefore drop real rows permanently whenever more thanPollLimitrows share the boundary timestamp.This needs a stable tie-breaker in the query/state (for example, page by
event_timeplus a secondary key/fingerprint) instead of skipping to the next timestamp.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go` around lines 322 - 338, The paging escape in the ClickHouse trigger advances the watermark past a boundary timestamp without a total row order, which can permanently skip unseen rows. Update the polling logic in the trigger code path around the boundary handling to avoid incrementing the watermark by 1ns when the query is only ordered by normalized event_time. Instead, add a stable secondary tie-breaker to the pagination/state in the query and cursor logic used by the trigger so rows at the same timestamp can be paged deterministically without dropping data.src/vizier/services/adaptive_export/internal/chhttp/chhttp.go (1)
107-122: 🩺 Stability & Availability | 🟠 MajorAdd a timeout to the boot-time
Rehydratecall.cmd/main.go:443passes the long-lived service context (context.WithCancel(context.Background())) intoctl.Rehydrate, so the new no-timeoutQueryStreampath can block startup indefinitely if ClickHouse stalls. Wrap that call in a deadline-bound context.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/chhttp/chhttp.go` around lines 107 - 122, The boot-time ctl.Rehydrate call is using a long-lived service context, so the new QueryStream path can block startup forever if ClickHouse stalls. Update the startup flow around ctl.Rehydrate to create and pass a deadline-bound context instead of the background-derived cancel-only context, and ensure the timeout is scoped just to rehydration. Use the existing ctl.Rehydrate call site and the QueryStream behavior in chhttp.Client to locate the fix.
♻️ Duplicate comments (3)
src/vizier/services/adaptive_export/internal/trigger/oracle_test.go (1)
351-351: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winCheck the
Subscribesetup error.Line 351 still discards the error from
tr.Subscribe(ctx), so a setup failure gets reported as a later collection timeout instead of failing fast at the source.Suggested fix
- ch, _ := tr.Subscribe(ctx) + ch, err := tr.Subscribe(ctx) + if err != nil { + t.Fatalf("Subscribe: %v", err) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/trigger/oracle_test.go` at line 351, The test setup in `Subscribe` is still ignoring the returned error, which can hide a subscription failure until a later timeout. Update the `tr.Subscribe(ctx)` call in `oracle_test` to handle the error immediately and fail the test at the source, using the existing `tr` and `Subscribe` flow rather than continuing with a potentially nil channel.src/vizier/services/adaptive_export/internal/controller/controller.go (1)
344-382: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftDo not publish unpersisted attribution state before
Sink.Writesucceeds.The rollback happens after the failed write, but Lines 344-353 already mutate
c.active; an existingpushPixieRowsloop can re-snapshot that unpersisted extension whileSink.Writeis still pending and fan out rows against a phantom attribution anchor. Stage the updated row in a local copy, write it first, then commit it toc.activeunder the mutex only after persistence succeeds.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/controller/controller.go` around lines 344 - 382, The attribution update in controller logic is mutating c.active before Sink.Write succeeds, which can let pushPixieRows observe and fan out an unpersisted snapshot. Fix the flow in the controller method that handles the row update by staging the modified AttributionRow in a local copy, calling c.sink.Write first, and only committing the new state into c.active under c.mu after the write succeeds; keep the rollback path for failures, including the inFlight bookkeeping, unchanged.src/vizier/services/adaptive_export/internal/control/server_test.go (1)
181-189: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winAdd a non-positive window timestamp case.
These cases pin inverted and zero-length windows, but not the new
Window[0] <= 0 || Window[1] <= 0contract inhandleQuery.Proposed test addition
if r := do(t, srv, http.MethodPost, "/query", `{"pod":"p","table":"http_events","query_id":"x","window":[5,5]}`); r.StatusCode != http.StatusBadRequest { t.Fatalf("query zero-window = %d, want 400", r.StatusCode) } + if r := do(t, srv, http.MethodPost, "/query", + `{"pod":"p","table":"http_events","query_id":"x","window":[0,5]}`); r.StatusCode != http.StatusBadRequest { + t.Fatalf("query non-positive-window = %d, want 400", r.StatusCode) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/control/server_test.go` around lines 181 - 189, The query tests in server_test.go cover inverted and zero-length windows, but they do not exercise the new non-positive timestamp validation in handleQuery. Add a test case in the existing /query checks using do(...) with a window where either Window[0] or Window[1] is non-positive, and assert it returns http.StatusBadRequest; place it alongside the current window validation cases so the handleQuery contract is pinned by the tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/mirror_demos.yaml:
- Line 12: The workflow job’s custom runner label was renamed to
oracle-vm-16cpu-64gb-x86-64, but actionlint will still fail until the shared
custom runner label list is updated to match. Update the actionlint
configuration that defines allowed self-hosted labels so it includes the new
label used by the workflow’s runs-on entry, keeping both references in sync.
In `@bazel/ui.bzl`:
- Around line 78-79: Guard the stamp export logic in the shell command built by
the `ui` rule so missing `STABLE_BUILD_TAG` or `BUILD_TIMESTAMP` entries do not
remove the leading command and break the `&&` chain before `yarn build_prod`.
Update the two `sed` expansions that reference `ctx.info_file.path` and
`ctx.version_file.path` to a no-op-safe form, such as wrapping each in `eval
"$(sed ...)"`, so the command still succeeds when a key is absent.
In `@cockpit/cloud_ingress.yaml`:
- Around line 7-16: The ingress route for cloud-proxy-service:443 defines the
backend transport but never applies it, so the serverName override is ignored.
Update cockpit/cloud_ingress.yaml to attach the Traefik service.serverstransport
annotation on the / backend route and point it to cloud-backend-transport, using
the existing ingress resource so the backend actually picks up the transport
settings.
In `@cockpit/plugin_db_updater_job.yaml`:
- Around line 15-90: Harden the Job spec for the updater workload by setting
restrictive pod/security defaults before it consumes PL_POSTGRES_PASSWORD and
PL_JWT_SIGNING_KEY. Update the Job template in plugin_db_updater_job.yaml to
disable root execution and privilege escalation, drop unnecessary Linux
capabilities, and prevent automatic service-account token mounting; apply these
settings at the pod-level and, if needed, reinforce them on the postgres-wait
and updater containers.
In `@cockpit/proxy_envoy.yaml`:
- Around line 43-50: The CORS configuration in the Envoy proxy is using a
suffix-based origin match, which is broader than the single approved host and
can match unintended subdomains. Update the cors allow_origin_string_match
setting to use an exact origin match for the same host configured in
cockpit/domain_config.yaml, and keep the rest of the cors settings unchanged so
credentialed requests are only allowed from that one origin.
- Around line 87-96: The UpstreamTlsContext for api_service only configures a
client certificate and does not verify the backend server identity. Update the
envoy transport socket configuration to add a validation_context with the
trusted CA for the upstream certificate chain and set the appropriate
sni/hostname so Envoy validates the server cert before proxying traffic. Keep
the fix localized to the UpstreamTlsContext/common_tls_context block so the
server authentication settings are applied alongside the existing
tls_certificates.
In `@src/vizier/services/adaptive_export/internal/control/server.go`:
- Around line 185-188: The decode helper currently accepts a valid first JSON
value and ignores any trailing bytes, so malformed control requests can slip
through. Update decode in server.go to use the existing json.Decoder on r.Body,
then verify there is no extra non-whitespace input remaining after the first
Decode before returning success. If any trailing junk is present, make the
request fail so the caller returns a 400.
In `@terraform/kubernetes/auth0/.terraform.lock.hcl`:
- Around line 4-6: The Auth0 provider is still allowed to resolve from a
too-broad version range, which can let lockfile regeneration pull in an untested
major. Update the required provider constraint in auth0_main.tf for auth0/auth0
to the validated series used by the lock entry, then regenerate the lockfile so
.terraform.lock.hcl stays aligned with that pinned range.
In `@terraform/kubernetes/auth0/auth0_main.tf`:
- Line 12: Disable the Auth0 provider debug logging in the committed Terraform
config by removing or setting the debug flag to false in the auth0 provider
block. Update the auth0_main.tf provider configuration so debug is not enabled
by default, keeping any temporary troubleshooting setting out of checked-in
code.
- Line 6: The Auth0 provider constraint is still too open-ended, so update the
provider version requirement in the auth0_main configuration to a bounded range
instead of only a minimum. Use the existing provider block that sets the
auth0/auth0 version and narrow it to a tested range such as a compatible patch
range or an explicit upper bound below the next major release.
In `@terraform/kubernetes/auth0/generated.tf`:
- Around line 1702-1715: The non_interactive Auth0 client is exposing
browser-based grant types that should not be enabled for this M2M app. Update
the grant_types configuration in the generated Auth0 client definition to keep
only client_credentials, and remove implicit, authorization_code, and
refresh_token unless there is a clear need for browser flows. Use the
non_interactive client block in the generated Terraform output to locate the
change.
In `@terraform/kubernetes/auth0/variables.tf`:
- Line 1: The cloud_domain variable is currently untyped and unvalidated, so
misconfiguration can slip through until runtime. Update the cloud_domain
variable in variables.tf to declare it as a string and add validation that
rejects empty values (and any other invalid format checks you deem appropriate
for callback URL usage). Keep the validation close to the variable definition so
the failure happens early when Terraform plans or applies.
In `@terraform/kubernetes/cloud_deps/base_cloud_deps.tf`:
- Around line 1-4: The kustomization_build data source for the elastic base
dependency is hardcoded to a checkout-specific relative path, which will break
when Terraform runs from a different working directory in CI. Update the elastic
data block to use a parameterized source path from a variable or shared local
used by the cloud_deps Terraform config, and keep the existing data
"kustomization_build" "elastic" identifier so the new path can be reused across
environments.
In `@terraform/kubernetes/cloud_deps/cert_manager.tf`:
- Around line 1-4: The local cert_config is indexing
var.cert_details[var.cloud_domain] too early, so a missing domain triggers an
Invalid index before the custom validation in the cert_manager logic can run.
Update the cert_manager locals/validation flow by using lookup or try when
setting cert_config, or otherwise defer the access into variable validation so
the existing guard for cloud_domain/cert_details can execute first.
In `@terraform/kubernetes/cloud_deps/core_resource_deps.tf`:
- Around line 7-110: The kubernetes_secret_v1 resources are storing sensitive
values in Terraform state via the data fields in cloud_auth, cloud_session,
db_secrets, and cloud_auth0. Refactor the secret generation and secret creation
flow so these values are managed outside Terraform state, either by moving them
to a dedicated secret manager or by provisioning the Kubernetes secrets
out-of-band and only referencing their names in this module; keep the focus on
the data.external sources and the kubernetes_secret_v1 resources involved.
In `@terraform/kubernetes/cloud_deps/env_cloud_deps.tf`:
- Around line 1-4: The kustomization_build data source in env_deps is using a
hardcoded relative path that only works for one local checkout layout. Update
the path resolution in env_cloud_deps.tf so it is derived from the
module/working context instead of a fixed "../../../k8s/cloud_deps/public"
relative path, and make sure the env_deps data block still points to the same
k8s/cloud_deps/public manifests across local runs and Terraform pipelines.
In `@terraform/kubernetes/cloud_deps/provider.tf`:
- Around line 14-22: The kubernetes and kustomization provider blocks are
hardcoded to a developer-specific kubeconfig and context, which prevents
portable auth. Update the provider configuration to use configurable inputs or
environment-based discovery instead of fixed values, and adjust the provider
definitions in provider.tf so they can work in CI and on other machines without
the cockpit-config path or default context being baked in.
---
Outside diff comments:
In `@src/vizier/services/adaptive_export/internal/chhttp/chhttp.go`:
- Around line 107-122: The boot-time ctl.Rehydrate call is using a long-lived
service context, so the new QueryStream path can block startup forever if
ClickHouse stalls. Update the startup flow around ctl.Rehydrate to create and
pass a deadline-bound context instead of the background-derived cancel-only
context, and ensure the timeout is scoped just to rehydration. Use the existing
ctl.Rehydrate call site and the QueryStream behavior in chhttp.Client to locate
the fix.
In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go`:
- Around line 322-338: The paging escape in the ClickHouse trigger advances the
watermark past a boundary timestamp without a total row order, which can
permanently skip unseen rows. Update the polling logic in the trigger code path
around the boundary handling to avoid incrementing the watermark by 1ns when the
query is only ordered by normalized event_time. Instead, add a stable secondary
tie-breaker to the pagination/state in the query and cursor logic used by the
trigger so rows at the same timestamp can be paged deterministically without
dropping data.
---
Duplicate comments:
In `@src/vizier/services/adaptive_export/internal/control/server_test.go`:
- Around line 181-189: The query tests in server_test.go cover inverted and
zero-length windows, but they do not exercise the new non-positive timestamp
validation in handleQuery. Add a test case in the existing /query checks using
do(...) with a window where either Window[0] or Window[1] is non-positive, and
assert it returns http.StatusBadRequest; place it alongside the current window
validation cases so the handleQuery contract is pinned by the tests.
In `@src/vizier/services/adaptive_export/internal/controller/controller.go`:
- Around line 344-382: The attribution update in controller logic is mutating
c.active before Sink.Write succeeds, which can let pushPixieRows observe and fan
out an unpersisted snapshot. Fix the flow in the controller method that handles
the row update by staging the modified AttributionRow in a local copy, calling
c.sink.Write first, and only committing the new state into c.active under c.mu
after the write succeeds; keep the rollback path for failures, including the
inFlight bookkeeping, unchanged.
In `@src/vizier/services/adaptive_export/internal/trigger/oracle_test.go`:
- Line 351: The test setup in `Subscribe` is still ignoring the returned error,
which can hide a subscription failure until a later timeout. Update the
`tr.Subscribe(ctx)` call in `oracle_test` to handle the error immediately and
fail the test at the source, using the existing `tr` and `Subscribe` flow rather
than continuing with a potentially nil channel.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: ffd11bf6-28d9-4236-ba9a-43653a5d3192
⛔ Files ignored due to path filters (1)
terraform/kubernetes/cloud_deps/terraform.tfstateis excluded by!**/*.tfstate
📒 Files selected for processing (49)
.github/workflows/cli_release.yaml.github/workflows/cloud_release.yaml.github/workflows/mirror_demos.yaml.github/workflows/mirror_deps.yaml.github/workflows/mirror_releases.yaml.github/workflows/operator_release.yaml.github/workflows/perf_common.yaml.sops.yamlbazel/ui.bzlcockpit/artifact_config.yamlcockpit/auth_deployment.yamlcockpit/cloud_ingress.yamlcockpit/domain_config.yamlcockpit/kustomization.yamlcockpit/oauth_config.yamlcockpit/plugin_db_updater_job.yamlcockpit/proxy_envoy.yamlcockpit/script_bundles_config.yamlcockpit/servers_transport.yamlskaffold/skaffold_cloud.yamlsrc/vizier/services/adaptive_export/cmd/main.gosrc/vizier/services/adaptive_export/internal/chhttp/chhttp.gosrc/vizier/services/adaptive_export/internal/control/server.gosrc/vizier/services/adaptive_export/internal/control/server_test.gosrc/vizier/services/adaptive_export/internal/controller/controller.gosrc/vizier/services/adaptive_export/internal/controller/controller_test.gosrc/vizier/services/adaptive_export/internal/passthrough/passthrough.gosrc/vizier/services/adaptive_export/internal/pixieapi/pixieapi.gosrc/vizier/services/adaptive_export/internal/sink/clickhouse.gosrc/vizier/services/adaptive_export/internal/sink/fastencode.gosrc/vizier/services/adaptive_export/internal/sink/integration_test.gosrc/vizier/services/adaptive_export/internal/streaming/writer.gosrc/vizier/services/adaptive_export/internal/trigger/clickhouse.gosrc/vizier/services/adaptive_export/internal/trigger/oracle_test.goterraform/credentials/cockpit/auth0_config.yamlterraform/kubernetes/auth0/.terraform.lock.hclterraform/kubernetes/auth0/auth0_import.tfterraform/kubernetes/auth0/auth0_main.tfterraform/kubernetes/auth0/generated.tfterraform/kubernetes/auth0/terraform.tfvarsterraform/kubernetes/auth0/variables.tfterraform/kubernetes/cloud_deps/.terraform.lock.hclterraform/kubernetes/cloud_deps/base_cloud_deps.tfterraform/kubernetes/cloud_deps/cert_manager.tfterraform/kubernetes/cloud_deps/core_resource_deps.tfterraform/kubernetes/cloud_deps/env_cloud_deps.tfterraform/kubernetes/cloud_deps/provider.tfterraform/kubernetes/cloud_deps/variables.tfterraform/kubernetes/scripts/create_random_bytes.sh
| contents: read | ||
| packages: write | ||
| runs-on: oracle-16cpu-64gb-x86-64 | ||
| runs-on: oracle-vm-16cpu-64gb-x86-64 |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Keep actionlint’s custom runner labels in sync.
This rename looks intentional, but Line 12 will keep failing workflow lint until the custom self-hosted label list is updated alongside it. The same oracle-vm-16cpu-64gb-x86-64 label is already referenced elsewhere, so the root cause is the shared actionlint config, not this job definition.
🧰 Tools
🪛 actionlint (1.7.12)
[error] 12-12: label "oracle-vm-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/mirror_demos.yaml at line 12, The workflow job’s custom
runner label was renamed to oracle-vm-16cpu-64gb-x86-64, but actionlint will
still fail until the shared custom runner label list is updated to match. Update
the actionlint configuration that defines allowed self-hosted labels so it
includes the new label used by the workflow’s runs-on entry, keeping both
references in sync.
Source: Linters/SAST tools
| '$(sed -E -n "s/^(STABLE_BUILD_TAG|BUILD_TIMESTAMP)\\s+(.*)/export \\1=\\2/p" "{}")'.format(ctx.info_file.path), | ||
| '$(sed -E -n "s/^(STABLE_BUILD_TAG|BUILD_TIMESTAMP)\\s+(.*)/export \\1=\\2/p" "{}")'.format(ctx.version_file.path), |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the file and inspect the relevant section with line numbers.
git ls-files | rg '^bazel/ui\.bzl$'
wc -l bazel/ui.bzl
sed -n '1,170p' bazel/ui.bzl | cat -nRepository: k8sstormcenter/pixie
Length of output: 7423
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for the exact sed snippets and any related command construction.
rg -n 'STABLE_BUILD_TAG|BUILD_TIMESTAMP|sed -E -n|yarn build_prod|&&' bazel/ui.bzl . -g '!**/node_modules/**'Repository: k8sstormcenter/pixie
Length of output: 50377
Guard the stamp exports against missing keys
At bazel/ui.bzl:78-79, a missing key makes the leading $(sed ...) disappear and breaks the &&-joined shell script before yarn build_prod runs. Wrap each expansion in eval "$(sed ...)" or another no-op-safe form.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@bazel/ui.bzl` around lines 78 - 79, Guard the stamp export logic in the shell
command built by the `ui` rule so missing `STABLE_BUILD_TAG` or
`BUILD_TIMESTAMP` entries do not remove the leading command and break the `&&`
chain before `yarn build_prod`. Update the two `sed` expansions that reference
`ctx.info_file.path` and `ctx.version_file.path` to a no-op-safe form, such as
wrapping each in `eval "$(sed ...)"`, so the command still succeeds when a key
is absent.
| annotations: | ||
| external-dns.alpha.kubernetes.io/hostname: >- | ||
| test.austrianopencloudcommunity.org,work.test.austrianopencloudcommunity.org | ||
| cert-manager.io/cluster-issuer: "letsencrypt-prod" | ||
| spec: | ||
| tls: | ||
| - hosts: | ||
| - test.austrianopencloudcommunity.org | ||
| - work.test.austrianopencloudcommunity.org | ||
| secretName: cloud-proxy-tls-certs |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "### transport definition"
sed -n '1,120p' cockpit/servers_transport.yaml
echo "\n### ingress annotations and backends"
sed -n '1,160p' cockpit/cloud_ingress.yaml
echo "\n### repo-wide transport references"
rg -n 'cloud-backend-transport|serverstransport|serversscheme|ServersTransport' cockpitRepository: k8sstormcenter/pixie
Length of output: 2593
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "### broad repo references to cloud-backend-transport"
rg -n 'cloud-backend-transport|ServersTransport|serverstransport|serversTransport|transport' .
echo "\n### candidate service definition"
sed -n '1,220p' cockpit/cloud_proxy_service.yaml 2>/dev/null || true
echo "\n### kustomization context"
sed -n '1,120p' cockpit/kustomization.yamlRepository: k8sstormcenter/pixie
Length of output: 50377
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "### broad repo references to cloud-backend-transport"
rg -n 'cloud-backend-transport|ServersTransport|serverstransport|serversTransport|transport' .
echo "\n### candidate service definition"
sed -n '1,220p' cockpit/cloud_proxy_service.yaml 2>/dev/null || true
echo "\n### kustomization context"
sed -n '1,120p' cockpit/kustomization.yamlRepository: k8sstormcenter/pixie
Length of output: 50377
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,220p' cockpit/cloud_proxy_service.yaml 2>/dev/null || true
printf '\n### traefik annotations in cockpit manifests\n'
rg -n 'traefik\.ingress\.kubernetes\.io|serverstransport|ServersTransport' cockpit/*.yaml cockpit/*.ymlRepository: k8sstormcenter/pixie
Length of output: 371
Attach cloud-backend-transport to the / backend
cockpit/cloud_ingress.yaml defines the transport, but the Ingress never references it. Add the Traefik service.serverstransport annotation to the cloud-proxy-service:443 route, otherwise the serverName override is ignored.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cockpit/cloud_ingress.yaml` around lines 7 - 16, The ingress route for
cloud-proxy-service:443 defines the backend transport but never applies it, so
the serverName override is ignored. Update cockpit/cloud_ingress.yaml to attach
the Traefik service.serverstransport annotation on the / backend route and point
it to cloud-backend-transport, using the existing ingress resource so the
backend actually picks up the transport settings.
| spec: | ||
| initContainers: | ||
| - name: postgres-wait | ||
| # TODO(ddelnano): This image was rebuilt from https://github.com/docker-library/postgres 14-alpine | ||
| # to remediate a critical vulnerability. Switch back to upstream once fixed. | ||
| # yamllint disable-line rule:line-length | ||
| image: ghcr.io/pixie-io/postgres:14-alpine-pl1@sha256:237c5fcf79b230979e12fe02f46e0ad29565b4ecb7cb15047197cbb9a6549e8d | ||
| command: ['sh', '-c', | ||
| 'until pg_isready -h ${PL_POSTGRES_HOSTNAME} -p ${PL_POSTGRES_PORT}; do | ||
| echo "waiting for postgres"; | ||
| sleep 2; | ||
| done;'] | ||
| envFrom: | ||
| - configMapRef: | ||
| name: pl-db-config | ||
| containers: | ||
| - name: updater | ||
| image: cloud-plugin-load_db:latest | ||
| envFrom: | ||
| - configMapRef: | ||
| name: pl-db-config | ||
| - configMapRef: | ||
| name: pl-domain-config | ||
| - configMapRef: | ||
| name: pl-tls-config | ||
| volumeMounts: | ||
| - name: certs | ||
| mountPath: /certs | ||
| env: | ||
| - name: PL_POSTGRES_USERNAME | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: pl-db-secrets | ||
| key: PL_POSTGRES_USERNAME | ||
| - name: PL_POSTGRES_PASSWORD | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: pl-db-secrets | ||
| key: PL_POSTGRES_PASSWORD | ||
| - name: PL_JWT_SIGNING_KEY | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: cloud-auth-secrets | ||
| key: jwt-signing-key | ||
| - name: PL_PLUGIN_SERVICE | ||
| valueFrom: | ||
| configMapKeyRef: | ||
| name: pl-service-config | ||
| key: PL_PLUGIN_SERVICE | ||
| - name: PL_PLUGIN_REPO | ||
| value: "k8sstormcenter/pixie-plugin" | ||
| restartPolicy: Never | ||
| volumes: | ||
| - name: pl-db-secrets | ||
| secret: | ||
| secretName: pl-db-secrets | ||
| - name: certs | ||
| projected: | ||
| sources: | ||
| - secret: | ||
| name: service-tls-server-certs | ||
| items: | ||
| - key: tls.crt | ||
| path: server.crt | ||
| - key: tls.key | ||
| path: server.key | ||
| - key: ca.crt | ||
| path: ca.crt | ||
| - secret: | ||
| name: service-tls-client-certs | ||
| items: | ||
| - key: tls.crt | ||
| path: client.crt | ||
| - key: tls.key | ||
| path: client.key | ||
| backoffLimit: 1 |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Harden this Job before mounting DB and JWT credentials into it.
Both containers inherit the runtime defaults here, so root execution, privilege escalation, and service-account token mounting stay enabled unless the cluster overrides them. That is a poor default for a pod that consumes PL_POSTGRES_PASSWORD and PL_JWT_SIGNING_KEY.
Suggested hardening
spec:
ttlSecondsAfterFinished: 10
template:
metadata:
name: plugin-db-updater-job
labels:
jobgroup: plugin-db-updater
spec:
+ automountServiceAccountToken: false
initContainers:
- name: postgres-wait
+ securityContext:
+ allowPrivilegeEscalation: false
+ capabilities:
+ drop: ["ALL"]
+ runAsNonRoot: true
+ seccompProfile:
+ type: RuntimeDefault
image: ghcr.io/pixie-io/postgres:14-alpine-pl1@sha256:237c5fcf79b230979e12fe02f46e0ad29565b4ecb7cb15047197cbb9a6549e8d
command: ['sh', '-c',
'until pg_isready -h ${PL_POSTGRES_HOSTNAME} -p ${PL_POSTGRES_PORT}; do
echo "waiting for postgres";
sleep 2;
done;']
envFrom:
- configMapRef:
name: pl-db-config
containers:
- name: updater
+ securityContext:
+ allowPrivilegeEscalation: false
+ capabilities:
+ drop: ["ALL"]
+ runAsNonRoot: true
+ seccompProfile:
+ type: RuntimeDefault
image: cloud-plugin-load_db:latest📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| spec: | |
| initContainers: | |
| - name: postgres-wait | |
| # TODO(ddelnano): This image was rebuilt from https://github.com/docker-library/postgres 14-alpine | |
| # to remediate a critical vulnerability. Switch back to upstream once fixed. | |
| # yamllint disable-line rule:line-length | |
| image: ghcr.io/pixie-io/postgres:14-alpine-pl1@sha256:237c5fcf79b230979e12fe02f46e0ad29565b4ecb7cb15047197cbb9a6549e8d | |
| command: ['sh', '-c', | |
| 'until pg_isready -h ${PL_POSTGRES_HOSTNAME} -p ${PL_POSTGRES_PORT}; do | |
| echo "waiting for postgres"; | |
| sleep 2; | |
| done;'] | |
| envFrom: | |
| - configMapRef: | |
| name: pl-db-config | |
| containers: | |
| - name: updater | |
| image: cloud-plugin-load_db:latest | |
| envFrom: | |
| - configMapRef: | |
| name: pl-db-config | |
| - configMapRef: | |
| name: pl-domain-config | |
| - configMapRef: | |
| name: pl-tls-config | |
| volumeMounts: | |
| - name: certs | |
| mountPath: /certs | |
| env: | |
| - name: PL_POSTGRES_USERNAME | |
| valueFrom: | |
| secretKeyRef: | |
| name: pl-db-secrets | |
| key: PL_POSTGRES_USERNAME | |
| - name: PL_POSTGRES_PASSWORD | |
| valueFrom: | |
| secretKeyRef: | |
| name: pl-db-secrets | |
| key: PL_POSTGRES_PASSWORD | |
| - name: PL_JWT_SIGNING_KEY | |
| valueFrom: | |
| secretKeyRef: | |
| name: cloud-auth-secrets | |
| key: jwt-signing-key | |
| - name: PL_PLUGIN_SERVICE | |
| valueFrom: | |
| configMapKeyRef: | |
| name: pl-service-config | |
| key: PL_PLUGIN_SERVICE | |
| - name: PL_PLUGIN_REPO | |
| value: "k8sstormcenter/pixie-plugin" | |
| restartPolicy: Never | |
| volumes: | |
| - name: pl-db-secrets | |
| secret: | |
| secretName: pl-db-secrets | |
| - name: certs | |
| projected: | |
| sources: | |
| - secret: | |
| name: service-tls-server-certs | |
| items: | |
| - key: tls.crt | |
| path: server.crt | |
| - key: tls.key | |
| path: server.key | |
| - key: ca.crt | |
| path: ca.crt | |
| - secret: | |
| name: service-tls-client-certs | |
| items: | |
| - key: tls.crt | |
| path: client.crt | |
| - key: tls.key | |
| path: client.key | |
| backoffLimit: 1 | |
| spec: | |
| automountServiceAccountToken: false | |
| initContainers: | |
| - name: postgres-wait | |
| securityContext: | |
| allowPrivilegeEscalation: false | |
| capabilities: | |
| drop: ["ALL"] | |
| runAsNonRoot: true | |
| seccompProfile: | |
| type: RuntimeDefault | |
| # TODO(ddelnano): This image was rebuilt from https://github.com/docker-library/postgres 14-alpine | |
| # to remediate a critical vulnerability. Switch back to upstream once fixed. | |
| # yamllint disable-line rule:line-length | |
| image: ghcr.io/pixie-io/postgres:14-alpine-pl1@sha256:237c5fcf79b230979e12fe02f46e0ad29565b4ecb7cb15047197cbb9a6549e8d | |
| command: ['sh', '-c', | |
| 'until pg_isready -h ${PL_POSTGRES_HOSTNAME} -p ${PL_POSTGRES_PORT}; do | |
| echo "waiting for postgres"; | |
| sleep 2; | |
| done;'] | |
| envFrom: | |
| - configMapRef: | |
| name: pl-db-config | |
| containers: | |
| - name: updater | |
| securityContext: | |
| allowPrivilegeEscalation: false | |
| capabilities: | |
| drop: ["ALL"] | |
| runAsNonRoot: true | |
| seccompProfile: | |
| type: RuntimeDefault | |
| image: cloud-plugin-load_db:latest | |
| envFrom: | |
| - configMapRef: | |
| name: pl-db-config | |
| - configMapRef: | |
| name: pl-domain-config | |
| - configMapRef: | |
| name: pl-tls-config | |
| volumeMounts: | |
| - name: certs | |
| mountPath: /certs | |
| env: | |
| - name: PL_POSTGRES_USERNAME | |
| valueFrom: | |
| secretKeyRef: | |
| name: pl-db-secrets | |
| key: PL_POSTGRES_USERNAME | |
| - name: PL_POSTGRES_PASSWORD | |
| valueFrom: | |
| secretKeyRef: | |
| name: pl-db-secrets | |
| key: PL_POSTGRES_PASSWORD | |
| - name: PL_JWT_SIGNING_KEY | |
| valueFrom: | |
| secretKeyRef: | |
| name: cloud-auth-secrets | |
| key: jwt-signing-key | |
| - name: PL_PLUGIN_SERVICE | |
| valueFrom: | |
| configMapKeyRef: | |
| name: pl-service-config | |
| key: PL_PLUGIN_SERVICE | |
| - name: PL_PLUGIN_REPO | |
| value: "k8sstormcenter/pixie-plugin" | |
| restartPolicy: Never | |
| volumes: | |
| - name: pl-db-secrets | |
| secret: | |
| secretName: pl-db-secrets | |
| - name: certs | |
| projected: | |
| sources: | |
| - secret: | |
| name: service-tls-server-certs | |
| items: | |
| - key: tls.crt | |
| path: server.crt | |
| - key: tls.key | |
| path: server.key | |
| - key: ca.crt | |
| path: ca.crt | |
| - secret: | |
| name: service-tls-client-certs | |
| items: | |
| - key: tls.crt | |
| path: client.crt | |
| - key: tls.key | |
| path: client.key | |
| backoffLimit: 1 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cockpit/plugin_db_updater_job.yaml` around lines 15 - 90, Harden the Job spec
for the updater workload by setting restrictive pod/security defaults before it
consumes PL_POSTGRES_PASSWORD and PL_JWT_SIGNING_KEY. Update the Job template in
plugin_db_updater_job.yaml to disable root execution and privilege escalation,
drop unnecessary Linux capabilities, and prevent automatic service-account token
mounting; apply these settings at the pod-level and, if needed, reinforce them
on the postgres-wait and updater containers.
Source: Linters/SAST tools
| cors: | ||
| allow_origin_string_match: | ||
| - suffix: "test.austrianopencloudcommunity.org" | ||
| allow_methods: GET, PUT, DELETE, POST, OPTIONS | ||
| allow_headers: content-type,x-grpc-web,x-user-agent | ||
| max_age: "1728000" | ||
| expose_headers: grpc-status,grpc-message,grpc-timeout | ||
| allow_credentials: true |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Use an exact CORS origin match here.
suffix: "test.austrianopencloudcommunity.org" is wider than the single host configured in cockpit/domain_config.yaml and will also match origins like https://eviltest.austrianopencloudcommunity.org. With allow_credentials: true, that broadens who can make credentialed browser requests.
Suggested fix
cors:
allow_origin_string_match:
- - suffix: "test.austrianopencloudcommunity.org"
+ - exact: "https://test.austrianopencloudcommunity.org"
allow_methods: GET, PUT, DELETE, POST, OPTIONS📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cors: | |
| allow_origin_string_match: | |
| - suffix: "test.austrianopencloudcommunity.org" | |
| allow_methods: GET, PUT, DELETE, POST, OPTIONS | |
| allow_headers: content-type,x-grpc-web,x-user-agent | |
| max_age: "1728000" | |
| expose_headers: grpc-status,grpc-message,grpc-timeout | |
| allow_credentials: true | |
| cors: | |
| allow_origin_string_match: | |
| - exact: "https://test.austrianopencloudcommunity.org" | |
| allow_methods: GET, PUT, DELETE, POST, OPTIONS | |
| allow_headers: content-type,x-grpc-web,x-user-agent | |
| max_age: "1728000" | |
| expose_headers: grpc-status,grpc-message,grpc-timeout | |
| allow_credentials: true |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cockpit/proxy_envoy.yaml` around lines 43 - 50, The CORS configuration in the
Envoy proxy is using a suffix-based origin match, which is broader than the
single approved host and can match unintended subdomains. Update the cors
allow_origin_string_match setting to use an exact origin match for the same host
configured in cockpit/domain_config.yaml, and keep the rest of the cors settings
unchanged so credentialed requests are only allowed from that one origin.
| data "kustomization_build" "elastic" { | ||
| # TODO(ddelnano): This will need to be updated for the terraform Azure pipeline | ||
| path = "../../../k8s/cloud_deps/base/elastic/operator" | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Parameterize the kustomize source path before this is used in CI.
This only works from the current checkout layout. The TODO already calls out the Azure pipeline case; if Terraform runs from a different working directory, kustomization_build will fail before any base deps are applied.
Suggested fix
data "kustomization_build" "elastic" {
- # TODO(ddelnano): This will need to be updated for the terraform Azure pipeline
- path = "../../../k8s/cloud_deps/base/elastic/operator"
+ path = abspath("${path.module}/../../../k8s/cloud_deps/base/elastic/operator")
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| data "kustomization_build" "elastic" { | |
| # TODO(ddelnano): This will need to be updated for the terraform Azure pipeline | |
| path = "../../../k8s/cloud_deps/base/elastic/operator" | |
| } | |
| data "kustomization_build" "elastic" { | |
| path = abspath("${path.module}/../../../k8s/cloud_deps/base/elastic/operator") | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@terraform/kubernetes/cloud_deps/base_cloud_deps.tf` around lines 1 - 4, The
kustomization_build data source for the elastic base dependency is hardcoded to
a checkout-specific relative path, which will break when Terraform runs from a
different working directory in CI. Update the elastic data block to use a
parameterized source path from a variable or shared local used by the cloud_deps
Terraform config, and keep the existing data "kustomization_build" "elastic"
identifier so the new path can be reused across environments.
| locals { | ||
| cert_subdomain = var.cloud_domain | ||
| cert_config = var.cert_details[var.cloud_domain] | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
The cert_details guard never gets a chance to run.
Line 3 indexes var.cert_details[var.cloud_domain] while evaluating locals, so a missing domain fails with an Invalid index before Line 49's custom error can fire. Use lookup/try in the local, or move this to variable validation.
Suggested fix
locals {
cert_subdomain = var.cloud_domain
- cert_config = var.cert_details[var.cloud_domain]
+ cert_config = lookup(var.cert_details, var.cloud_domain, null)
}
@@
lifecycle {
precondition {
- condition = contains(keys(var.cert_details), var.cloud_domain)
+ condition = local.cert_config != null
error_message = "cloud_domain '${var.cloud_domain}' must have a corresponding entry in cert_details."
}
}
}Also applies to: 48-53
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@terraform/kubernetes/cloud_deps/cert_manager.tf` around lines 1 - 4, The
local cert_config is indexing var.cert_details[var.cloud_domain] too early, so a
missing domain triggers an Invalid index before the custom validation in the
cert_manager logic can run. Update the cert_manager locals/validation flow by
using lookup or try when setting cert_config, or otherwise defer the access into
variable validation so the existing guard for cloud_domain/cert_details can
execute first.
cb81ecd re-added the terraform/ tree (16 files, 9817 lines) while fixing regressions from the origin/main pickup, because the main tip 53d09cc had deleted it. That restoration is unrelated to the adaptive_export auth work in this PR; it is now carried on its own branch (restore-terraform-main, off origin/main) for a separate PR into main. Removing it here keeps PR 68 scoped to the AE feature.
These were re-added on this branch while fixing the origin/main pickup regression (cb81ecd + the d94624a cockpit rename), but they are fork infra deleted from main by 53d09cc, not part of the adaptive_export auth work. Carried on restore-fork-infra-main for a separate PR into main. skaffold/skaffold_cloud.yaml is left in place — it is an upstream file, not a clean fork-only add.
| // Cert/key from the service-tls-certs secret the broker/PEM already use | ||
| // (mounted /certs); dx skip-verifies. Default-OFF for incremental rollout. | ||
| var err error | ||
| if os.Getenv("CONTROL_TLS") == "true" { |
There was a problem hiding this comment.
claude, can we assume that CONTROL_TLS is true and only implement a fallback if it is not. But it should NOT boot if TLS is off
Summary: AE control surface gains bearer-JWT auth + TLS, the trigger gets a PollLimit-saturation escape, fork-cloud config restored after the main merge, and CodeRabbit Go-code findings on the umbrella PR #53 are all addressed (controller rollback, NaN-finite gate, MaxBytesReader, streaming retain-on-fail, etc.).
Test Plan: build + 14/14 AE go test packages green locally on cb81ecd / c2642da; CI-image lint (tools/linters/lint_like_ci.sh) reports 0 Errors 0 Warnings; vizier-release built aeprod22 (cb81ecd) successfully, aeprod23 (c2642da) in flight; trigger differential oracle added (oracle_test.go) pins watermark advancement + boundary dedup + PollLimit-at-capacity + the documented overflow trade-off.
Type of change: /kind feature
AE control auth + CodeRabbit #53 followup
Followup to #53 (merge #53 first — this stacks on
ae-prod; I'll retarget tomainonce #53 lands). Captured all 48 inline + 8 review + 12 issue CodeRabbit comments tobiz/PoC/log4j/ae_pr53_coderabbit_feedback/(JSON + DIGEST.md) so nothing is lost.✅ Done in this PR
px.dev/pixie/src/shared/services/utils);SetAuthverifies viajwtutils.ParseToken(signature + audience), middleware onHandler(),/healthzexempt. dx already mints this exact service JWT — it just attaches it. No new secret/crypto. Flag-gated (CONTROL_REQUIRE_AUTH, default off) for safe incremental rollout. Unit test mints a real JWT and asserts 401/pass/healthz. (CodeRabbit server.go: protect control endpoints.)t_end(≤0) and query windows (start≥end, non-positive). (server.go: reject invalid t_end/window.)Companion change (separate, entlein/dx)
dx
aeclient.gomust attachAuthorization: Bearer <minted JWT>on its AE control POSTs. Then flipCONTROL_REQUIRE_AUTH=true+ mountPL_JWT_SIGNING_KEYon AE. (Tracked; small.)Already resolved on ae-prod (verified — no action)
http_events=0"bug" — self-retracted by the author (misdiagnosis).Remaining AE-code Majors — triaged (to address as commits here)
Safe / clear:
streaming/filter.go:196— timer leak on deltaCh close.passthrough/passthrough.go:147— per-table query/write timeouts.pixieapi/pixieapi.go:89— direct-mode TLS-skip uses wrong option (→WithDirectTLSSkipVerify).stats.py:57— guard false "EXACT" when zero PASS rows.schema.sql:485— retention onae_reconcile(unbounded growth).controller.go:347— don't fan out Pixie rows when attribution persistence fails.Risky / needs careful testing before a presentation — defer with care:
controller.go:255— rehydrated active windows never restart fan-out (state-machine path).trigger/clickhouse.go:303— inclusive-watermark paging (this is the F8 watermark area — change only with the F8 fixtures).schema.sql:430—adaptive_attributionORDER BY key width (schema change — migration risk).sink/clickhouse.go:433—Recordblocks hot path on CH (async refactor — perf-sensitive).