diff --git a/docs/designs/eval-scenarios-table-integration.md b/docs/designs/eval-scenarios-table-integration.md
new file mode 100644
index 0000000000..d39ed7265e
--- /dev/null
+++ b/docs/designs/eval-scenarios-table-integration.md
@@ -0,0 +1,314 @@
+# Eval Scenarios Table — ETL Integration
+
+**Created:** 2026-05-22
+**Status:** Implemented. Phases 1–3 shipped; D5 perf gate not formally benchmarked (see [Implementation tasks](#implementation-tasks))
+**Related:** [eval-etl-engine](./eval-etl-engine.md), [etl-engine](./etl-engine.md), [eval-filtering](./eval-filtering.md)
+**Authors:** Arda
+
+---
+
+## Summary
+
+The `EtlPocScenarios` PoC (the `/etl-poc` page) showed that an evaluation-run
+scenarios table can be fast given a specific strategy: thin rows (identity
+only), page-level bulk hydration into molecule caches, self-resolving cells,
+and ETL-engine-backed predicate filtering.
+
+This doc covers folding that strategy into the **real** eval run scenarios
+table (`EvalRunDetails/Table.tsx`) and retiring the PoC. The work is **phased**:
+a core production table is never replaced in a single big-bang change.
+
+> **Eng review reframe.** The first draft called the data-layer swap a
+> "low-risk mechanical port." That was wrong. The PoC only ever ran against
+> *finished* runs — it fabricates `scenario: {status: "success"}`
+> (`EtlResolvedCell.tsx:135`, `index.tsx:692`). Rendering pending / running /
+> failed scenarios and a real "skeleton while pending" policy is **unbuilt
+> design work**, now scoped into Phase 1. See [Resolved decisions](#resolved-decisions).
+
+---
+
+## Current state — two implementations of one table
+
+| | Production (`EvalRunDetails/Table.tsx`) | PoC (`EtlPocScenarios`) |
+|---|---|---|
+| Store | `evaluationPreviewTableStore` — semi-full rows | `scenarioThinPaginatedStore` — `{key, id, scenarioId}` |
+| Columns | backend metadata (`usePreviewColumns`) | run graph (`useEtlColumns` → `resolveMappings`) |
+| Cell data | per-visible-cell fetch (`useScenarioCellValue`) | `EtlResolvedCell` from molecule caches; `useHydrateScenarios` bulk-fills per page |
+| Filtering | none | predicate bar + ETL viewport-fill loop |
+| Comparison | interleaved rows, 2-4 runs | single run |
+| Live runs | 5s run-status poll; 15-30s `staleTime`; human-only metrics gap-fill | none — assumes terminal data |
+| Fetch path | `fetchEvaluationScenarioWindow` | **same** `fetchEvaluationScenarioWindow` |
+
+`scenarioPaginatedStore.ts`'s own header states the intent: *"replace
+`evaluationPreviewTableStore` with this once the scenarios view is on the
+molecule-cache pattern."* This is that project.
+
+---
+
+## Resolved decisions
+
+| Decision | Resolution |
+|---|----------|
+| **Hydration shape** | Thin rows + self-hydrating cells. The thin row carries identity + `testcaseId` (comparison join key) + `status` (live-update + skeleton) — never column data. |
+| **Column source** | Run graph (`data.steps` + `data.mappings`) via `resolveMappings`. **Correction:** `useEtlColumns` currently drops `group.kind === "other"` columns (`useEtlColumns.tsx:56`, "skip in the test page"). Production must keep them — that shortcut is removed in T2. |
+| **Cell caching** | Same molecules over the same TanStack layer; no net regression *expected* — validated, not asserted, by the Phase 1 perf gate. |
+| **D1 — phasing** | Phase the migration. Phase 1: data-layer swap. Phase 2: filtering. Phase 3: comparison + live + co-consumers. Then retire the PoC. Each phase reviewable and revertable; the table works between phases. |
+| **D2 — comparison display** | Interleaved rows (today's model), not testcase-aligned columns. Compare-mode column set = shared testcase inputs + the **common-evaluator intersection** across compared runs + the standard invocation output. Reuses single-run column derivation. |
+| **D3 — live updates** | Match production's modest bar: run-status poll + page invalidation while non-terminal + human-eval metrics gap-fill. No real scenario streaming. |
+| **D5 — perf gate** | After Phase 1, benchmark the new table vs the current `useScenarioCellValue` table on a 1000+ scenario run with comparison on. A regression stops Phase 2. |
+| **D8 — filter composition** | **Multi-predicate from day 1.** Phase 2 ships multi-condition AND/OR filtering, not the PoC's single predicate. The predicate type generalises to a condition *group* (`{op: "and" \| "or", conditions: RowPredicate[]}`); the filter bar reuses the observability multi-condition filter UI. Closed at Phase 2 start (was the one open decision). |
+
+**No open decisions.**
+
+---
+
+## Architecture — target
+
+The production table adopts the PoC strategy in place. Four layers (per
+`eval-etl-engine.md`):
+
+```
+EvalRunDetails table (OSS UI)
+  └── thin scenarios store ──> ETL filter pipeline ──> rendered viewport
+        (identity + join keys)   (runLoop + filterTransform)   (InfiniteVirtualTable)
+                                        │
+        cells self-resolve from ────────┘
+        molecule caches (results / metrics / testcases / traces)
+        bulk-hydrated per page; cell-materialized on demand
+```
+
+- **Source** — the thin scenarios store; reuses `fetchEvaluationScenarioWindow`.
+- **Transform** — the eval-specific `filterTransform`. "Skeleton while
+  pending" for rows whose slices are not yet hydrated **or whose scenario has
+  not yet run** — these two cases are distinct and both must be handled.
+- **Sink** — the rendered viewport; the loop runs until the viewport fills.
+- **Cells** — `EtlResolvedCell`, resolving from molecule caches.
+
+---
+
+## Phase 1 — column + cell swap
+
+The table's internals, table-only. The focus drawer and `SingleScenarioViewer`
+stay on `useScenarioCellValue` (kept alive) until Phase 3.
+
+> **Implementation-time finding — T1 dropped.** Reading
+> `evaluationPreviewTableStore.ts` confirmed it is *already* a thin store:
+> `PreviewTableRow` carries only identity + `testcaseId` + `status` +
+> `scenarioIndex` + comparison fields — zero column data — and it already does
+> per-eval-type window order (line 114). The PoC's separate
+> `scenarioThinPaginatedStore` exists only to drop a couple of cheap unused
+> fields. **The store stays as-is — there is no T1.** The eng-review outside
+> voice flagged this; a direct read confirmed it.
+
+Phase 1 is the **column + cell swap** in `Table.tsx`. T2 and T3 are **coupled**
+— a column definition carries its own cell `render` function, so the column
+source and the cell renderer swap together in one change.
+
+- **T2 — schema columns (display only).** Wire `useEtlColumns` /
+  `resolveMappings` into `Table.tsx` for the **rendered** columns. **Remove the
+  "other"-column drop** (`useEtlColumns.tsx:56`) so the visible set matches
+  today — note this ripples: `ColumnLeaf["kind"]`, `EtlResolvedCell`'s
+  `columnKind`, and `useCellMaterialization`'s slice map all need an "other"
+  case. `usePreviewColumns` / `usePreviewTableData` / `columnResult` **stay
+  alive** — the CSV export (`exportResolveValue`, `columnLookupMap`) is keyed
+  off `columnResult` ids, which differ from `useEtlColumns` keys. Full
+  retirement of the old column path moves to Phase 3 with the export
+  migration (T5). Two column systems coexist transitionally (accepted under D1).
+- **T3 — self-hydrating cells + non-terminal rendering.** `EtlResolvedCell` +
+  `useHydrateScenarios` + `useCellMaterialization`, against the existing
+  `evaluationPreviewTableStore` rows (keyed by `scenarioId`). **Not** purely
+  mechanical: add real rendering for pending / running / failed / partial
+  scenarios, and a "skeleton while pending" policy that distinguishes
+  *slice-not-hydrated* from *scenario-not-run*. The PoC's `status: "success"`
+  fabrication is removed.
+
+**Perf gate (D5)** — after T2+T3 land: benchmark the new table against the
+current one on a 1000+ scenario run, comparison on. Regression → stop, rethink.
+
+---
+
+## Phase 2 — filtering
+
+- **T4 — multi-predicate filtering (D8).** Ships multi-condition AND/OR
+  from day 1 — not the PoC's single predicate. `filterSchema` derives
+  filterable fields: columns → evaluator steps → evaluator output schemas
+  → typed fields + type-matched operators (`eval-filtering.md` D4). The
+  predicate generalises from `RowPredicate` to a condition *group*
+  (`{op: "and" | "or", conditions: RowPredicate[]}`, one nesting level for
+  v1 — flat AND/OR, no arbitrary trees); `predicateToEntitySlices` takes
+  the union of every condition's slices. The `filterTransform` evaluates
+  the group per row against hydrated metrics; the loop runs until the
+  viewport fills. The filter bar reuses the observability multi-condition
+  filter UI. **Reuse `withRateLimitRetry`** for the scan — a low-hit-ratio
+  filter scans many scenario + metric pages and EE throttling will 429 it
+  (the batch-add lesson).
+
+---
+
+## Phase 3 — comparison, live updates, co-consumers
+
+- **T5 — comparison.** A **build, not a port** — the PoC has zero multi-run
+  code. Per compared run: a second store scope, a **schema fetch** (needed for
+  the common-evaluator intersection), and per-run hydration of result slices
+  (`testcase_id` lives on results). Then align compared runs to the *filtered*
+  main rows by `testcase_id` (the `mergedRows` logic exists in production —
+  port it over the thin/cache model). The **CSV export path**
+  (`Table.tsx:542`) rebuilds the merge logic and migrates here too.
+- **T6 — live updates.** Run-status poll + non-terminal page invalidation +
+  human-eval metrics gap-fill.
+- **T8 — co-consumer migration.** Migrate the focus drawer (`focusDrawerAtom`,
+  `FocusDrawerHeader`, `FocusDrawerSidePanel`, `FocusDrawer`) and
+  `SingleScenarioViewerPOC` off `useScenarioCellValue` + `evaluationPreviewTableStore`,
+  then delete `useScenarioCellValue`.
+
+---
+
+## Retire the PoC
+
+- **T7** — delete `EtlPocScenarios/` and the `/etl-poc` routes (oss + ee).
+  Gated on Phase 3 parity verified.
+
+---
+
+## Design — interaction states & filter UX
+
+From the design review (focused — the migration preserves the table's visual
+design; these are the genuinely new design surfaces).
+
+**Cell states:**
+- *Skeleton (not hydrated)* — reuse the PoC's `EtlSkeletonCell`: a fixed-height
+  placeholder bar, identical row height to a populated row, so there is no
+  layout jump when data lands.
+- *Non-terminal scenarios (running / failed / pending)* — match production's
+  existing live-table rendering. Hard rule: a *running* cell must read as
+  in-progress, visually distinct from a missing value — never a bare "—" for a
+  running scenario (the user must be able to tell "computed nothing" from
+  "still computing").
+
+**Filter states:**
+- *Scanning* — a live hit-ratio counter ("Scanned N / matched M", from
+  `hitRatioAtom`), not a silent spinner. A picky filter scans thousands of
+  scenarios to surface a few; the counter explains the wait and keeps trust.
+- *No match* — a real empty state: "No scenarios match this filter" + a
+  one-click **Clear filter** action. Not "No items found."
+- *Rate-limited / scan failed* — keep the partial viewport visible with a
+  non-blocking inline indicator ("Filtering paused — retrying…"). Never a
+  blocking overlay.
+
+**Filter bar** — lives in the eval run details header row, following the
+observability `Filters` placement. Multi-predicate AND/OR composition (D8) —
+reuses the observability multi-condition filter UI.
+
+## Test plan
+
+```
+PLANNED COMPONENT                                    TESTS
+T1 thin store          [DROPPED] `evaluationPreviewTableStore` is already thin (D6)
+T2 schema columns      [GAP][CRITICAL][REGRESSION] resolveMappings column set
+                              == usePreviewColumns for auto/human/online,
+                              "other" columns INCLUDED — before deleting the old path
+T3 cells               [GAP] resolve from caches; pending/running/failed render
+                       [GAP] skeleton-while-pending: not-hydrated vs not-run
+T4 filtering           [GAP] filterSchema typed fields; multi-predicate
+                              AND/OR filterTransform — match/no-match/pending
+                              + group semantics; [→E2E] multi-condition → rows
+T5 comparison          [GAP] compare-run schema fetch; testcase_id join;
+                              common-evaluator intersection; [→E2E] compare+filter
+T6 live updates        [GAP] poll stops at terminal; page invalidation; gap-fill
+T8 co-consumers        [VERIFIED] focus drawer + scenario viewer render
+                              unchanged — independent old data path (D9)
+```
+
+Pure logic (`filterTransform`, `filterSchema` derivation, the comparison
+testcase join) is unit-tested in `@agenta/entities` with vitest (the batch-add
+harness). The two `[→E2E]` flows go to Playwright. The T2 and T8 **regression
+guards are mandatory** — they protect a user-visible column set and two live
+co-consumers.
+
+---
+
+## Edge cases & constraints
+
+- **Non-terminal scenarios** — pending / running / failed / partial rows must
+  render; the PoC never did. Distinguish slice-not-hydrated from
+  scenario-not-run.
+- **"Other" columns** — `useEtlColumns` must keep them (T2).
+- **Eval types** — auto / human / online all derive columns from the run
+  schema; online fetches `descending`.
+- **Filtered + comparison** — filtering the main run re-drives compare
+  alignment; a filtered-out main row drops its compare group.
+- **Filter scan throttling** — reuse `withRateLimitRetry` (T4).
+- **Large-run memory** — within a run, bulk-hydrate fills caches;
+  `useScopeChangeEviction` only evicts on run change. Verify at 10k scenarios;
+  per-chunk eviction exists if needed.
+- **Testset mismatch** — compared runs not sharing the main testset produce an
+  empty testcase join (the candidate filter already guards this).
+- **`EvalRunDetails` / `EvalRunDetails2` split** — do not deepen it; touched
+  code consolidates into `EvalRunDetails/`.
+
+---
+
+## NOT in scope
+
+- **Testcase-aligned comparison columns** — interleaved rows for now (D2).
+- **Real scenario streaming** — match production's poll bar (D3).
+- **Cross-run filter predicates** ("main high, run B regressed") — filter the
+  main run only; cross-run is a v2 feature.
+- **Consolidating `EvalRunDetails2`** — only the touched code moves.
+- **Backend filter param** — client-side filter is v1 (`eval-filtering.md`).
+
+## What already exists (reused, not rebuilt)
+
+- The ETL engine + generic primitives (`@agenta/entities/etl`), already built
+  and tested.
+- `evaluationPreviewTableStore` — already a thin store (identity + `status`,
+  no column data, per-eval-type order); kept as-is, no swap needed (T1 dropped).
+- `fetchEvaluationScenarioWindow` — the scenario fetch; reused unchanged.
+- `mergedRows` testcase_id-join alignment — ported, not reinvented.
+- `withRateLimitRetry` — reused for the filter scan.
+- The PoC's `useHydrateScenarios` / `useEtlColumns` / `EtlResolvedCell` /
+  `useCellMaterialization` — ported (with the corrections above).
+
+---
+
+## Implementation tasks
+
+**Phase 1 — column + cell swap** (T1 dropped — `evaluationPreviewTableStore` is already thin; see Phase 1)
+- [x] **T2 (P1)** — schema columns for the **rendered** table; keeps "other" columns; **column-parity regression test** (`groupRunColumns.test.ts`). `usePreviewColumns`/`columnResult` kept alive for the export path. Landed with T3.
+- [x] **T3 (P1)** — self-hydrating cells **plus non-terminal scenario rendering + skeleton-while-pending**. Landed with T2.
+- [ ] **Perf gate (P1)** — benchmark vs the old table, 1000+ scenarios, comparison on. **Gates T4.**
+
+**Phase 2 — filtering**
+- [x] **T4 (P1)** — multi-predicate AND/OR filtering (D8): `filterSchema` + `evaluateRowFilter` / `PredicateGroup` core (entities, unit-tested) + a popover filter bar in the run header + confirmed-match incremental rendering + viewport-fill loop. Column value types come from the evaluator output schema. v1 withholds testset/application columns behind a UI allowlist and `in`/`nin` operators from the UI.
+
+**Phase 3 — comparison, live, co-consumers**
+- [x] **T5 (P1)** — comparison: testcase-id join on the filtered base run; compare runs are eagerly paged while a filter is active so each matched base row finds its counterpart. Compare rows resolve against the base run's schema (best-effort, per the Phase-1 note).
+- [x] **T6 (P2)** — live updates (`useScenarioLiveUpdates`): while the run is non-terminal, periodically refetch the loaded scenario pages (row statuses) and evict + re-prefetch the results / metrics molecule caches of running / just-finished scenarios; one final pass at terminal, then stop.
+- [x] **T8 (P1)** — verified: the focus drawer + `SingleScenarioViewer` are **not regressed** by the cell swap — both run on the fully-preserved, independent old data path (`scenarioColumnValues.ts` + its dependency atoms), fetching their own values regardless of what the table renders. Full ETL migration is **deferred** (see D9): `useScenarioCellValue` cannot be deleted while the static invocation-metrics group (kept in the table, D7) and the CSV export both still depend on the old-path cells.
+
+**Cleanup**
+- [x] **T7** — `EtlPocScenarios/` + `/etl-poc` routes (oss + ee) deleted. Done ahead of the Phase-3 gate at the maintainer's direction: production has its own copies of the ported hooks, so the PoC was dead test-page code.
+
+**Open / advisory**
+- The **D5 perf gate** was not formally benchmarked — the table was QA'd functionally throughout Phase 1 + 2 instead.
+
+## GSTACK REVIEW REPORT
+
+| Review | Trigger | Why | Runs | Status | Findings |
+|--------|---------|-----|------|--------|----------|
+| CEO Review | `/plan-ceo-review` | Scope & strategy | 0 | — | — |
+| Codex Review | `/codex review` | Independent 2nd opinion | 0 | — | — |
+| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 1 | clean | 5 decisions resolved, 0 critical gaps open |
+| Design Review | `/plan-design-review` | UI/UX gaps | 1 | clean | 5/10 → 9/10, 2 decisions, focused on states |
+| DX Review | `/plan-devex-review` | Developer experience gaps | 0 | — | — |
+
+- **OUTSIDE VOICE (eng):** Claude subagent — caught 5 real gaps the section review under-weighted: T1-T3 mislabeled as "low-risk mechanical port" (non-terminal rendering is unbuilt), `useEtlColumns` drops "other" columns (guaranteed regression), the perf premise was asserted not measured (→ D5 perf gate), T5 comparison is a build not a port (+ unlisted compare-schema fetch), the CSV export path was missed. All folded into the plan.
+- **ENG DECISIONS:** D1 phase the migration · D2 interleaved rows + common-evaluator intersection columns · D3 match production's live bar · D4 outside voice ran · D5 perf-validation gate after Phase 1.
+- **DESIGN DECISIONS:** focused review (migration preserves the visual design) · live hit-ratio counter for filter scanning · interaction-state specs added (skeleton, non-terminal cells, filter no-match empty state, rate-limited indicator).
+- **D6 (implementation-time finding):** starting Phase 1 confirmed `evaluationPreviewTableStore` is already a thin store (identity + status, no column data, per-eval-type order). **T1 is dropped** — Phase 1 is the coupled T2+T3 column+cell swap against the existing store. Confirms the eng-review outside voice's "T1 re-implements an existing store" point.
+- **D7 (implementation-time finding):** reading `Table.tsx` showed the CSV export path (`exportResolveValue`, `columnLookupMap`, `loadAllPagesBeforeExport`) is keyed off `columnResult` column ids, which differ from `useEtlColumns` keys. **Phase 1 swaps display columns only** and keeps `usePreviewColumns`/`columnResult` alive for export; the old column path fully retires in Phase 3 with the export migration (T5). The "other"-column un-drop ripples into `ColumnLeaf`, `EtlResolvedCell`, and `useCellMaterialization`.
+- **D8 (Phase 2 decision):** filter composition resolved — **multi-predicate AND/OR from day 1**, not the PoC's single predicate. The predicate type generalises to a flat condition group; the filter bar reuses the observability multi-condition UI.
+- **D9 (implementation-time finding):** **T8 co-consumers verified, full migration deferred.** Tracing `scenarioColumnValues.ts` and `SingleScenarioViewerPOC` confirmed the focus drawer and `SingleScenarioViewer` resolve their values through the old data path's independent atom families — they do not depend on the table's cells and so are **not regressed** by the T2+T3 cell swap. The design's "delete `useScenarioCellValue`" goal is blocked: that hook still backs `MetricCell`/`InputCell`/`InvocationCell`, which render the static invocation-metrics group kept in the production table (`metricGroupKeys`, D7) and feed the CSV export. A full ETL rebuild of the 1551-line `FocusDrawer` (incl. compare mode) is out of proportion to "no regression to fix" and would not enable the deletion. T8 closes as verified; the migration moves to the eventual old-column-path retirement.
+- **UNRESOLVED:** 0 — filter composition closed (D8). No open decisions.
+- **STATUS:** Phases 1–3 shipped — T2+T3 column/cell swap, T4 multi-predicate filtering, T5 comparison (testcase-id join), T6 live updates, T7 PoC retired, T8 co-consumers verified (no regression; full migration deferred per D9). Feature complete on `fe-experiment/etl-eval-scenario-filtering`.
+- **VERDICT:** ENG + DESIGN REVIEW CLEARED — all phases shipped.
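The flat AND/OR condition group described under D8 and T4 in the design doc above could be evaluated per row roughly as follows. This is a minimal sketch, not the shipped code: the `RowPredicate` fields, the `RowValues` shape, the three-valued `Verdict`, the empty-group rule, and the name `evaluateGroupSketch` are all assumptions made for illustration, and the real `evaluateRowFilter` / `PredicateGroup` in `@agenta/entities` may differ.

```typescript
// Hypothetical shapes for illustration only; the real RowPredicate may differ.
type Operator = "eq" | "gt" | "lt"

interface RowPredicate {
    field: string
    operator: Operator
    value: number | string
}

// One nesting level, as in D8: a flat list of conditions joined by and/or.
interface PredicateGroup {
    op: "and" | "or"
    conditions: RowPredicate[]
}

// "pending" means a value the condition needs has not been hydrated yet.
type Verdict = "match" | "no-match" | "pending"

// Hydrated values for one row; an absent field means "slice not loaded yet".
type RowValues = Record<string, number | string | undefined>

function evaluateCondition(p: RowPredicate, row: RowValues): Verdict {
    const actual = row[p.field]
    if (actual === undefined) return "pending"
    if (p.operator === "eq") return actual === p.value ? "match" : "no-match"
    // In this sketch, gt / lt are only defined for numeric operands.
    if (typeof actual !== "number" || typeof p.value !== "number") return "no-match"
    const expected = p.value
    if (p.operator === "gt") return actual > expected ? "match" : "no-match"
    return actual < expected ? "match" : "no-match"
}

function evaluateGroupSketch(group: PredicateGroup, row: RowValues): Verdict {
    // Assumption: an empty group filters nothing out.
    if (group.conditions.length === 0) return "match"
    const verdicts = group.conditions.map((c) => evaluateCondition(c, row))
    const has = (v: Verdict) => verdicts.some((x) => x === v)
    if (group.op === "and") {
        // One definite failure decides an AND even if other fields are pending.
        if (has("no-match")) return "no-match"
        return has("pending") ? "pending" : "match"
    }
    // One definite success decides an OR even if other fields are pending.
    if (has("match")) return "match"
    return has("pending") ? "pending" : "no-match"
}

const exampleGroup: PredicateGroup = {
    op: "and",
    conditions: [
        {field: "score", operator: "gt", value: 0.5},
        {field: "label", operator: "eq", value: "ok"},
    ],
}
const exampleVerdict = evaluateGroupSketch(exampleGroup, {score: 0.9})
```

Treating an unhydrated field as `pending` rather than `no-match` is one way to keep a row out of the confirmed-match set until its slices arrive, which fits the confirmed-match incremental rendering listed for T4.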
diff --git a/web/oss/src/components/EvalRunDetails/Table.tsx b/web/oss/src/components/EvalRunDetails/Table.tsx
index bc5d6560e4..a5ef706696 100644
--- a/web/oss/src/components/EvalRunDetails/Table.tsx
+++ b/web/oss/src/components/EvalRunDetails/Table.tsx
@@ -1,8 +1,9 @@
-import {useCallback, useMemo, useRef} from "react"
+import {useCallback, useEffect, useMemo, useRef} from "react"
 
+import type {RunSchema} from "@agenta/entities/evaluationRun/etl"
 import {message} from "@agenta/ui/app-message"
 import clsx from "clsx"
-import {useAtomValue, useStore} from "jotai"
+import {useAtomValue, useSetAtom, useStore} from "jotai"
 
 import VirtualizedScenarioTableAnnotateDrawer from "@/oss/components/EvalRunDetails/components/AnnotateDrawer/VirtualizedScenarioTableAnnotateDrawer"
 import {
@@ -19,11 +20,20 @@ import {
 import useComparisonPaginations from "../EvalRunDetails2/hooks/useComparisonPaginations"
 
 import {MAX_COMPARISON_RUNS, compareRunIdsAtom, getComparisonColor} from "./atoms/compare"
+import {effectiveProjectIdAtom} from "./atoms/run"
 import {runDisplayNameAtomFamily} from "./atoms/runDerived"
 import type {EvaluationTableColumn} from "./atoms/table"
-import {DEFAULT_SCENARIO_PAGE_SIZE} from "./atoms/table"
+import {DEFAULT_SCENARIO_PAGE_SIZE, evaluationRunQueryAtomFamily} from "./atoms/table"
 import type {PreviewTableRow} from "./atoms/tableRows"
 import ScenarioColumnVisibilityPopoverContent from "./components/columnVisibility/ColumnVisibilityPopoverContent"
+import {CellMaterializerContext} from "./etl/cellMaterializerContext"
+import {scenarioFilterStatusAtomFamily} from "./etl/scenarioFilterState"
+import {useCellMaterialization} from "./etl/useCellMaterialization"
+import {useEtlColumns} from "./etl/useEtlColumns"
+import {useHydrateScenarios} from "./etl/useHydrateScenarios"
+import {useScenarioFilter} from "./etl/useScenarioFilter"
+import {useScenarioLiveUpdates} from "./etl/useScenarioLiveUpdates"
+import {useScopeChangeEviction} from "./etl/useScopeChangeEviction"
 import {
     evaluationPreviewDatasetStore,
     evaluationPreviewTableStore,
@@ -87,6 +97,156 @@ const EvalRunDetailsTable = ({
 
     const previewColumns = usePreviewColumns({columnResult, evaluationType})
 
+    // ── ETL schema columns + self-hydrating cells (Phase 1 — T2 + T3) ──
+    // The schema columns (testset / application / evaluator / metrics /
+    // other) are derived from the run graph and rendered by cells that
+    // resolve from molecule caches. `usePreviewColumns` / `columnResult`
+    // stay alive above for the CSV export path (keyed off `columnResult`
+    // ids) — only the *display* columns swap here.
+    const effectiveProjectId = useAtomValue(effectiveProjectIdAtom)
+    const projectId = _projectId ?? effectiveProjectId
+
+    const runQuery = useAtomValue(useMemo(() => evaluationRunQueryAtomFamily(runId), [runId]))
+    // Run-level status: drives the live-update loop below. While the run
+    // is non-terminal, completed scenarios must refresh; once it is
+    // terminal, the loop stops.
+    const runStatus = useMemo<string | null>(
+        () => runQuery.data?.rawRun?.status ?? runQuery.data?.camelRun?.status ?? null,
+        [runQuery.data],
+    )
+    const runSchema = useMemo<RunSchema | null>(() => {
+        const data = runQuery.data?.rawRun?.data
+        const steps = data?.steps
+        const mappings = data?.mappings
+        if (!Array.isArray(steps) || !Array.isArray(mappings)) return null
+        return {steps, mappings}
+    }, [runQuery.data])
+
+    // Phase 2 — multi-predicate filtering (D8). Filters the base run's
+    // rows; each comparison group follows its base row.
+    const {filteredBaseRows, effectiveFilter, active, confirmedMatchCount, isFilling} =
+        useScenarioFilter({
+            projectId,
+            runId,
+            schema: runSchema,
+            baseRows: basePagination.rows,
+            loadNextPage: basePagination.loadNextPage,
+            hasMore: basePagination.paginationInfo.hasMore,
+            isFetching: basePagination.paginationInfo.isFetching,
+        })
+
+    // Comparison + filter: the viewport-fill loop scans base pages, but a
+    // strict filter means the table may never scroll — so compare runs
+    // would stay on page 1 and the testcase-id join in `mergedRows` would
+    // miss most counterparts. While a filter is active, eagerly load every
+    // compare-run page so each matched base row can find its counterpart.
+    useEffect(() => {
+        if (!active) return
+        comparePaginations.forEach((pagination, idx) => {
+            if (!compareSlots[idx]) return
+            const {hasMore, isFetching} = pagination.paginationInfo
+            if (hasMore && !isFetching) pagination.loadNextPage()
+        })
+    }, [active, comparePaginations, compareSlots])
+
+    const etlColumns = useEtlColumns({projectId, runId, schema: runSchema})
+
+    // Page-level hydrate — predicate-aware: with an active filter it
+    // fetches the entity slices the filter needs to be evaluated; with no
+    // filter it is inert and cells materialize their own visible data.
+    const hydration = useHydrateScenarios({
+        projectId,
+        runId,
+        rows: basePagination.rows,
+        schema: runSchema,
+        predicate: effectiveFilter,
+        sliceMode: "auto",
+    })
+
+    // The filter scan is actively working — the viewport-fill loop still
+    // wants more pages (`isFilling`), a page is being fetched, or a
+    // hydrate batch is in flight. Once enough matches are found the loop
+    // stops, so this goes false even though the dataset has more pages
+    // (`hasMore`) — it only shows "scanning" while real work is happening.
+    const scanInProgress =
+        isFilling || (active && (basePagination.paginationInfo.isFetching || hydration.isHydrating))
+    // Nothing has confirmed-matched yet — show the full empty + loading
+    // overlay. Once the first match lands, rows show and grow (no overlay,
+    // no flicker — `filteredBaseRows` only ever grows during a scan).
+    const isScanning = scanInProgress && confirmedMatchCount === 0
+
+    // Publish the scan status so the filter bar — which lives in the run
+    // header, a separate part of the tree — can show the match count.
+    const setFilterStatus = useSetAtom(
+        useMemo(() => scenarioFilterStatusAtomFamily(runId), [runId]),
+    )
+    useEffect(() => {
+        setFilterStatus({matchCount: confirmedMatchCount, scanning: scanInProgress})
+    }, [setFilterStatus, confirmedMatchCount, scanInProgress])
+
+    // Cell-side lazy materializer — coalesces visible cells' slice
+    // requests into one bulk fetch per (slice, run).
+    const cellMaterializer = useCellMaterialization({projectId, runId})
+
+    // Evict molecule caches written for the outgoing run on scope change.
+    useScopeChangeEviction({projectId, runId})
+
+    // Live updates (T6) — while the run is executing, periodically refetch
+    // the loaded scenario pages (row statuses) and refresh the molecule
+    // caches of running / just-finished scenarios so completed cells leave
+    // the "Running" state. Inert once the run is terminal.
+    useScenarioLiveUpdates({projectId, runId, runStatus, pageSize})
+
+    // Production metric-group ids. The scenario table's "Metrics" group is
+    // the static invocation metrics (cost / duration / tokens) — injected
+    // by the backend-metadata path, not run-mapping-derived — so it is kept
+    // as-is rather than replaced by ETL columns.
+    const metricGroupKeys = useMemo(
+        () =>
+            new Set(
+                (columnResult?.groups ?? [])
+                    .filter((g) => g.kind === "metric")
+                    .map((g) => String(g.id)),
+            ),
+        [columnResult?.groups],
+    )
+
+    // Final rendered column set: production meta columns (index / status,
+    // timestamp, action), the column-visibility trigger, and the static
+    // metric group(s) are kept; the testset / application / evaluator /
+    // other schema groups are replaced by the ETL-derived ones. While the
+    // run schema is still loading, the production columns are used whole
+    // (their skeleton groups cover the gap).
+    const tableColumns = useMemo(() => {
+        const src = previewColumns.columns
+        if (!runSchema || etlColumns.length === 0) return src
+        const out: typeof src = []
+        let inserted = false
+        for (const col of src) {
+            const children = (col as {children?: unknown[]}).children
+            const isGroup = Array.isArray(children) && children.length > 0
+            const key = String((col as {key?: unknown}).key ?? "")
+            const isMetricGroup = isGroup && metricGroupKeys.has(key)
+            if (isGroup && !isMetricGroup) {
+                if (!inserted) {
+                    out.push(...(etlColumns as typeof src))
+                    inserted = true
+                }
+                // drop the production schema group column (replaced by ETL)
+            } else {
+                // keep: meta / visibility columns AND static metric groups
+                out.push(col)
+            }
+        }
+        if (!inserted) {
+            // No replaceable production group columns — insert ETL groups
+            // before the trailing column-visibility trigger.
+            const at = Math.max(out.length - 1, 0)
+            out.splice(at, 0, ...(etlColumns as typeof src))
+        }
+        return out
+    }, [previewColumns.columns, etlColumns, runSchema, metricGroupKeys])
+
     // Inject synthetic columns for comparison exports (do not render in UI)
     const exportColumns = useMemo(() => {
         const hasCompareRuns = compareSlots.some(Boolean)
@@ -125,7 +285,7 @@ const EvalRunDetailsTable = ({
 
     const mergedRows = useMemo(() => {
         if (!compareSlots.some(Boolean)) {
-            return basePagination.rows.map((row) => ({
+            return filteredBaseRows.map((row) => ({
                 ...row,
                 baseScenarioId: row.scenarioId ?? row.id,
                 compareIndex: 0,
@@ -133,7 +293,7 @@ const EvalRunDetailsTable = ({
             }))
         }
 
-        const baseRows = basePagination.rows.map((row) => ({
+        const baseRows = filteredBaseRows.map((row) => ({
             ...row,
             baseScenarioId: row.scenarioId ?? row.id,
             compareIndex: 0,
@@ -215,7 +375,7 @@ const EvalRunDetailsTable = ({
         })
 
         return result
-    }, [basePagination.rows, compareSlots, compareRowsBySlot])
+    }, [filteredBaseRows, compareSlots, compareRowsBySlot])
 
     const handleRowClick = useCallback(
         (record: TableRowData) => {
@@ -276,6 +436,9 @@ const EvalRunDetailsTable = ({
 
     const paginationForShell = useMemo<TableFeaturePagination<TableRowData>>(
         () => ({
+            // `mergedRows` only grows during a scan (it holds confirmed
+            // matches only), so it can be handed to the table directly;
+            // no empty-placeholder swap is needed.
             rows: mergedRows,
             loadNextPage: handleLoadMore,
             resetPages: handleResetPages,
@@ -828,74 +991,94 @@ const EvalRunDetailsTable = ({
     const hasCompareRuns = compareSlots.some(Boolean)
 
     return (
-        <section className="bg-zinc-1 w-full h-full overflow-hidden flex flex-col px-2">
-            <div className="w-full grow min-h-0 overflow-auto">
-                <InfiniteVirtualTableFeatureShell<TableRowData>
-                    datasetStore={evaluationPreviewDatasetStore}
-                    tableScope={tableScope}
-                    store={store}
-                    columns={previewColumns.columns}
-                    rowKey={(record) => record.key}
-                    tableClassName={clsx(
-                        "agenta-scenario-table",
-                        `agenta-scenario-table--row-${rowHeight}`,
-                    )}
-                    resizableColumns
-                    useSettingsDropdown
-                    settingsDropdownMenuItems={rowHeightMenuItems}
-                    columnVisibilityMenuRenderer={(
-                        controls,
-                        close,
-                        {scopeId, onExport, isExporting},
-                    ) => (
-                        <ScenarioColumnVisibilityPopoverContent
-                            controls={controls}
-                            onClose={close}
-                            scopeId={scopeId}
-                            runId={runId}
-                            evaluationType={evaluationType}
-                            onExport={onExport}
-                            isExporting={isExporting}
-                        />
-                    )}
-                    pagination={paginationForShell}
-                    exportOptions={exportOptions}
-                    tableProps={{
-                        rowClassName: (record) =>
-                            clsx("scenario-row", {
-                                "scenario-row--comparison": record.isComparisonRow,
-                            }),
-                        size: "small",
-                        sticky: true,
-                        virtual: true,
-                        bordered: true,
-                        tableLayout: "fixed",
-                        onRow: (record) => {
-                            const backgroundColor = hasCompareRuns
-                                ? getComparisonColor(
-                                      typeof record.compareIndex === "number"
-                                          ? record.compareIndex
-                                          : 0,
-                                  )
-                                : "#fff"
-
-                            return {
-                                onClick: (event) => {
-                                    const target = event.target as HTMLElement | null
-                                    if (target?.closest("[data-ivt-stop-row-click]")) return
-                                    handleRowClick(record as TableRowData)
-                                },
-                                className: clsx({
-                                    "comparison-row": record.isComparisonRow,
+        <CellMaterializerContext.Provider value={cellMaterializer}>
+            <section className="bg-zinc-1 w-full h-full overflow-hidden flex flex-col px-2">
+                <div className="w-full grow min-h-0 overflow-auto">
+                    <InfiniteVirtualTableFeatureShell<TableRowData>
+                        /*
+                         * Remount on filter change. Applying a filter
+                         * shrinks the row set sharply; remounting resets
+                         * the virtual render window to the new length so
+                         * its row renderer can't index past the end.
+                         * Column visibility survives (localStorage-backed).
+                         */
+                        key={`scenario-table-${runId}-${JSON.stringify(effectiveFilter)}`}
+                        datasetStore={evaluationPreviewDatasetStore}
+                        tableScope={tableScope}
+                        store={store}
+                        columns={tableColumns}
+                        /*
+                         * Defensive rowKey — antd's virtual table can hand
+                         * this an out-of-range `undefined` record for a
+                         * frame while the filtered row set is shrinking.
+                         */
+                        rowKey={(record, index) => record?.key ?? `__phantom_${index ?? 0}`}
+                        tableClassName={clsx(
+                            "agenta-scenario-table",
+                            `agenta-scenario-table--row-${rowHeight}`,
+                        )}
+                        resizableColumns
+                        useSettingsDropdown
+                        settingsDropdownMenuItems={rowHeightMenuItems}
+                        columnVisibilityMenuRenderer={(
+                            controls,
+                            close,
+                            {scopeId, onExport, isExporting},
+                        ) => (
+                            <ScenarioColumnVisibilityPopoverContent
+                                controls={controls}
+                                onClose={close}
+                                scopeId={scopeId}
+                                runId={runId}
+                                evaluationType={evaluationType}
+                                onExport={onExport}
+                                isExporting={isExporting}
+                            />
+                        )}
+                        pagination={paginationForShell}
+                        exportOptions={exportOptions}
+                        tableProps={{
+                            rowClassName: (record) =>
+                                clsx("scenario-row", {
+                                    "scenario-row--comparison": record.isComparisonRow,
                                 }),
-                                style: backgroundColor ? {backgroundColor} : undefined,
-                            }
-                        },
-                    }}
-                />
-            </div>
-            <VirtualizedScenarioTableAnnotateDrawer runId={runId} />
-        </section>
+                            size: "small",
+                            sticky: true,
+                            virtual: true,
+                            bordered: true,
+                            tableLayout: "fixed",
+                            // One steady overlay while a filter scans for
+                            // its first match — replaces per-page flicker.
+                            loading: isScanning
+                                ? {spinning: true, tip: "Scanning for matches…"}
+                                : false,
+                            onRow: (record) => {
+                                const backgroundColor = hasCompareRuns
+                                    ? getComparisonColor(
+                                          typeof record.compareIndex === "number"
+                                              ? record.compareIndex
+                                              : 0,
+                                      )
+                                    : "#fff"
+
+                                return {
+                                    onClick: (event) => {
+                                        const target = event.target as HTMLElement | null
+                                        if (target?.closest("[data-ivt-stop-row-click]")) return
+                                        handleRowClick(record as TableRowData)
+                                    },
+                                    className: clsx({
+                                        "comparison-row": record.isComparisonRow,
+                                    }),
+                                    style: backgroundColor ? {backgroundColor} : undefined,
+                                }
+                            },
+                        }}
+                    />
+                </div>
+                <VirtualizedScenarioTableAnnotateDrawer runId={runId} />
+            </section>
+        </CellMaterializerContext.Provider>
     )
 }
 
diff --git a/web/oss/src/components/EvalRunDetails/atoms/metrics.ts b/web/oss/src/components/EvalRunDetails/atoms/metrics.ts
index 4c98c6d5e8..cce38b7c9d 100644
--- a/web/oss/src/components/EvalRunDetails/atoms/metrics.ts
+++ b/web/oss/src/components/EvalRunDetails/atoms/metrics.ts
@@ -13,6 +13,7 @@ import {getProjectValues} from "@/oss/state/project"
 import {previewEvalTypeAtom} from "../state/evalType"
 import {resolveValueBySegments, splitPath} from "../utils/valueAccess"
 
+import {isTerminalStatus} from "./compare"
 import {
     createMetricProcessor,
     isLegacyValueLeaf,
@@ -782,6 +783,15 @@ export const evaluationMetricQueryAtomFamily = atomFamily(
             const batcher = get(evaluationMetricBatcherFamily({runId: effectiveRunId}))
             const projectId = resolveProjectId(get)
 
+            // While the run is still executing, poll so a completing
+            // scenario's metrics surface in the table cells + focus drawer
+            // without a manual reload. Stops once the run is terminal.
+            const runQuery = effectiveRunId
+                ? get(evaluationRunQueryAtomFamily(effectiveRunId))
+                : undefined
+            const runStatus = runQuery?.data?.rawRun?.status ?? runQuery?.data?.camelRun?.status
+            const runTerminal = isTerminalStatus(runStatus)
+
             return {
                 queryKey: ["preview", "evaluation-metric", effectiveRunId, projectId, scenarioId],
                 enabled: Boolean(projectId && effectiveRunId && batcher && scenarioId),
@@ -789,6 +799,7 @@ export const evaluationMetricQueryAtomFamily = atomFamily(
                 gcTime: 5 * 60 * 1000,
                 refetchOnWindowFocus: false,
                 refetchOnReconnect: false,
+                refetchInterval: runTerminal ? false : 5000,
                 // Enable structural sharing to prevent unnecessary re-renders when data hasn't changed
                 structuralSharing: true,
                 queryFn: async () => {
diff --git a/web/oss/src/components/EvalRunDetails/atoms/scenarioSteps.ts b/web/oss/src/components/EvalRunDetails/atoms/scenarioSteps.ts
index 3cf86844fc..aca7afb8b2 100644
--- a/web/oss/src/components/EvalRunDetails/atoms/scenarioSteps.ts
+++ b/web/oss/src/components/EvalRunDetails/atoms/scenarioSteps.ts
@@ -8,7 +8,9 @@ import type {IStepResponse} from "@/oss/lib/evaluations"
 import {snakeToCamelCaseKeys} from "@/oss/lib/helpers/casing"
 import {getProjectValues} from "@/oss/state/project"
 
+import {isTerminalStatus} from "./compare"
 import {activePreviewRunIdAtom, effectiveProjectIdAtom} from "./run"
+import {evaluationRunQueryAtomFamily} from "./table/run"
 import type {ScenarioStepsBatchResult} from "./types"
 
 const scenarioStepsBatcherCache = new Map<string, BatchFetcher<string, ScenarioStepsBatchResult>>()
@@ -128,11 +130,21 @@ export const scenarioStepsQueryFamily = atomFamily(
             const effectiveRunId = resolveEffectiveRunId(get, runId)
             const batcher = get(scenarioStepsBatcherFamily({runId: effectiveRunId}))
 
+            // While the run is still executing, poll so the focus drawer /
+            // scenario viewer pick up a scenario's results as it completes.
+            // Stops once the run is terminal.
+            const runQuery = effectiveRunId
+                ? get(evaluationRunQueryAtomFamily(effectiveRunId))
+                : undefined
+            const runStatus = runQuery?.data?.rawRun?.status ?? runQuery?.data?.camelRun?.status
+            const runTerminal = isTerminalStatus(runStatus)
+
             return {
                 queryKey: ["preview", "scenario-steps", effectiveRunId, scenarioId],
                 enabled: Boolean(effectiveRunId && batcher && scenarioId),
                 refetchOnWindowFocus: false,
                 refetchOnReconnect: false,
+                refetchInterval: runTerminal ? false : 5000,
                 staleTime: 30_000,
                 gcTime: 5 * 60 * 1000,
                 // Enable structural sharing to prevent unnecessary re-renders when data hasn't changed
diff --git a/web/oss/src/components/EvalRunDetails/atoms/table/columns.ts b/web/oss/src/components/EvalRunDetails/atoms/table/columns.ts
index 824814ffe4..fd04f1fc8e 100644
--- a/web/oss/src/components/EvalRunDetails/atoms/table/columns.ts
+++ b/web/oss/src/components/EvalRunDetails/atoms/table/columns.ts
@@ -499,9 +499,19 @@ const tableColumnsBaseAtomFamily = atomFamily((runId: string | null) =>
             }
 
             const evaluator = column.evaluatorId ? evaluatorById.get(column.evaluatorId) : undefined
+            // Match the evaluator's metric definition by the canonical
+            // metric key (e.g. "attributes.ag.data.outputs.score") OR the
+            // bare value key (e.g. "score"). `extractMetrics` keys metrics
+            // by the output-schema property name — the bare key — so a
+            // canonical-key-only match misses and `metricType` falls back
+            // to "string", mis-typing the column (e.g. a boolean output).
             const metricKey = column.metricKey || column.valueKey
             const metricDefinition = evaluator?.metrics.find(
-                (metric) => metric.name === metricKey || metric.path === metricKey,
+                (metric) =>
+                    metric.name === metricKey ||
+                    metric.path === metricKey ||
+                    metric.name === column.valueKey ||
+                    metric.path === column.valueKey,
             )
             const metricType =
                 metricDefinition?.metricType || column.metricType || METRIC_TYPE_FALLBACK
diff --git a/web/oss/src/components/EvalRunDetails/components/Page.tsx b/web/oss/src/components/EvalRunDetails/components/Page.tsx
index ebad5df214..7369ee14cc 100644
--- a/web/oss/src/components/EvalRunDetails/components/Page.tsx
+++ b/web/oss/src/components/EvalRunDetails/components/Page.tsx
@@ -140,7 +140,7 @@ const EvalRunPreviewPage = ({runId, evaluationType, projectId = null}: EvalRunPr
             headerClassName="px-4 pt-2"
         >
             <div className="flex h-full min-h-0 flex-col gap-2 [&_.ant-tabs-content]:h-full [&_.ant-tabs-tabpane]:h-full">
-                <PreviewEvalRunMeta runId={runId} projectId={projectId} />
+                <PreviewEvalRunMeta runId={runId} projectId={projectId} activeView={activeView} />
                 <Tabs
                     className="flex-1 min-h-0 overflow-hidden"
                     activeKey={activeView}
diff --git a/web/oss/src/components/EvalRunDetails/components/PreviewEvalRunHeader.tsx b/web/oss/src/components/EvalRunDetails/components/PreviewEvalRunHeader.tsx
index 1653eb8f84..965029bb36 100644
--- a/web/oss/src/components/EvalRunDetails/components/PreviewEvalRunHeader.tsx
+++ b/web/oss/src/components/EvalRunDetails/components/PreviewEvalRunHeader.tsx
@@ -17,6 +17,7 @@ import {
     runTestsetIdsAtomFamily,
     runFlagsAtomFamily,
 } from "../atoms/runDerived"
+import ScenarioFilterBar from "../etl/ScenarioFilterBar"
 import {previewEvalTypeAtom} from "../state/evalType"
 
 import CompareRunsMenu from "./CompareRunsMenu"
@@ -137,10 +138,12 @@ const PreviewEvalRunMeta = ({
     runId,
     projectId,
     className,
+    activeView,
 }: {
     runId: string
     projectId?: string | null
     className?: string
+    activeView?: ActiveView
 }) => {
     const _invocationRefs = useAtomValue(useMemo(() => runInvocationRefsAtomFamily(runId), [runId]))
     const _testsetIds = useAtomValue(useMemo(() => runTestsetIdsAtomFamily(runId), [runId]))
@@ -220,6 +223,7 @@ const PreviewEvalRunMeta = ({
                         </Button>
                     </Tooltip>
                 ) : null}
+                {activeView === "scenarios" ? <ScenarioFilterBar runId={runId} /> : null}
                 <CompareRunsMenu runId={runId} />
             </div>
         </div>
diff --git a/web/oss/src/components/EvalRunDetails/etl/EtlColumnHeader.tsx b/web/oss/src/components/EvalRunDetails/etl/EtlColumnHeader.tsx
new file mode 100644
index 0000000000..d69faa29e0
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/EtlColumnHeader.tsx
@@ -0,0 +1,74 @@
+/**
+ * EtlColumnHeader
+ *
+ * Renders the nested-header label for a column group. The default
+ * `computeColumnGroup` resolver falls back to `Testset <slug>` /
+ * `Application <slug>` because it doesn't fetch the entity itself.
+ *
+ * This header overrides that fallback, following the same pattern as
+ * production's `StepGroupHeader`: subscribe to the entity reference atom
+ * by ID, surface the entity's name when available, and fall back to the
+ * slug-based group label otherwise. Evaluator, metrics, and other groups
+ * already carry `slugToTitle`-rendered labels, so no lookup is needed.
+ */
+
+import {useMemo} from "react"
+
+import type {ColumnGroup} from "@agenta/entities/evaluationRun/etl"
+import {Tooltip} from "antd"
+import {atom, useAtomValue} from "jotai"
+
+import {
+    applicationReferenceQueryAtomFamily,
+    testsetReferenceQueryAtomFamily,
+} from "../atoms/references"
+
+const emptyAtom = atom<{data: {name?: string; slug?: string} | null} | null>(null)
+
+interface EtlColumnHeaderProps {
+    group: ColumnGroup
+}
+
+const pickName = (entity: unknown): string | null => {
+    if (!entity || typeof entity !== "object") return null
+    const name = (entity as {name?: unknown}).name
+    return typeof name === "string" && name.length > 0 ? name : null
+}
+
+const EtlColumnHeader = ({group}: EtlColumnHeaderProps) => {
+    const refAtom = useMemo(() => {
+        if (group.kind === "testset") {
+            const id = (group.refs?.testset as {id?: string} | undefined)?.id
+            return id ? testsetReferenceQueryAtomFamily(id) : emptyAtom
+        }
+        if (group.kind === "application") {
+            const id = (group.refs?.application as {id?: string} | undefined)?.id
+            return id ? applicationReferenceQueryAtomFamily(id) : emptyAtom
+        }
+        return emptyAtom
+    }, [group])
+
+    const ref = useAtomValue(refAtom) as {data?: unknown} | null
+    const name = pickName(ref?.data ?? null)
+
+    const label = useMemo(() => {
+        switch (group.kind) {
+            case "testset":
+                return name ? `Testset ${name}` : group.label
+            case "application":
+                return name ? `Application ${name}` : group.label
+            default:
+                return group.label
+        }
+    }, [group.kind, group.label, name])
+
+    return (
+        <Tooltip title={label} placement="top">
+            <span className="block max-w-full overflow-hidden text-ellipsis whitespace-nowrap text-left">
+                {label}
+            </span>
+        </Tooltip>
+    )
+}
+
+export default EtlColumnHeader
diff --git a/web/oss/src/components/EvalRunDetails/etl/ScenarioFilterBar.tsx b/web/oss/src/components/EvalRunDetails/etl/ScenarioFilterBar.tsx
new file mode 100644
index 0000000000..bed7a7d6a7
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/ScenarioFilterBar.tsx
@@ -0,0 +1,412 @@
+/**
+ * ScenarioFilterBar — multi-condition AND/OR filter for the evaluation
+ * run scenarios table (decision D8).
+ *
+ * Self-contained: given only a `runId` it derives the run schema, the
+ * column value types, and the live scan status from atoms — so it can be
+ * dropped into the run header rather than sitting above the table.
+ *
+ * Follows the observability `Filters` pattern: a compact "Filters" button
+ * opens a popover holding the condition rows. Edits are staged in a draft
+ * and committed on "Apply".
+ */
+
+import {useMemo, useState} from "react"
+
+import {
+    buildFilterSchema,
+    type ColumnGroup,
+    type FilterOperator,
+    type FilterValueType,
+    type PredicateGroup,
+    type RowPredicate,
+    type RunSchema,
+} from "@agenta/entities/evaluationRun/etl"
+import {Button, Divider, Input, InputNumber, Popover, Select, Tooltip} from "antd"
+import {useAtom, useAtomValue} from "jotai"
+import {Filter as FilterIcon, Loader2, Plus, X} from "lucide-react"
+
+import {evaluationRunQueryAtomFamily, tableColumnsAtomFamily} from "../atoms/table"
+
+import {buildColumnValueTypeResolver} from "./columnValueTypes"
+import {
+    scenarioFilterAtomFamily,
+    isConditionComplete,
+    scenarioFilterStatusAtomFamily,
+} from "./scenarioFilterState"
+
+const OP_LABELS: Record<FilterOperator, string> = {
+    eq: "equals",
+    ne: "not equals",
+    lt: "<",
+    lte: "≤",
+    gt: ">",
+    gte: "≥",
+    in: "in",
+    nin: "not in",
+}
+
+// Operators offered in the UI. `in` / `nin` take a list of values (a tag
+// input); the rest take a single value.
+const UI_OPERATORS: FilterOperator[] = ["eq", "ne", "lt", "lte", "gt", "gte", "in", "nin"]
+
+/** True for operators whose value is a list rather than a scalar. */
+const isListOperator = (op: FilterOperator) => op === "in" || op === "nin"
+
+/**
+ * v1 column-kind allowlist for filtering. Only metric-related columns
+ * (evaluator outputs + metrics) are offered for now; testset (input) and
+ * application (output) columns are deliberately withheld.
+ *
+ * This is a UI allowlist only — the filter engine supports every kind.
+ * Flip a kind to `true` here to enable it; no other change is needed.
+ */
+const FILTERABLE_COLUMN_KINDS: Record<ColumnGroup["kind"], boolean> = {
+    evaluator: true,
+    metrics: true,
+    testset: false,
+    application: false,
+    other: false,
+}
+
+const EMPTY_FILTER: PredicateGroup = {op: "and", conditions: []}
+
+const encodeField = (f: {groupKind: string; groupSlug?: string | null; columnName: string}) =>
+    `${f.groupKind}|${f.groupSlug ?? ""}|${f.columnName}`
+
+const blankCondition = (): RowPredicate => ({
+    groupKind: "evaluator",
+    groupSlug: null,
+    columnName: "",
+    op: "eq",
+    value: "",
+})
+
+/** Keep antd Select dropdowns inside the popover so they don't close it. */
+const getWithinPopover = (trigger: HTMLElement) =>
+    (trigger.closest(".ant-popover") as HTMLElement | null) ?? document.body
+
+export interface ScenarioFilterBarProps {
+    runId: string
+}
+
+const ScenarioFilterBar = ({runId}: ScenarioFilterBarProps) => {
+    const [applied, setApplied] = useAtom(scenarioFilterAtomFamily(runId))
+    const {matchCount, scanning} = useAtomValue(scenarioFilterStatusAtomFamily(runId))
+    const [open, setOpen] = useState(false)
+    // Draft conditions edited inside the popover; committed on Apply.
+    const [draft, setDraft] = useState<PredicateGroup>(applied)
+
+    // Run schema (steps + mappings) — drives the filterable columns.
+    const runQuery = useAtomValue(useMemo(() => evaluationRunQueryAtomFamily(runId), [runId]))
+    const schema = useMemo<RunSchema | null>(() => {
+        const data = runQuery.data?.rawRun?.data
+        const steps = data?.steps
+        const mappings = data?.mappings
+        if (!Array.isArray(steps) || !Array.isArray(mappings)) return null
+        return {steps, mappings}
+    }, [runQuery.data])
+
+    // Column value types — sourced from the evaluator output schemas.
+    const columnResult = useAtomValue(useMemo(() => tableColumnsAtomFamily(runId), [runId]))
+    const resolveValueType = useMemo(
+        () => buildColumnValueTypeResolver(columnResult),
+        [columnResult],
+    )
+
+    const fields = useMemo(
+        () =>
+            buildFilterSchema(schema, {resolveValueType}).fields.filter(
+                (f) => FILTERABLE_COLUMN_KINDS[f.groupKind],
+            ),
+        [schema, resolveValueType],
+    )
+    const fieldByKey = useMemo(() => new Map(fields.map((f) => [encodeField(f), f])), [fields])
+    const fieldOptions = useMemo(
+        () =>
+            fields.map((f) => ({
+                value: encodeField(f),
+                label: `${f.groupLabel} · ${f.label}`,
+            })),
+        [fields],
+    )
+
+    // Run graph carries no filterable columns — hide the bar entirely.
+    if (fields.length === 0) return null
+
+    const appliedCount = applied.conditions.filter(isConditionComplete).length
+    const conditions = draft.conditions
+
+    const setConditions = (next: RowPredicate[]) => setDraft((d) => ({...d, conditions: next}))
+    const updateCondition = (index: number, partial: Partial<RowPredicate>) =>
+        setConditions(conditions.map((c, i) => (i === index ? {...c, ...partial} : c)))
+    const removeCondition = (index: number) =>
+        setConditions(conditions.filter((_, i) => i !== index))
+
+    const handleOpenChange = (next: boolean) => {
+        if (next) {
+            // Seed the draft from the applied filter (one blank row when empty).
+            setDraft(
+                applied.conditions.length > 0
+                    ? applied
+                    : {op: "and", conditions: [blankCondition()]},
+            )
+        }
+        setOpen(next)
+    }
+
+    const apply = () => {
+        setApplied({op: draft.op, conditions: draft.conditions.filter(isConditionComplete)})
+        setOpen(false)
+    }
+    const clearAll = () => {
+        setApplied(EMPTY_FILTER)
+        setDraft(EMPTY_FILTER)
+        setOpen(false)
+    }
+
+    const popoverContent = (
+        <div className="flex w-[560px] max-w-[calc(100vw-32px)] flex-col text-xs">
+            <div className="flex items-center justify-between gap-3 px-1 pb-2">
+                <span className="font-medium text-zinc-600">Filter scenarios</span>
+                {appliedCount > 0 ? (
+                    <span className="inline-flex items-center gap-1 font-normal text-zinc-500">
+                        {scanning ? <Loader2 size={12} className="animate-spin" /> : null}
+                        <span>
+                            {matchCount} {matchCount === 1 ? "match" : "matches"}
+                            {scanning ? " · scanning…" : ""}
+                        </span>
+                    </span>
+                ) : null}
+            </div>
+            <Divider className="!my-0 !mb-2" />
+
+            <div className="flex flex-col gap-1.5">
+                {conditions.map((condition, index) => {
+                    const fieldKey = condition.columnName ? encodeField(condition) : undefined
+                    const field = fieldKey ? fieldByKey.get(fieldKey) : undefined
+                    const valueType: FilterValueType = field?.valueType ?? "unknown"
+                    const ops = field
+                        ? UI_OPERATORS.filter((o) => field.operators.includes(o))
+                        : UI_OPERATORS
+
+                    return (
+                        <div key={index} className="flex items-center gap-1.5">
+                            {/*
+                             * Row-level AND/OR connector in a fixed-width
+                             * slot — so the Column select after it lines up
+                             * across every row regardless of whether the
+                             * connector is the "Where" label or the select.
+                             * The group has a single op (flat group — D8),
+                             * so every connector shows and toggles the same
+                             * value.
+                             */}
+                            <div className="flex w-20 shrink-0 items-center text-zinc-400">
+                                {index === 0 ? (
+                                    <span className="pl-2">Where</span>
+                                ) : (
+                                    <Select<"and" | "or">
+                                        variant="borderless"
+                                        className="w-full"
+                                        value={draft.op}
+                                        options={[
+                                            {label: "And", value: "and"},
+                                            {label: "Or", value: "or"},
+                                        ]}
+                                        getPopupContainer={getWithinPopover}
+                                        onChange={(op) => setDraft((d) => ({...d, op}))}
+                                    />
+                                )}
+                            </div>
+                            <Select<string>
+                                placeholder="Column"
+                                className="w-[200px] shrink-0"
+                                showSearch
+                                optionFilterProp="label"
+                                value={fieldKey}
+                                options={fieldOptions}
+                                getPopupContainer={getWithinPopover}
+                                onChange={(value) => {
+                                    const picked = fieldByKey.get(value)
+                                    if (!picked) return
+                                    const nextOps = UI_OPERATORS.filter((o) =>
+                                        picked.operators.includes(o),
+                                    )
+                                    updateCondition(index, {
+                                        groupKind: picked.groupKind,
+                                        groupSlug: picked.groupSlug,
+                                        columnName: picked.columnName,
+                                        op: nextOps[0] ?? "eq",
+                                        value: picked.valueType === "boolean" ? true : "",
+                                    })
+                                }}
+                            />
+                            <Select<FilterOperator>
+                                className="w-[110px] shrink-0"
+                                value={condition.op}
+                                disabled={!field}
+                                options={ops.map((o) => ({value: o, label: OP_LABELS[o]}))}
+                                getPopupContainer={getWithinPopover}
+                                onChange={(op) => {
+                                    // Switching between scalar and list
+                                    // operators changes the value shape —
+                                    // reset it so it stays valid.
+                                    const isList = isListOperator(op)
+                                    const wasList = Array.isArray(condition.value)
+                                    const value =
+                                        isList === wasList ? condition.value : isList ? [] : ""
+                                    updateCondition(index, {op, value})
+                                }}
+                            />
+                            <ConditionValueInput
+                                op={condition.op}
+                                valueType={valueType}
+                                value={condition.value}
+                                disabled={!field}
+                                onChange={(value) => updateCondition(index, {value})}
+                            />
+                            <Tooltip title="Remove">
+                                <Button
+                                    size="small"
+                                    type="text"
+                                    icon={<X size={14} />}
+                                    onClick={() => removeCondition(index)}
+                                />
+                            </Tooltip>
+                        </div>
+                    )
+                })}
+            </div>
+
+            <Button
+                size="small"
+                type="dashed"
+                icon={<Plus size={14} />}
+                className="mt-2 self-start"
+                onClick={() => setConditions([...conditions, blankCondition()])}
+            >
+                Add condition
+            </Button>
+
+            <Divider className="!my-2" />
+            <div className="flex items-center justify-between px-1">
+                <Button size="small" onClick={clearAll} disabled={appliedCount === 0}>
+                    Clear
+                </Button>
+                <div className="flex items-center gap-2">
+                    <Button size="small" onClick={() => setOpen(false)}>
+                        Cancel
+                    </Button>
+                    <Button size="small" type="primary" onClick={apply}>
+                        Apply
+                    </Button>
+                </div>
+            </div>
+        </div>
+    )
+
+    return (
+        <Popover
+            open={open}
+            onOpenChange={handleOpenChange}
+            trigger="click"
+            placement="bottomRight"
+            arrow={false}
+            content={popoverContent}
+        >
+            <Button
+                icon={<FilterIcon size={14} />}
+                aria-label="Filter scenarios"
+                className="inline-flex items-center gap-1"
+            >
+                <span
+                    className={`rounded-full px-1.5 text-[10px] font-medium ${
+                        appliedCount > 0 ? "bg-zinc-700 text-white" : "bg-zinc-100 text-zinc-500"
+                    }`}
+                >
+                    {appliedCount}
+                </span>
+            </Button>
+        </Popover>
+    )
+}
+
+/** Value input — shape depends on the operator and the field value type. */
+const ConditionValueInput = ({
+    op,
+    valueType,
+    value,
+    disabled,
+    onChange,
+}: {
+    op: FilterOperator
+    valueType: FilterValueType
+    value: unknown
+    disabled: boolean
+    onChange: (value: unknown) => void
+}) => {
+    // `in` / `nin` — a list of values entered as tags.
+    if (isListOperator(op)) {
+        const tags = Array.isArray(value) ? value.map((v) => String(v)) : []
+        return (
+            <Select
+                mode="tags"
+                className="w-full"
+                placeholder="Add values…"
+                disabled={disabled}
+                value={tags}
+                open={false}
+                suffixIcon={null}
+                tokenSeparators={[","]}
+                getPopupContainer={getWithinPopover}
+                onChange={(vals: string[]) => {
+                    const coerced =
+                        valueType === "number"
+                            ? vals.map(Number).filter((n) => !Number.isNaN(n))
+                            : vals
+                    onChange(coerced)
+                }}
+            />
+        )
+    }
+    if (valueType === "boolean") {
+        // antd Select option values must be string|number — encode the
+        // boolean as a string and decode on change.
+        return (
+            <Select<string>
+                className="w-full"
+                placeholder="Value"
+                disabled={disabled}
+                value={value === true ? "true" : value === false ? "false" : undefined}
+                options={[
+                    {label: "true", value: "true"},
+                    {label: "false", value: "false"},
+                ]}
+                getPopupContainer={getWithinPopover}
+                onChange={(v) => onChange(v === "true")}
+            />
+        )
+    }
+    if (valueType === "number") {
+        return (
+            <InputNumber
+                className="w-full"
+                placeholder="Value"
+                disabled={disabled}
+                value={typeof value === "number" ? value : null}
+                onChange={(v) => onChange(v ?? "")}
+            />
+        )
+    }
+    return (
+        <Input
+            className="w-full"
+            placeholder="Value"
+            disabled={disabled}
+            value={typeof value === "string" ? value : value == null ? "" : String(value)}
+            onChange={(e) => onChange(e.target.value)}
+        />
+    )
+}
+
+export default ScenarioFilterBar
diff --git a/web/oss/src/components/EvalRunDetails/etl/cellMaterializerContext.ts b/web/oss/src/components/EvalRunDetails/etl/cellMaterializerContext.ts
new file mode 100644
index 0000000000..fa2c3fd900
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/cellMaterializerContext.ts
@@ -0,0 +1,16 @@
+/**
+ * One-line context shared between the table page (provider) and the cells
+ * (consumers). Cells call `materializer.request(slice, req)` when their
+ * column's data is missing from cache; the materializer coalesces
+ * concurrent same-tick requests into one bulk fetch per slice.
+ *
+ * Kept in its own file to avoid a circular import between
+ * `EtlResolvedCell` and the table (the cell imports the context type,
+ * the page sets the context value).
+ */
+
+import {createContext} from "react"
+
+import type {CellMaterializer} from "./useCellMaterialization"
+
+export const CellMaterializerContext = createContext<CellMaterializer | null>(null)
diff --git a/web/oss/src/components/EvalRunDetails/etl/cells/EtlResolvedCell.tsx b/web/oss/src/components/EvalRunDetails/etl/cells/EtlResolvedCell.tsx
new file mode 100644
index 0000000000..d023c89162
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/cells/EtlResolvedCell.tsx
@@ -0,0 +1,377 @@
+/**
+ * EtlResolvedCell — a single cell that resolves its value from molecule caches.
+ *
+ * Each cell:
+ *   1. Subscribes to TanStack cache entries for its scenario via `useQuery`
+ *      with `enabled: false` — no network triggered from a cell render.
+ *      The hydrate / materializer paths populate those entries.
+ *   2. Assembles a HydratedScenarioRow from the four entity slices
+ *      (results / metrics / testcase / traces).
+ *   3. Runs `resolveMappings` against the hydrated row + run schema and
+ *      picks out *just this cell's* column value.
+ *
+ * Non-terminal rendering — the load-bearing difference from the PoC. The
+ * PoC fabricated `scenario.status = "success"` because it only ran
+ * against finished runs. Production scenarios can be pending / running /
+ * failed / partial, so the cell renders four distinct states:
+ *
+ *   value    — resolved, render it.
+ *   running  — scenario not terminal: an in-progress indicator. NEVER a
+ *              bare "—" (the user must tell "still computing" apart from
+ *              "computed nothing").
+ *   loading  — scenario terminal but this cell's slices not hydrated yet.
+ *   missing  — scenario terminal, slices hydrated, genuinely no value: "—".
+ */
+
+import {useContext, useEffect, useMemo} from "react"
+
+import {evaluationResultMolecule, evaluationMetricMolecule} from "@agenta/entities/evaluationRun"
+import {
+    resolveMappings,
+    unwrapStatsForCompare,
+    type RunSchema,
+    type ResolvedColumn,
+    type ColumnGroup,
+    type HydratedScenarioRow,
+    type HydratableScenario,
+} from "@agenta/entities/evaluationRun/etl"
+import {useQuery, useQueryClient} from "@tanstack/react-query"
+import {Tag} from "antd"
+import clsx from "clsx"
+import {useAtomValue} from "jotai"
+
+import {isTerminalStatus} from "../../atoms/compare"
+import {scenarioRowHeightAtom, type ScenarioRowHeight} from "../../state/rowHeight"
+import {CellMaterializerContext} from "../cellMaterializerContext"
+import {hydrationVersionAtom} from "../useHydrateScenarios"
+
+type ColumnKind = ColumnGroup["kind"]
+
+const MAX_LINES_BY_HEIGHT: Record<ScenarioRowHeight, number> = {
+    small: 4,
+    medium: 9,
+    large: 18,
+}
+
+/** Entity slices each column kind reads from. */
+const SLICES_BY_KIND: Record<ColumnKind, ("results" | "metrics" | "testcases" | "traces")[]> = {
+    testset: ["results", "testcases"],
+    application: ["results", "traces"],
+    evaluator: ["results", "metrics"],
+    metrics: ["metrics"],
+    other: ["results"],
+}
+
+export interface EtlResolvedCellProps {
+    projectId: string
+    runId: string
+    scenarioId: string
+    /** Real scenario status — drives the running / loading / missing split. */
+    scenarioStatus: string
+    /** Column the cell should render — group kind + slug + column name. */
+    columnKind: ColumnKind
+    columnGroupSlug: string | null
+    columnName: string
+    /** Run schema (steps + mappings). */
+    schema: RunSchema | null
+}
+
+const EtlResolvedCell = ({
+    projectId,
+    runId,
+    scenarioId,
+    scenarioStatus,
+    columnKind,
+    columnGroupSlug,
+    columnName,
+    schema,
+}: EtlResolvedCellProps) => {
+    const queryClient = useQueryClient()
+    const materializer = useContext(CellMaterializerContext)
+    // Bumped after each hydrate / materialize batch so cells re-render and
+    // pick up late-arriving testcase / trace cache writes.
+    const hydrationVersion = useAtomValue(hydrationVersionAtom)
+    const rowHeight = useAtomValue(scenarioRowHeightAtom)
+    const maxLines = MAX_LINES_BY_HEIGHT[rowHeight]
+
+    // Pure subscriptions — `enabled: false` + no-op queryFn means a cell
+    // render never triggers network. The hydrate / materializer paths are
+    // the only writers; cells just observe.
+    const resultsQ = useQuery<unknown>({
+        queryKey: ["evaluation-results", projectId, runId, scenarioId],
+        queryFn: () => null,
+        enabled: false,
+        staleTime: Infinity,
+    })
+    const metricsQ = useQuery<unknown>({
+        queryKey: ["evaluation-metrics", projectId, runId, scenarioId],
+        queryFn: () => null,
+        enabled: false,
+        staleTime: Infinity,
+    })
+
+    const resultsFetched = resultsQ.data !== undefined
+    const metricsFetched = metricsQ.data !== undefined
+
+    // Resolve this cell's column from the molecule caches.
+    const resolved = useMemo<ResolvedColumn | null>(() => {
+        if (!schema) return null
+
+        const results = (resultsQ.data ??
+            evaluationResultMolecule.get.byScenario({projectId, runId, scenarioId}) ??
+            []) as HydratedScenarioRow["results"]
+        const metrics = (metricsQ.data ??
+            evaluationMetricMolecule.get.byScenario({projectId, runId, scenarioId}) ??
+            []) as HydratedScenarioRow["metrics"]
+
+        const testcaseIdCandidates = results
+            .map((r) => r.testcase_id)
+            .filter((v): v is string => typeof v === "string" && v.length > 0)
+        const testcaseId = testcaseIdCandidates[0] ?? null
+        const testcase = testcaseId
+            ? (queryClient.getQueryData<HydratedScenarioRow["testcase"]>([
+                  "testcase",
+                  projectId,
+                  testcaseId,
+              ]) ?? null)
+            : null
+
+        const traces: Record<string, unknown> = {}
+        for (const r of results) {
+            if (typeof r.trace_id === "string" && r.trace_id) {
+                const cached = queryClient.getQueryData<unknown>([
+                    "trace-entity",
+                    projectId,
+                    r.trace_id,
+                ])
+                if (cached != null) traces[r.trace_id] = cached
+            }
+        }
+
+        const hydrated: HydratedScenarioRow<HydratableScenario> = {
+            scenario: {id: scenarioId, status: scenarioStatus} as HydratableScenario,
+            results,
+            metrics,
+            testcase,
+            traces,
+        }
+
+        const cols = resolveMappings(hydrated, {
+            steps: schema.steps,
+            mappings: schema.mappings,
+        })
+
+        return (
+            cols.find((c) => {
+                if (c.name !== columnName) return false
+                if (c.group.kind !== columnKind) return false
+                if (columnGroupSlug != null && c.group.slug !== columnGroupSlug) return false
+                return true
+            }) ?? null
+        )
+    }, [
+        projectId,
+        runId,
+        scenarioId,
+        scenarioStatus,
+        columnKind,
+        columnGroupSlug,
+        columnName,
+        schema,
+        resultsQ.data,
+        metricsQ.data,
+        hydrationVersion,
+        queryClient,
+    ])
+
+    // Cell-side lazy materialization. Ask the page-level materializer to
+    // fill cache slices this cell needs; the materializer coalesces
+    // concurrent same-tick requests into one bulk fetch per (slice, run).
+    useEffect(() => {
+        if (!materializer || !projectId || !runId || !scenarioId) return
+        for (const slice of SLICES_BY_KIND[columnKind]) {
+            if (slice === "results") {
+                if (!evaluationResultMolecule.get.byScenario({projectId, runId, scenarioId})) {
+                    materializer.request("results", {scenarioId, runId})
+                }
+            } else if (slice === "metrics") {
+                if (!evaluationMetricMolecule.get.byScenario({projectId, runId, scenarioId})) {
+                    materializer.request("metrics", {scenarioId, runId})
+                }
+            } else if (slice === "testcases") {
+                const cachedResults = evaluationResultMolecule.get.byScenario({
+                    projectId,
+                    runId,
+                    scenarioId,
+                })
+                const testcaseId =
+                    cachedResults?.find((r) => typeof r.testcase_id === "string" && r.testcase_id)
+                        ?.testcase_id ?? null
+                if (testcaseId) {
+                    const cached = queryClient.getQueryData(["testcase", projectId, testcaseId])
+                    if (cached == null) materializer.request("testcases", {testcaseId})
+                }
+            } else if (slice === "traces") {
+                const cachedResults = evaluationResultMolecule.get.byScenario({
+                    projectId,
+                    runId,
+                    scenarioId,
+                })
+                const traceId =
+                    cachedResults?.find((r) => typeof r.trace_id === "string" && r.trace_id)
+                        ?.trace_id ?? null
+                if (traceId) {
+                    const cached = queryClient.getQueryData(["trace-entity", projectId, traceId])
+                    if (cached == null) materializer.request("traces", {traceId})
+                }
+            }
+        }
+    }, [materializer, projectId, runId, scenarioId, columnKind, hydrationVersion, queryClient])
+
+    const hasValue = !!resolved && resolved.source !== "missing"
+
+    // Is a slice this cell needs still in flight? Distinguishes
+    // "slice-not-hydrated" (skeleton) from "genuinely missing" ("—") for
+    // a terminal scenario. A slice that the materializer marked failed is
+    // NOT counted as loading — otherwise a permanently rate-limited fetch
+    // would leave the cell on an infinite skeleton.
+    const sliceStillLoading = useMemo(() => {
+        for (const slice of SLICES_BY_KIND[columnKind]) {
+            if (slice === "results") {
+                if (!resultsFetched && !materializer?.hasFailed("results", {scenarioId, runId})) {
+                    return true
+                }
+            } else if (slice === "metrics") {
+                if (!metricsFetched && !materializer?.hasFailed("metrics", {scenarioId, runId})) {
+                    return true
+                }
+            } else if (slice === "testcases") {
+                // Needs results first — covered by the results check above.
+                if (!resultsFetched) continue
+                const cachedResults = evaluationResultMolecule.get.byScenario({
+                    projectId,
+                    runId,
+                    scenarioId,
+                })
+                const testcaseId =
+                    cachedResults?.find((r) => typeof r.testcase_id === "string" && r.testcase_id)
+                        ?.testcase_id ?? null
+                if (!testcaseId) continue
+                const cached = queryClient.getQueryData(["testcase", projectId, testcaseId])
+                if (cached === undefined && !materializer?.hasFailed("testcases", {testcaseId})) {
+                    return true
+                }
+            } else if (slice === "traces") {
+                if (!resultsFetched) continue
+                const cachedResults = evaluationResultMolecule.get.byScenario({
+                    projectId,
+                    runId,
+                    scenarioId,
+                })
+                const traceId =
+                    cachedResults?.find((r) => typeof r.trace_id === "string" && r.trace_id)
+                        ?.trace_id ?? null
+                if (!traceId) continue
+                const cached = queryClient.getQueryData(["trace-entity", projectId, traceId])
+                if (cached === undefined && !materializer?.hasFailed("traces", {traceId})) {
+                    return true
+                }
+            }
+        }
+        return false
+    }, [
+        columnKind,
+        projectId,
+        runId,
+        scenarioId,
+        resultsFetched,
+        metricsFetched,
+        materializer,
+        queryClient,
+        hydrationVersion,
+    ])
+
+    const isTerminal = isTerminalStatus(scenarioStatus)
+
+    let content: React.ReactNode
+    if (hasValue) {
+        content = (
+            <div
+                className="scenario-table-text w-full"
+                style={{
+                    display: "-webkit-box",
+                    WebkitBoxOrient: "vertical",
+                    WebkitLineClamp: maxLines,
+                    overflow: "hidden",
+                    wordBreak: "break-word",
+                }}
+            >
+                {formatValue(unwrapStatsForCompare(resolved!.value))}
+            </div>
+        )
+    } else if (!isTerminal) {
+        // Scenario not finished — in-progress, NOT a missing value.
+        content = <RunningIndicator status={scenarioStatus} />
+    } else if (sliceStillLoading) {
+        // Terminal scenario, this cell's slices not hydrated yet.
+        content = <div className="h-3 w-2/3 rounded bg-neutral-200 animate-pulse" />
+    } else {
+        // Terminal, hydrated, genuinely no value.
+        content = <span className="scenario-table-text scenario-table-placeholder">—</span>
+    }
+
+    return <div className="scenario-table-cell">{content}</div>
+}
+
+/**
+ * In-progress indicator for a non-terminal scenario's cell. A colored,
+ * pulsing dot + label — deliberately distinct from both the grey skeleton
+ * bar (data loading) and the "—" placeholder (no value).
+ */
+const RunningIndicator = ({status}: {status: string}) => {
+    const s = status.toLowerCase()
+    const dotClass = s === "running" ? "bg-blue-500" : "bg-amber-400"
+    const label = s === "running" ? "Running" : s === "queued" ? "Queued" : "Pending"
+    return (
+        <span className="inline-flex items-center gap-1.5 text-xs text-neutral-400">
+            <span className={clsx("h-1.5 w-1.5 rounded-full animate-pulse", dotClass)} />
+            {label}
+        </span>
+    )
+}
+
+/**
+ * Fixed-height placeholder for skeleton (not-yet-keyed) rows. Occupies
+ * the same `scenario-table-cell` box as a populated cell so the table
+ * doesn't jump when a skeleton row resolves to real data.
+ */
+export const EtlSkeletonCell = () => (
+    <div className="scenario-table-cell">
+        <div className="h-3 w-2/3 rounded bg-neutral-200 animate-pulse" />
+    </div>
+)
+
+function formatValue(v: unknown): React.ReactNode {
+    if (v === null || v === undefined) {
+        return <span className="scenario-table-text scenario-table-placeholder">—</span>
+    }
+    if (typeof v === "boolean") {
+        return <Tag color={v ? "green" : "red"}>{String(v)}</Tag>
+    }
+    if (typeof v === "number") {
+        return Number.isInteger(v) ? String(v) : v.toFixed(3)
+    }
+    // Cap at 800 chars as a DOM-size guard; the cell's CSS line-clamp does
+    // the visible truncation.
+    if (typeof v === "string") {
+        return v.length > 800 ? v.slice(0, 800) : v
+    }
+    try {
+        const json = JSON.stringify(v)
+        return json.length > 800 ? json.slice(0, 800) : json
+    } catch {
+        return String(v)
+    }
+}
+
+export default EtlResolvedCell
diff --git a/web/oss/src/components/EvalRunDetails/etl/columnValueTypes.ts b/web/oss/src/components/EvalRunDetails/etl/columnValueTypes.ts
new file mode 100644
index 0000000000..6195b9c8d1
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/columnValueTypes.ts
@@ -0,0 +1,76 @@
+/**
+ * columnValueTypes — resolves a filterable column's value type from the
+ * evaluator output schema.
+ *
+ * The run graph does not carry column value types. The authoritative
+ * source is the evaluator's JSON output schema: `extractMetrics`
+ * (entities) reads each output property's `schema.type` into
+ * `MetricColumnDefinition.metricType`, and the backend-metadata column
+ * builder copies that onto every annotation column as
+ * `EvaluationTableColumn.metricType`.
+ *
+ * This module turns that `metricType` into the `FilterValueType` the
+ * filter bar uses, so a boolean evaluator output (e.g. an LLM-judge
+ * `success` field) is offered only equality operators and a true/false
+ * input — never the numeric comparators.
+ */
+
+import type {FilterValueType} from "@agenta/entities/evaluationRun/etl"
+
+import type {EvaluationTableColumnsResult} from "../atoms/table"
+
+/** Map a JSON-schema-derived `metricType` to a filter value type. */
+function metricTypeToValueType(metricType: string | undefined): FilterValueType | undefined {
+    if (!metricType) return undefined
+    switch (metricType.toLowerCase()) {
+        case "boolean":
+        case "bool":
+            return "boolean"
+        case "number":
+        case "integer":
+        case "float":
+            return "number"
+        case "string":
+            return "string"
+        default:
+            // array / object / anything else — no safe operator set.
+            return "unknown"
+    }
+}
+
+export interface ColumnValueTypeField {
+    groupKind: string
+    groupSlug: string | null
+    columnName: string
+}
+
+/**
+ * Build a `resolveValueType` callback for `buildFilterSchema`, sourced
+ * from the evaluator output schemas (via `columnResult` column
+ * `metricType`). Returns `undefined` for a column with no known type so
+ * `buildFilterSchema` falls back to its schema-only default.
+ */
+export function buildColumnValueTypeResolver(
+    columnResult: EvaluationTableColumnsResult | undefined,
+): (field: ColumnValueTypeField) => FilterValueType | undefined {
+    // Keyed by `<evaluatorSlug>::<columnName>` (disambiguates two
+    // evaluators with same-named outputs) and by column name alone.
+    const bySlugName = new Map<string, string>()
+    const byName = new Map<string, string>()
+
+    for (const col of columnResult?.columns ?? []) {
+        const metricType = col.metricType
+        const name = col.label
+        if (!metricType || typeof name !== "string" || !name) continue
+        byName.set(name, metricType)
+        if (col.evaluatorSlug) bySlugName.set(`${col.evaluatorSlug}::${name}`, metricType)
+    }
+
+    return (field) => {
+        const metricType =
+            (field.groupSlug
+                ? bySlugName.get(`${field.groupSlug}::${field.columnName}`)
+                : undefined) ?? byName.get(field.columnName)
+        return metricTypeToValueType(metricType)
+    }
+}
diff --git a/web/oss/src/components/EvalRunDetails/etl/scenarioFilterState.ts b/web/oss/src/components/EvalRunDetails/etl/scenarioFilterState.ts
new file mode 100644
index 0000000000..a3438fefce
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/scenarioFilterState.ts
@@ -0,0 +1,60 @@
+/**
+ * Scenario table filter state — the active multi-predicate filter
+ * (decision D8: flat AND/OR), one per run.
+ *
+ * The atom holds the *raw* filter the filter bar edits — it may contain
+ * half-built conditions (a column picked but no value yet). Evaluation
+ * uses `toEffectiveFilter`, which drops the incomplete ones, so a
+ * partially-typed condition never filters every row out.
+ */
+
+import type {PredicateGroup, RowPredicate} from "@agenta/entities/evaluationRun/etl"
+import {atom} from "jotai"
+import {atomFamily} from "jotai/utils"
+
+const EMPTY_FILTER: PredicateGroup = {op: "and", conditions: []}
+
+/** Per-run active scenario filter (raw — may contain half-built conditions). */
+export const scenarioFilterAtomFamily = atomFamily((_runId: string) =>
+    atom<PredicateGroup>(EMPTY_FILTER),
+)
+
+/** A condition is complete once it has a column and a defined, non-empty value. */
+export const isConditionComplete = (c: RowPredicate): boolean => {
+    if (!c.columnName) return false
+    // `in` / `nin` values are arrays — complete once non-empty.
+    if (Array.isArray(c.value)) return c.value.length > 0
+    return c.value !== undefined && c.value !== ""
+}
+
+/**
+ * The filter actually evaluated — half-built conditions dropped. Returns
+ * the same `op` with only complete conditions.
+ */
+export const toEffectiveFilter = (group: PredicateGroup): PredicateGroup => ({
+    op: group.op,
+    conditions: group.conditions.filter(isConditionComplete),
+})
+
+/** True when at least one complete condition is set. */
+export const isScenarioFilterActive = (group: PredicateGroup): boolean =>
+    group.conditions.some(isConditionComplete)
+
+/** Live scan status — written by the scenarios table, read by the filter bar. */
+export interface ScenarioFilterStatus {
+    /** Confirmed matches found so far. */
+    matchCount: number
+    /** True while the filter scan is actively working. */
+    scanning: boolean
+}
+
+const EMPTY_STATUS: ScenarioFilterStatus = {matchCount: 0, scanning: false}
+
+/**
+ * Per-run filter scan status. The scenarios table runs the scan and
+ * writes this; the filter bar — which lives in the run header, a separate
+ * part of the component tree — reads it for its match-count indicator.
+ */
+export const scenarioFilterStatusAtomFamily = atomFamily((_runId: string) =>
+    atom<ScenarioFilterStatus>(EMPTY_STATUS),
+)
diff --git a/web/oss/src/components/EvalRunDetails/etl/useCellMaterialization.ts b/web/oss/src/components/EvalRunDetails/etl/useCellMaterialization.ts
new file mode 100644
index 0000000000..d1b12e2d1b
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/useCellMaterialization.ts
@@ -0,0 +1,268 @@
+/**
+ * useCellMaterialization — lazy, batched, run-aware cell-side prefetch.
+ *
+ * The page-level `useHydrateScenarios` only fetches entity slices the
+ * active predicate touches (Phase 2). In Phase 1 (no predicate) it fetches
+ * nothing, so every visible cell materializes itself.
+ *
+ * If 30 visible cells each call `molecule.actions.prefetchByScenarioIds(
+ * [scenarioId])` independently, the backend gets 30 round trips. To avoid
+ * that, this hook coalesces same-tick requests:
+ *
+ *   1. Cell asks for `(slice, {scenarioId, runId})` from a post-render effect.
+ *   2. Request is appended to a per-slice queue held in a ref.
+ *   3. After a microtask flush, the hook drains every per-slice queue and
+ *      issues ONE bulk prefetch per (slice, runId) with all requested IDs.
+ *   4. Cells re-render via `hydrationVersionAtom` once the writes land.
+ *
+ * Run-aware: results / metrics caches are run-scoped, so the queue is
+ * grouped by `runId` and one prefetch is issued per run. This is what
+ * lets comparison rows (which carry a different `runId` than the base
+ * run) hydrate correctly.
+ */
+
+import {useEffect, useRef} from "react"
+
+import {evaluationResultMolecule, evaluationMetricMolecule} from "@agenta/entities/evaluationRun"
+import type {EntitySlice} from "@agenta/entities/evaluationRun/etl"
+import {testcaseMolecule} from "@agenta/entities/testcase"
+import {traceSpanMolecule} from "@agenta/entities/trace"
+import {getDefaultStore, useSetAtom} from "jotai"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import {hydrationVersionAtom} from "./useHydrateScenarios"
+
+interface MaterializeRequest {
+    /** scenarioId — required for results / metrics. */
+    scenarioId?: string
+    /** runId — required for results / metrics (run-scoped caches). */
+    runId?: string
+    /** testcase_id — required for testcases. */
+    testcaseId?: string
+    /** trace_id — required for traces. */
+    traceId?: string
+}
+
+interface BatchState {
+    /** Queued requests per slice. Drained on next microtask. */
+    queues: Record<EntitySlice, MaterializeRequest[]>
+    /** Per-slice "currently fetching" tracking keys so we don't double-fire. */
+    inflightKeys: Record<EntitySlice, Set<string>>
+    /**
+     * Per-slice "tried and got nothing back" tracking keys. The most
+     * common cause is HTTP 429 rate-limiting — the molecule's prefetch
+     * swallows the error and returns empty, leaving the cache empty.
+     * Without this set, the cell rerenders forever in a tight retry loop.
+     * Kept until the project / run scope resets it or the user reloads.
+     */
+    failedKeys: Record<EntitySlice, Set<string>>
+    /** True if a drain is already scheduled this tick. */
+    scheduled: boolean
+}
+
+const initialBatchState = (): BatchState => ({
+    queues: {results: [], metrics: [], testcases: [], traces: []},
+    inflightKeys: {
+        results: new Set(),
+        metrics: new Set(),
+        testcases: new Set(),
+        traces: new Set(),
+    },
+    failedKeys: {
+        results: new Set(),
+        metrics: new Set(),
+        testcases: new Set(),
+        traces: new Set(),
+    },
+    scheduled: false,
+})
+
+/**
+ * Stable tracking key for a (slice, request) pair. Results / metrics are
+ * run-scoped so the key includes `runId`; testcases / traces are keyed by
+ * their own id. Returns null when the request lacks the fields the slice
+ * needs.
+ */
+const trackingKey = (slice: EntitySlice, req: MaterializeRequest): string | null => {
+    if (slice === "results" || slice === "metrics") {
+        if (!req.runId || !req.scenarioId) return null
+        return `${req.runId}::${req.scenarioId}`
+    }
+    if (slice === "testcases") return req.testcaseId ?? null
+    if (slice === "traces") return req.traceId ?? null
+    return null
+}
+
+interface UseCellMaterializationArgs {
+    projectId: string | null
+    /** Page (base) run id — used only to reset state on scope change. */
+    runId: string | null
+}
+
+export interface CellMaterializer {
+    /**
+     * Request materialization of (slice, request). The hook coalesces
+     * concurrent requests on the same microtask into one bulk fetch per
+     * (slice, runId). Safe to call repeatedly from a cell's effect;
+     * duplicates are deduped.
+     */
+    request: (slice: EntitySlice, req: MaterializeRequest) => void
+    /**
+     * True when a prior fetch for (slice, request) settled without
+     * populating the cache — most often a 429. Lets a cell stop showing a
+     * skeleton for a slice that will not arrive without a reset or reload.
+     */
+    hasFailed: (slice: EntitySlice, req: MaterializeRequest) => boolean
+}
+
+const groupScenariosByRun = (reqs: MaterializeRequest[]): Map<string, string[]> => {
+    const out = new Map<string, string[]>()
+    for (const r of reqs) {
+        if (!r.runId || !r.scenarioId) continue
+        const arr = out.get(r.runId) ?? []
+        if (!arr.includes(r.scenarioId)) arr.push(r.scenarioId)
+        out.set(r.runId, arr)
+    }
+    return out
+}
+
+const dedupField = (reqs: MaterializeRequest[], field: "testcaseId" | "traceId"): string[] => {
+    const out = new Set<string>()
+    for (const r of reqs) {
+        const v = r[field]
+        if (typeof v === "string" && v) out.add(v)
+    }
+    return Array.from(out)
+}
+
+export const useCellMaterialization = ({
+    projectId,
+    runId,
+}: UseCellMaterializationArgs): CellMaterializer => {
+    const stateRef = useRef<BatchState>(initialBatchState())
+    const bumpHydrationVersion = useSetAtom(hydrationVersionAtom)
+
+    useEffect(() => {
+        // Reset on scope change.
+        stateRef.current = initialBatchState()
+    }, [projectId, runId])
+
+    const drain = async () => {
+        const state = stateRef.current
+        state.scheduled = false
+        if (!projectId) return
+
+        // Snapshot + reset the queues — new requests can queue while
+        // we're fetching, those trigger their own drain.
+        const queues = state.queues
+        state.queues = {results: [], metrics: [], testcases: [], traces: []}
+
+        const resultsByRun = groupScenariosByRun(queues.results)
+        const metricsByRun = groupScenariosByRun(queues.metrics)
+        const testcaseIds = dedupField(queues.testcases, "testcaseId")
+        const traceIds = dedupField(queues.traces, "traceId")
+
+        // Mark in-flight before starting fetch so subsequent ticks dedupe.
+        for (const [run, ids] of resultsByRun) {
+            for (const id of ids) state.inflightKeys.results.add(`${run}::${id}`)
+        }
+        for (const [run, ids] of metricsByRun) {
+            for (const id of ids) state.inflightKeys.metrics.add(`${run}::${id}`)
+        }
+        for (const id of testcaseIds) state.inflightKeys.testcases.add(id)
+        for (const id of traceIds) state.inflightKeys.traces.add(id)
+
+        const qc = getDefaultStore().get(queryClientAtom)
+
+        // After a fetch settles, for each requested id check whether the
+        // cache now holds data. If not, the fetch failed silently (most
+        // often a 429) — mark it failed so request() skips it on future
+        // renders, avoiding an infinite request → 429 → retry loop.
+        const markRunFailures = (
+            slice: "results" | "metrics",
+            run: string,
+            scenarioIds: string[],
+        ) => {
+            if (!qc) return
+            const prefix = slice === "results" ? "evaluation-results" : "evaluation-metrics"
+            for (const id of scenarioIds) {
+                const tk = `${run}::${id}`
+                state.inflightKeys[slice].delete(tk)
+                const cached = qc.getQueryData([prefix, projectId, run, id])
+                if (cached === undefined) state.failedKeys[slice].add(tk)
+            }
+        }
+        const markIdFailures = (slice: "testcases" | "traces", ids: string[]) => {
+            if (!qc) return
+            const prefix = slice === "testcases" ? "testcase" : "trace-entity"
+            for (const id of ids) {
+                state.inflightKeys[slice].delete(id)
+                const cached = qc.getQueryData([prefix, projectId, id])
+                if (cached === undefined) state.failedKeys[slice].add(id)
+            }
+        }
+
+        const tasks: Promise<unknown>[] = []
+        for (const [run, scenarioIds] of resultsByRun) {
+            tasks.push(
+                evaluationResultMolecule.actions
+                    .prefetchByScenarioIds({projectId, runId: run, scenarioIds})
+                    .finally(() => markRunFailures("results", run, scenarioIds)),
+            )
+        }
+        for (const [run, scenarioIds] of metricsByRun) {
+            tasks.push(
+                evaluationMetricMolecule.actions
+                    .prefetchByScenarioIds({projectId, runId: run, scenarioIds})
+                    .finally(() => markRunFailures("metrics", run, scenarioIds)),
+            )
+        }
+        if (testcaseIds.length > 0) {
+            tasks.push(
+                testcaseMolecule.actions
+                    .prefetchByIds({projectId, testcaseIds})
+                    .finally(() => markIdFailures("testcases", testcaseIds)),
+            )
+        }
+        if (traceIds.length > 0) {
+            tasks.push(
+                traceSpanMolecule.actions
+                    .prefetchByIds({projectId, traceIds})
+                    .finally(() => markIdFailures("traces", traceIds)),
+            )
+        }
+
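+        // Wait for every slice to settle, then bump even if one rejected, so
+        // cells re-render to pick up cached data or observe `hasFailed`.
+        // (Promise.all would reject early and skip the bump.)
+        const settled = await Promise.allSettled(tasks)
+        const rejected = settled.filter((s) => s.status === "rejected")
+        if (rejected.length > 0) console.warn("[useCellMaterialization] batch failed:", rejected)
+        if (tasks.length > 0) bumpHydrationVersion((v) => v + 1)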
+    }
+
+    const request: CellMaterializer["request"] = (slice, req) => {
+        const state = stateRef.current
+        const tk = trackingKey(slice, req)
+        if (!tk) return
+        // Skip if a previous drain for this key failed (most often a 429).
+        if (state.failedKeys[slice].has(tk)) return
+        // Skip if already being fetched by an earlier batch.
+        if (state.inflightKeys[slice].has(tk)) return
+        // Skip if a sibling cell already queued the same key this tick.
+        if (state.queues[slice].some((r) => trackingKey(slice, r) === tk)) return
+        state.queues[slice].push(req)
+        if (!state.scheduled) {
+            state.scheduled = true
+            queueMicrotask(drain)
+        }
+    }
+
+    const hasFailed: CellMaterializer["hasFailed"] = (slice, req) => {
+        const tk = trackingKey(slice, req)
+        if (!tk) return false
+        return stateRef.current.failedKeys[slice].has(tk)
+    }
+
+    return {request, hasFailed}
+}
diff --git a/web/oss/src/components/EvalRunDetails/etl/useEtlColumns.tsx b/web/oss/src/components/EvalRunDetails/etl/useEtlColumns.tsx
new file mode 100644
index 0000000000..ecaf63bfcc
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/useEtlColumns.tsx
@@ -0,0 +1,121 @@
+/**
+ * useEtlColumns
+ *
+ * Derives the scenario table's **schema columns** (testset / application /
+ * evaluator(s) / other; "metrics" is skipped, see below) from a run's
+ * schema (steps + mappings) via `groupRunColumns`, and adapts them into
+ * nested-header IVT columns whose leaf cells mount `EtlResolvedCell`.
+ *
+ * This replaces the backend-metadata column path (`usePreviewColumns`)
+ * for the *rendered* schema columns. The meta columns (index / status,
+ * timestamp, action) and the column-visibility trigger stay on the
+ * production path — they are not schema-derived. The two are stitched
+ * together in `Table.tsx`.
+ *
+ * "other"-kind columns are kept (the PoC dropped them) so the visible
+ * column set matches the backend-metadata path.
+ */
+
+import {useMemo} from "react"
+
+import {groupRunColumns, type ColumnGroup, type RunSchema} from "@agenta/entities/evaluationRun/etl"
+import {Tooltip} from "antd"
+import type {ColumnsType} from "antd/es/table"
+
+import type {PreviewTableRow} from "../atoms/tableRows"
+
+import EtlResolvedCell, {EtlSkeletonCell} from "./cells/EtlResolvedCell"
+import EtlColumnHeader from "./EtlColumnHeader"
+
+const WIDTH_BY_KIND: Record<ColumnGroup["kind"], number> = {
+    testset: 220,
+    application: 400,
+    evaluator: 180,
+    metrics: 140,
+    other: 180,
+}
+
+export interface UseEtlColumnsArgs {
+    projectId: string | null
+    runId: string | null
+    schema: RunSchema | null
+}
+
+/**
+ * Schema columns for the scenario table, as nested-header IVT columns.
+ * Empty until the run schema is available.
+ */
+export const useEtlColumns = ({
+    projectId,
+    runId,
+    schema,
+}: UseEtlColumnsArgs): ColumnsType<PreviewTableRow> => {
+    return useMemo<ColumnsType<PreviewTableRow>>(() => {
+        if (!schema || !projectId || !runId) return []
+
+        // "metrics"-kind columns are intentionally skipped here. The
+        // scenario table's "Metrics" group is the *static* invocation
+        // metrics (cost / duration / tokens) injected by the
+        // backend-metadata column path — not run-mapping-derived — so that
+        // group is kept on the production path in `Table.tsx` and rendered
+        // by the existing metric cell. Emitting an ETL metrics group too
+        // would duplicate it.
+        const grouped = groupRunColumns(schema.steps, schema.mappings).filter(
+            (g) => g.group.kind !== "metrics",
+        )
+
+        return grouped.map((g) => {
+            const children = g.columns.map((leaf) => {
+                const key = `${g.group.key}::${leaf.name}`
+                return {
+                    key,
+                    columnVisibilityLabel: leaf.name,
+                    title: (
+                        <Tooltip title={leaf.name} placement="top">
+                            <span className="block max-w-full overflow-hidden text-ellipsis whitespace-nowrap text-left">
+                                {leaf.name}
+                            </span>
+                        </Tooltip>
+                    ),
+                    width: WIDTH_BY_KIND[leaf.kind],
+                    minWidth: WIDTH_BY_KIND[leaf.kind],
+                    ellipsis: true,
+                    align: "left" as const,
+                    render: (_: unknown, record: PreviewTableRow) => {
+                        // antd's virtual table can briefly call a cell
+                        // render with an out-of-range `undefined` record
+                        // while the (filtered) dataSource is shrinking —
+                        // render nothing for those phantom rows.
+                        if (record == null) return null
+                        // Skeleton / not-yet-keyed rows (incl. comparison
+                        // placeholders) render a fixed-height placeholder.
+                        if (record.__isSkeleton || !record.scenarioId) {
+                            return <EtlSkeletonCell />
+                        }
+                        return (
+                            <EtlResolvedCell
+                                projectId={projectId}
+                                // Comparison rows carry their own runId.
+                                runId={record.runId ?? runId}
+                                scenarioId={record.scenarioId}
+                                scenarioStatus={record.status}
+                                columnKind={leaf.kind}
+                                columnGroupSlug={leaf.groupSlug}
+                                columnName={leaf.name}
+                                schema={schema}
+                            />
+                        )
+                    },
+                }
+            })
+
+            return {
+                key: g.group.key,
+                columnVisibilityLabel: g.group.label,
+                title: <EtlColumnHeader group={g.group} />,
+                align: "left" as const,
+                children,
+            }
+        })
+    }, [projectId, runId, schema])
+}
diff --git a/web/oss/src/components/EvalRunDetails/etl/useHydrateScenarios.ts b/web/oss/src/components/EvalRunDetails/etl/useHydrateScenarios.ts
new file mode 100644
index 0000000000..560ea14eec
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/useHydrateScenarios.ts
@@ -0,0 +1,266 @@
+/**
+ * useHydrateScenarios
+ *
+ * Watches the scenario rows the table has loaded and triggers a bulk
+ * hydrate pass per *new* page — bulk requests per page, all entities
+ * populated together.
+ *
+ * Flow per newly-seen scenario set (active slices only):
+ *   1. evaluationResultMolecule.actions.prefetchByScenarioIds  → results
+ *   2. evaluationMetricMolecule.actions.prefetchByScenarioIds  → metrics
+ *      (1 and 2 run in parallel)
+ *   3. derive testcase_ids + trace_ids from results
+ *   4. prefetchTestcasesByIds(...)                             → testcases
+ *   5. prefetchTracesByIds(...), in parallel with 4            → traces
+ *
+ * Cache writes go through the molecules' `setQueryData` paths, so cells
+ * subscribing via `useQuery({queryKey: cacheKey, enabled: false})` see
+ * the data the moment it lands.
+ *
+ * Phase 1 note: with no active predicate and `sliceMode === "auto"` the
+ * page-level hydrate is intentionally a no-op — cells materialize their
+ * own (visible-only) data via `useCellMaterialization`. The hook is wired
+ * now so Phase 2 filtering can drive predicate-aware page hydration
+ * without a structural change.
+ */
+
+import {useEffect, useMemo, useRef, useState} from "react"
+
+import {evaluationResultMolecule, evaluationMetricMolecule} from "@agenta/entities/evaluationRun"
+import {
+    predicateToEntitySlices,
+    type EntitySlice,
+    type PredicateGroup,
+    type RowPredicate,
+    type RunSchema,
+} from "@agenta/entities/evaluationRun/etl"
+import {prefetchTestcasesByIds} from "@agenta/entities/testcase"
+import {prefetchTracesByIds} from "@agenta/entities/trace"
+import {atom, useSetAtom} from "jotai"
+
+const ALL_SLICES: EntitySlice[] = ["results", "metrics", "testcases", "traces"]
+
+/**
+ * Minimal row shape this hook reads — identity + skeleton flag. Kept
+ * structural (fields `unknown`) so it accepts both `PreviewTableRow[]` and
+ * the loosely-typed `InfiniteTableRowBase[]` the IVT pagination hook
+ * returns, without coupling to either.
+ */
+export interface HydratableRowRef {
+    scenarioId?: unknown
+    __isSkeleton?: unknown
+}
+
+/**
+ * Hydration-version atom — bumped each time a hydrate / materialize batch
+ * completes. Cells subscribe to it so they re-render and pick up
+ * late-arriving testcase / trace cache writes (whose IDs aren't known
+ * until results land). Cheap: number atom, single React tick per batch.
+ */
+export const hydrationVersionAtom = atom(0)
+
+export interface HydrationProgress {
+    /** Unique scenario IDs hydrated since mount or the last scope reset. */
+    hydratedScenarios: number
+    /** Pages observed (one bulk hydrate pass per page). */
+    pagesHydrated: number
+    /** Which entity slices the next page load will fetch. */
+    activeSlices: EntitySlice[]
+    /** Last error from any prefetch call, or null. */
+    lastError: string | null
+    /** True while a hydrate pass is mid-flight. */
+    isHydrating: boolean
+}
+
+const INITIAL_PROGRESS: HydrationProgress = {
+    hydratedScenarios: 0,
+    pagesHydrated: 0,
+    activeSlices: ALL_SLICES,
+    lastError: null,
+    isHydrating: false,
+}
+
+/**
+ * Slice-fetch strategy for the page-level hydrate.
+ *
+ * - "auto" (default): fetch only what's needed right now. With an active
+ *   predicate that's the predicate's slice set; with no predicate that's
+ *   zero slices — cells materialize their own data on first render
+ *   (visible-only, virtualization-aware).
+ * - "all": always fetch all 4 slices. For workflows that need every
+ *   column populated up-front (exports, bulk actions).
+ */
+export type SliceFetchMode = "auto" | "all"
+
+export interface UseHydrateScenariosArgs {
+    projectId: string | null
+    runId: string | null
+    rows: readonly HydratableRowRef[]
+    /** Run schema — maps an active predicate's column to entity slices. */
+    schema?: RunSchema | null
+    /**
+     * Active filter — a single predicate, a predicate array, or a flat
+     * AND/OR `PredicateGroup` (Phase 2). When present, page-level hydrate
+     * fetches the entity slices the filter needs so it can be evaluated.
+     */
+    predicate?: RowPredicate | RowPredicate[] | PredicateGroup | null
+    /** Hydrate strategy — see `SliceFetchMode`. Default "auto". */
+    sliceMode?: SliceFetchMode
+}
+
+export const useHydrateScenarios = ({
+    projectId,
+    runId,
+    rows,
+    schema = null,
+    predicate = null,
+    sliceMode = "auto",
+}: UseHydrateScenariosArgs): HydrationProgress => {
+    const [progress, setProgress] = useState<HydrationProgress>(INITIAL_PROGRESS)
+    const hydratedScenarioIdsRef = useRef<Set<string>>(new Set())
+    const inflightRef = useRef<Promise<void> | null>(null)
+    const bumpHydrationVersion = useSetAtom(hydrationVersionAtom)
+
+    // Compute the slice set this hydrate pass should fetch.
+    const activeSlices = useMemo<EntitySlice[]>(() => {
+        if (sliceMode === "all") return ALL_SLICES
+        const result = predicateToEntitySlices(schema, predicate)
+        if (result.fallbackToAll) return ALL_SLICES
+        if (result.slices.size === 0) {
+            // No predicate active in auto mode → page-level hydrate is a
+            // no-op. Cells materialize what they need on first render.
+            return []
+        }
+        // Always include results when testcases or traces are needed —
+        // those IDs live on result rows.
+        const slices = new Set<EntitySlice>(result.slices)
+        if (slices.has("testcases") || slices.has("traces")) slices.add("results")
+        return ALL_SLICES.filter((s) => slices.has(s))
+    }, [schema, predicate, sliceMode])
+
+    const activeSlicesKey = activeSlices.join(",")
+    useEffect(() => {
+        hydratedScenarioIdsRef.current = new Set()
+        setProgress({...INITIAL_PROGRESS, activeSlices})
+    }, [projectId, runId, activeSlicesKey])
+
+    useEffect(() => {
+        if (!projectId || !runId) return
+        // Only consider materialized (non-skeleton) scenarios with real IDs.
+        const candidateIds = rows
+            .filter(
+                (r) =>
+                    !r.__isSkeleton && typeof r.scenarioId === "string" && r.scenarioId.length > 0,
+            )
+            .map((r) => r.scenarioId as string)
+
+        const seen = hydratedScenarioIdsRef.current
+        const newIds = candidateIds.filter((id) => !seen.has(id))
+        if (newIds.length === 0) return
+
+        const slicesToFetch = new Set(activeSlices)
+        // Pure on-demand mode: nothing to fetch at the page level. Cells
+        // handle their own materialization via useCellMaterialization.
+        if (slicesToFetch.size === 0) {
+            for (const id of newIds) seen.add(id)
+            setProgress((p) => ({
+                ...p,
+                hydratedScenarios: p.hydratedScenarios + newIds.length,
+                pagesHydrated: p.pagesHydrated + 1,
+                isHydrating: false,
+                lastError: null,
+            }))
+            return
+        }
+
+        // Mark optimistically so a re-render mid-flight doesn't queue
+        // duplicate prefetch calls for the same scenarios.
+        for (const id of newIds) seen.add(id)
+
+        const emptyOutcome = {cacheHits: 0, cacheMisses: 0, fetchMs: 0}
+
+        const hydrateBatch = async () => {
+            setProgress((p) => ({...p, isHydrating: true, lastError: null}))
+            try {
+                // Stage 1 — results + metrics (parallel).
+                const [resultsOutcome] = await Promise.all([
+                    slicesToFetch.has("results")
+                        ? evaluationResultMolecule.actions.prefetchByScenarioIds({
+                              projectId,
+                              runId,
+                              scenarioIds: newIds,
+                          })
+                        : Promise.resolve({
+                              ...emptyOutcome,
+                              results: [],
+                              byScenarioId: new Map<string, never>(),
+                          }),
+                    slicesToFetch.has("metrics")
+                        ? evaluationMetricMolecule.actions.prefetchByScenarioIds({
+                              projectId,
+                              runId,
+                              scenarioIds: newIds,
+                          })
+                        : Promise.resolve(null),
+                ])
+
+                // Stage 2 — derive testcase_ids + trace_ids from results.
+                const testcaseIds = new Set<string>()
+                if (slicesToFetch.has("testcases")) {
+                    for (const result of resultsOutcome.results) {
+                        if (typeof result.testcase_id === "string" && result.testcase_id) {
+                            testcaseIds.add(result.testcase_id)
+                        }
+                    }
+                }
+
+                const traceIds = new Set<string>()
+                if (slicesToFetch.has("traces")) {
+                    for (const result of resultsOutcome.results) {
+                        if (typeof result.trace_id === "string" && result.trace_id) {
+                            traceIds.add(result.trace_id)
+                        }
+                    }
+                }
+
+                await Promise.all([
+                    testcaseIds.size > 0
+                        ? prefetchTestcasesByIds({
+                              projectId,
+                              testcaseIds: Array.from(testcaseIds),
+                          })
+                        : Promise.resolve(emptyOutcome),
+                    traceIds.size > 0
+                        ? prefetchTracesByIds({
+                              projectId,
+                              traceIds: Array.from(traceIds),
+                          })
+                        : Promise.resolve(emptyOutcome),
+                ])
+
+                setProgress((p) => ({
+                    hydratedScenarios: p.hydratedScenarios + newIds.length,
+                    pagesHydrated: p.pagesHydrated + 1,
+                    activeSlices,
+                    lastError: null,
+                    isHydrating: false,
+                }))
+                bumpHydrationVersion((v) => v + 1)
+            } catch (e) {
+                // On failure, un-mark so the next effect run (rows change) can retry.
+                for (const id of newIds) seen.delete(id)
+                setProgress((p) => ({
+                    ...p,
+                    lastError: e instanceof Error ? e.message : String(e),
+                    isHydrating: false,
+                }))
+            }
+        }
+
+        // Serialize hydrate calls — multiple page-loads in quick
+        // succession get queued, not parallel.
+        inflightRef.current = (inflightRef.current ?? Promise.resolve()).then(hydrateBatch)
+    }, [projectId, runId, rows, activeSlicesKey, activeSlices, bumpHydrationVersion])
+
+    return progress
+}
diff --git a/web/oss/src/components/EvalRunDetails/etl/useScenarioFilter.ts b/web/oss/src/components/EvalRunDetails/etl/useScenarioFilter.ts
new file mode 100644
index 0000000000..7458d9fc2a
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/useScenarioFilter.ts
@@ -0,0 +1,197 @@
+/**
+ * useScenarioFilter — applies the active multi-predicate filter (D8) to
+ * the scenario rows.
+ *
+ * While a filter is active the result holds **confirmed matches only** —
+ * a row appears once it is hydrated AND satisfies the filter. The list
+ * therefore only ever grows during a scan: rows materialize as their data
+ * arrives, and a row is never shown and then dropped, so the table does
+ * not flicker. Unhydrated rows and skeleton placeholders are withheld
+ * until they confirm (or are ruled out).
+ *
+ * Because a strict filter can reduce the visible row count below the
+ * viewport height, the IVT's scroll-triggered `loadMore` may never fire.
+ * While a filter is active this hook drives `loadNextPage` itself until
+ * enough confirmed matches accumulate or the dataset is exhausted.
+ */
+
+import {useEffect, useMemo} from "react"
+
+import {evaluationResultMolecule, evaluationMetricMolecule} from "@agenta/entities/evaluationRun"
+import {
+    evaluateRowFilter,
+    resolveMappings,
+    type HydratedScenarioRow,
+    type PredicateGroup,
+    type ResolvedColumn,
+    type RunSchema,
+} from "@agenta/entities/evaluationRun/etl"
+import {useQueryClient, type QueryClient} from "@tanstack/react-query"
+import {useAtomValue} from "jotai"
+
+import {
+    scenarioFilterAtomFamily,
+    isScenarioFilterActive,
+    toEffectiveFilter,
+} from "./scenarioFilterState"
+import {hydrationVersionAtom} from "./useHydrateScenarios"
+
+/** Enough confirmed matches to fill a typical viewport before the loop stops. */
+const VIEWPORT_FILL_TARGET = 30
+
+interface FilterableRow {
+    scenarioId?: unknown
+    __isSkeleton?: unknown
+}
+
+/**
+ * Build a row's resolved columns from the molecule caches. Returns `null`
+ * when nothing is hydrated yet for the scenario (results + metrics both
+ * empty) — the caller treats that as "not known yet".
+ */
+function resolveScenarioColumnsFromCache(
+    queryClient: QueryClient,
+    projectId: string,
+    runId: string,
+    scenarioId: string,
+    schema: RunSchema,
+): ResolvedColumn[] | null {
+    const results = (evaluationResultMolecule.get.byScenario({projectId, runId, scenarioId}) ??
+        []) as HydratedScenarioRow["results"]
+    const metrics = (evaluationMetricMolecule.get.byScenario({projectId, runId, scenarioId}) ??
+        []) as HydratedScenarioRow["metrics"]
+    if (results.length === 0 && metrics.length === 0) return null
+
+    const testcaseId =
+        results.find((r) => typeof r.testcase_id === "string" && r.testcase_id)?.testcase_id ?? null
+    const testcase = testcaseId
+        ? (queryClient.getQueryData<HydratedScenarioRow["testcase"]>([
+              "testcase",
+              projectId,
+              testcaseId,
+          ]) ?? null)
+        : null
+
+    const traces: Record<string, unknown> = {}
+    for (const r of results) {
+        if (typeof r.trace_id === "string" && r.trace_id) {
+            const cached = queryClient.getQueryData<unknown>([
+                "trace-entity",
+                projectId,
+                r.trace_id,
+            ])
+            if (cached != null) traces[r.trace_id] = cached
+        }
+    }
+
+    return resolveMappings(
+        {
+            scenario: {id: scenarioId, status: "success"}, // hard-coded: real status not passed in
+            results,
+            metrics,
+            testcase,
+            traces,
+        },
+        {steps: schema.steps, mappings: schema.mappings},
+    )
+}
+
+export interface UseScenarioFilterArgs<TRow extends FilterableRow> {
+    projectId: string | null
+    runId: string | null
+    schema: RunSchema | null
+    /** The base (main-run) rows, pre-merge. */
+    baseRows: readonly TRow[]
+    loadNextPage: () => void
+    hasMore: boolean
+    isFetching: boolean
+}
+
+export interface UseScenarioFilterResult<TRow extends FilterableRow> {
+    /** Raw filter (may contain half-built conditions) — for the filter bar. */
+    rawFilter: PredicateGroup
+    /** Filter actually evaluated — half-built conditions dropped. */
+    effectiveFilter: PredicateGroup
+    /** True when at least one complete condition is set. */
+    active: boolean
+    /** Base rows after the filter — unfiltered when no filter is active. */
+    filteredBaseRows: TRow[]
+    /** Count of confirmed matches (hydrated AND matching); 0 when inactive. */
+    confirmedMatchCount: number
+    /**
+     * True while the viewport-fill loop still intends to load more pages
+     * (a filter is active, the target match count is not yet reached, and
+     * the dataset has more pages). Goes false once enough matches are
+     * found — even if the dataset has more pages — so the UI can stop
+     * showing "scanning" when the loop is actually idle.
+     */
+    isFilling: boolean
+}
+
+export function useScenarioFilter<TRow extends FilterableRow>({
+    projectId,
+    runId,
+    schema,
+    baseRows,
+    loadNextPage,
+    hasMore,
+    isFetching,
+}: UseScenarioFilterArgs<TRow>): UseScenarioFilterResult<TRow> {
+    const queryClient = useQueryClient()
+    const rawFilter = useAtomValue(scenarioFilterAtomFamily(runId ?? "__none__"))
+    // Re-evaluate when the molecule caches change.
+    const hydrationVersion = useAtomValue(hydrationVersionAtom)
+
+    const effectiveFilter = useMemo(() => toEffectiveFilter(rawFilter), [rawFilter])
+    const active = isScenarioFilterActive(rawFilter)
+
+    // Confirmed matches only — a row is included once it is hydrated AND
+    // satisfies the filter. Skeleton / unhydrated rows are withheld, so
+    // the list only grows as the scan progresses (no show-then-drop).
+    const filteredBaseRows = useMemo(() => {
+        if (!active || !schema || !projectId || !runId) return baseRows as TRow[]
+        return (baseRows as TRow[]).filter((r) => {
+            const scenarioId = typeof r.scenarioId === "string" ? r.scenarioId : null
+            // Skeleton / not-yet-keyed rows can't be evaluated — withhold.
+            if (r.__isSkeleton || !scenarioId) return false
+            const cols = resolveScenarioColumnsFromCache(
+                queryClient,
+                projectId,
+                runId,
+                scenarioId,
+                schema,
+            )
+            // Not hydrated yet — withhold until its data arrives.
+            if (!cols) return false
+            return evaluateRowFilter(effectiveFilter, cols)
+        })
+    }, [baseRows, active, schema, projectId, runId, effectiveFilter, hydrationVersion, queryClient])
+
+    // With a filter active, every row in `filteredBaseRows` is a confirmed
+    // match — so the count is just its length.
+    const confirmedMatchCount = active ? filteredBaseRows.length : 0
+
+    // The viewport-fill loop still wants more pages — i.e. the autonomous
+    // scan is genuinely in progress.
+    const isFilling = active && hasMore && confirmedMatchCount < VIEWPORT_FILL_TARGET
+
+    // Viewport-fill loop — a strict filter may keep the visible row count
+    // below the viewport, so IVT's scroll-triggered loadMore never fires.
+    // Drive it ourselves until enough confirmed matches accumulate or the
+    // dataset is exhausted.
+    useEffect(() => {
+        if (!active) return
+        if (!hasMore || isFetching) return
+        if (confirmedMatchCount >= VIEWPORT_FILL_TARGET) return
+        loadNextPage()
+    }, [active, hasMore, isFetching, confirmedMatchCount, loadNextPage])
+
+    return {
+        rawFilter,
+        effectiveFilter,
+        active,
+        filteredBaseRows,
+        confirmedMatchCount,
+        isFilling,
+    }
+}
diff --git a/web/oss/src/components/EvalRunDetails/etl/useScenarioLiveUpdates.ts b/web/oss/src/components/EvalRunDetails/etl/useScenarioLiveUpdates.ts
new file mode 100644
index 0000000000..f2234fb96b
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/useScenarioLiveUpdates.ts
@@ -0,0 +1,181 @@
+/**
+ * useScenarioLiveUpdates — keeps the ETL scenarios table fresh while a
+ * run is still executing (T6).
+ *
+ * The ETL table resolves every cell from molecule caches that are
+ * populated *once*: `useHydrateScenarios` / `useCellMaterialization`
+ * fetch a scenario's results + metrics on first sight and never again.
+ * That is correct for a finished run, but a run in progress mutates —
+ *
+ *   - a scenario's `status` flips pending → running → success
+ *   - its results / metrics only appear once it completes
+ *
+ * Without a refresh loop a scenario that finishes after the table loaded
+ * keeps a stale empty molecule cache (an empty `[]` was cached while it
+ * was running) and a stale `running` row status, so its cells show the
+ * "Running" indicator forever.
+ *
+ * While the run is non-terminal this hook, on an interval:
+ *
+ *   1. Refetches the loaded scenario *pages* — refreshes each row's
+ *      `status` so a completed scenario's cells leave the running state.
+ *   2. Evicts + re-prefetches the results / metrics molecule caches for
+ *      every scenario that is still running, or that finished since the
+ *      last tick — replacing the stale empty cache with real values.
+ *      (`prefetchByScenarioIds` is cache-aware and would otherwise skip
+ *      an already-cached scenario, so the evict is required.)
+ *   3. Bumps `hydrationVersionAtom` so cells re-render, re-resolve, and
+ *      (via `useCellMaterialization`) pick up freshly-derivable testcase
+ *      / trace slices.
+ *
+ * One final pass runs when the run reaches a terminal status, then the
+ * loop stops.
+ */
+
+import {useCallback, useEffect, useRef} from "react"
+
+import {evaluationResultMolecule, evaluationMetricMolecule} from "@agenta/entities/evaluationRun"
+import {useSetAtom, useStore} from "jotai"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import {isTerminalStatus} from "../atoms/compare"
+import type {PreviewTableRow} from "../atoms/tableRows"
+import {evaluationPreviewTableStore} from "../evaluationPreviewTableStore"
+
+import {hydrationVersionAtom} from "./useHydrateScenarios"
+
+/** Refresh cadence — mirrors the run-status poll in `evaluationRunQueryAtomFamily`. */
+const LIVE_REFRESH_INTERVAL_MS = 5000
+
+export interface UseScenarioLiveUpdatesArgs {
+    projectId: string | null
+    runId: string | null
+    /**
+     * Latest run-level status. Drives whether the live loop runs — it
+     * stops once the run is terminal. `null` / `undefined` (status not
+     * loaded yet) is treated as not-live so the loop doesn't start on a blank run.
+     */
+    runStatus: string | null | undefined
+    /** Page size — addresses the base run's page-query atoms. */
+    pageSize: number
+}
+
+export const useScenarioLiveUpdates = ({
+    projectId,
+    runId,
+    runStatus,
+    pageSize,
+}: UseScenarioLiveUpdatesArgs): void => {
+    const store = useStore()
+    const bumpHydrationVersion = useSetAtom(hydrationVersionAtom)
+    /** Scenario ids observed non-terminal on the previous tick. */
+    const lastNonTerminalRef = useRef<Set<string>>(new Set())
+    /** Guard against overlapping ticks if a refresh runs long. */
+    const inflightRef = useRef(false)
+
+    const tick = useCallback(async () => {
+        if (!projectId || !runId) return
+        if (inflightRef.current) return
+        inflightRef.current = true
+        try {
+            // 1. Refetch the loaded scenario pages → refresh row statuses.
+            const qc = store.get(queryClientAtom)
+            if (qc) {
+                await qc.invalidateQueries({
+                    queryKey: [evaluationPreviewTableStore.key, runId],
+                    exact: false,
+                })
+            }
+
+            // 2. Read the fresh rows straight from the store — bypasses the
+            //    React-state lag so a just-flipped status is seen now.
+            const rows = store.get(
+                evaluationPreviewTableStore.atoms.combinedRowsAtomFamily({
+                    scopeId: runId,
+                    pageSize,
+                }),
+            ) as PreviewTableRow[]
+
+            const loadedIds = new Set<string>()
+            const currentNonTerminal = new Set<string>()
+            for (const row of rows) {
+                if (row.__isSkeleton) continue
+                const sid = row.scenarioId
+                if (typeof sid !== "string" || !sid) continue
+                loadedIds.add(sid)
+                if (!isTerminalStatus(row.status)) currentNonTerminal.add(sid)
+            }
+
+            // 3. Refresh set: scenarios still running, plus those that
+            //    finished since the last tick (their molecule cache still
+            //    holds the empty `[]` written while they were running).
+            const refreshIds = new Set<string>(currentNonTerminal)
+            for (const sid of lastNonTerminalRef.current) {
+                if (!currentNonTerminal.has(sid) && loadedIds.has(sid)) {
+                    refreshIds.add(sid)
+                }
+            }
+            lastNonTerminalRef.current = currentNonTerminal
+
+            // 4. Evict + re-prefetch the molecule caches for those scenarios.
+            if (refreshIds.size > 0) {
+                const scenarioIds = Array.from(refreshIds)
+                evaluationResultMolecule.actions.evictByScenarioIds({
+                    projectId,
+                    runId,
+                    scenarioIds,
+                })
+                evaluationMetricMolecule.actions.evictByScenarioIds({
+                    projectId,
+                    runId,
+                    scenarioIds,
+                })
+                await Promise.all([
+                    evaluationResultMolecule.actions.prefetchByScenarioIds({
+                        projectId,
+                        runId,
+                        scenarioIds,
+                    }),
+                    evaluationMetricMolecule.actions.prefetchByScenarioIds({
+                        projectId,
+                        runId,
+                        scenarioIds,
+                    }),
+                ])
+            }
+
+            // 5. Re-render cells so they re-resolve and (via the cell
+            //    materializer) re-derive testcase / trace slices.
+            bumpHydrationVersion((v) => v + 1)
+        } catch {
+            // Transient failure: the next interval tick retries (the final pass does not).
+        } finally {
+            inflightRef.current = false
+        }
+    }, [projectId, runId, pageSize, store, bumpHydrationVersion])
+
+    const isLive = !!projectId && !!runId && runStatus != null && !isTerminalStatus(runStatus)
+
+    // Interval refresh while the run is non-terminal.
+    useEffect(() => {
+        if (!isLive) return
+        const id = setInterval(() => void tick(), LIVE_REFRESH_INTERVAL_MS)
+        return () => clearInterval(id)
+    }, [isLive, tick])
+
+    // One final pass when the run finishes — catches the last batch of
+    // scenarios that completed between the previous tick and the run
+    // reaching a terminal status.
+    const wasLiveRef = useRef(false)
+    const flushedRef = useRef(false)
+    useEffect(() => {
+        if (isLive) {
+            wasLiveRef.current = true
+            flushedRef.current = false
+            return
+        }
+        if (!wasLiveRef.current || flushedRef.current) return
+        flushedRef.current = true
+        void tick()
+    }, [isLive, tick])
+}
diff --git a/web/oss/src/components/EvalRunDetails/etl/useScopeChangeEviction.ts b/web/oss/src/components/EvalRunDetails/etl/useScopeChangeEviction.ts
new file mode 100644
index 0000000000..70fcb520d5
--- /dev/null
+++ b/web/oss/src/components/EvalRunDetails/etl/useScopeChangeEviction.ts
@@ -0,0 +1,52 @@
+/**
+ * useScopeChangeEviction
+ *
+ * Evicts the molecule caches the ETL hydrate path wrote when the
+ * (projectId, runId) scope changes or the table unmounts.
+ *
+ * Triggers:
+ *   - on dependency change (the *previous* scope's data gets evicted)
+ *   - on unmount (component going away — release everything we wrote)
+ *
+ * What it evicts:
+ *   - results + metrics → molecule.actions.evictByRunId (scoped to runId)
+ *   - testcase + trace-entity + span → clearCacheByPrefix (run-agnostic)
+ *
+ * Atom families are intentionally NOT cleared: other views (focus drawer,
+ * observability tab) may subscribe to the same trace atoms. A
+ * `family.clear()` would yank their subscriptions too.
+ */
+
+import {useEffect, useRef} from "react"
+
+import {evaluationResultMolecule, evaluationMetricMolecule} from "@agenta/entities/evaluationRun"
+import {clearCacheByPrefix} from "@agenta/entities/evaluationRun/etl"
+
+export interface UseScopeChangeEvictionArgs {
+    projectId: string | null
+    runId: string | null
+}
+
+export const useScopeChangeEviction = ({projectId, runId}: UseScopeChangeEvictionArgs): void => {
+    // Track the previous (projectId, runId) so the cleanup function evicts
+    // the *outgoing* scope, not the incoming one.
+    const prevRef = useRef<{projectId: string | null; runId: string | null}>({
+        projectId: null,
+        runId: null,
+    })
+
+    useEffect(() => {
+        prevRef.current = {projectId, runId}
+        return () => {
+            const {projectId: pp, runId: rr} = prevRef.current
+            if (!pp || !rr) return
+            try {
+                evaluationResultMolecule.actions.evictByRunId({projectId: pp, runId: rr})
+                evaluationMetricMolecule.actions.evictByRunId({projectId: pp, runId: rr})
+                clearCacheByPrefix(["testcase", "trace-entity", "span"])
+            } catch {
+                // QueryClient may already be torn down on app close — swallow.
+            }
+        }
+    }, [projectId, runId])
+}
diff --git a/web/oss/src/components/Filters/Filters.tsx b/web/oss/src/components/Filters/Filters.tsx
index b81cab3576..447b21a9a1 100644
--- a/web/oss/src/components/Filters/Filters.tsx
+++ b/web/oss/src/components/Filters/Filters.tsx
@@ -716,6 +716,11 @@ const Filters: React.FC<Props> = ({
 
     const isApplyDisabled = rowValidations.some(({isValid}) => !isValid)
 
+    const nonPermanentFilterCount = useMemo(
+        () => filter.filter((f) => !f.isPermanent).length,
+        [filter],
+    )
+
     const onDeleteFilter = (index: number) =>
         setFilter((prev) => prev.filter((_, idx) => idx !== index))
     const clearFilter = () => {
@@ -1665,6 +1670,7 @@ const Filters: React.FC<Props> = ({
                                             )}
 
                                             {!item.isPermanent &&
+                                                nonPermanentFilterCount > 1 &&
                                                 !(
                                                     isAnnotationFieldSelected &&
                                                     (isEvaluatorActive || isFeedbackActive)
diff --git a/web/oss/src/components/SharedActions/AddActionsDropdown/index.tsx b/web/oss/src/components/SharedActions/AddActionsDropdown/index.tsx
index 83beed09ab..c2d33103fb 100644
--- a/web/oss/src/components/SharedActions/AddActionsDropdown/index.tsx
+++ b/web/oss/src/components/SharedActions/AddActionsDropdown/index.tsx
@@ -1,6 +1,6 @@
 import {useCallback, useMemo, useRef, useState} from "react"
 
-import {DatabaseIcon, ListChecks, PlusIcon} from "@phosphor-icons/react"
+import {DatabaseIcon, ListChecks, ListPlus, PlusIcon} from "@phosphor-icons/react"
 import {Button, Dropdown, type MenuProps} from "antd"
 import clsx from "clsx"
 import dynamic from "next/dynamic"
@@ -12,6 +12,9 @@ const AddToQueuePopover = dynamic(
     {ssr: false},
 )
 
+/** Which queue action the (single) picker popover is serving. */
+type QueueScope = "selected" | "all-matching"
+
 const AddActionsDropdown = ({
     size = "middle",
     disabled = false,
@@ -22,9 +25,11 @@ const AddActionsDropdown = ({
     testsetAction,
     additionalActions,
     queueAction,
+    queueAllMatchingAction,
 }: AddActionsDropdownProps) => {
     const [dropdownOpen, setDropdownOpen] = useState(false)
-    const [queuePopoverOpen, setQueuePopoverOpen] = useState(false)
+    // `null` = picker closed; otherwise which queue action opened it.
+    const [queueScope, setQueueScope] = useState<QueueScope | null>(null)
     const suppressNextDropdownOpenRef = useRef(false)
 
     const actions = useMemo(
@@ -35,8 +40,11 @@ const AddActionsDropdown = ({
                     disabled: action.disabled ?? false,
                 })),
                 queueAction ? {disabled: queueAction.disabled ?? false} : null,
+                queueAllMatchingAction
+                    ? {disabled: queueAllMatchingAction.disabled ?? false}
+                    : null,
             ].filter(Boolean),
-        [additionalActions, queueAction, testsetAction],
+        [additionalActions, queueAction, queueAllMatchingAction, testsetAction],
     )
 
     const isButtonDisabled =
@@ -45,11 +53,11 @@ const AddActionsDropdown = ({
     const handleButtonClick = useCallback(() => {
         if (isButtonDisabled) return
 
-        if (queuePopoverOpen) {
+        if (queueScope !== null) {
             suppressNextDropdownOpenRef.current = true
-            setQueuePopoverOpen(false)
+            setQueueScope(null)
         }
-    }, [isButtonDisabled, queuePopoverOpen])
+    }, [isButtonDisabled, queueScope])
 
     const handleDropdownOpenChange = useCallback((nextOpen: boolean) => {
         if (suppressNextDropdownOpenRef.current && nextOpen) {
@@ -60,7 +68,7 @@ const AddActionsDropdown = ({
         setDropdownOpen(nextOpen)
     }, [])
 
-    const handleMenuClick = useCallback<MenuProps["onClick"]>(
+    const handleMenuClick = useCallback<NonNullable<MenuProps["onClick"]>>(
         ({key}) => {
             setDropdownOpen(false)
 
@@ -76,12 +84,27 @@ const AddActionsDropdown = ({
             }
 
             if (key === "queue" && queueAction && !(queueAction.disabled ?? false)) {
-                requestAnimationFrame(() => {
-                    setQueuePopoverOpen(true)
-                })
+                // rAF: let the dropdown finish closing before the popover opens.
+                requestAnimationFrame(() => setQueueScope("selected"))
+                return
+            }
+
+            if (
+                key === "queue-all" &&
+                queueAllMatchingAction &&
+                !(queueAllMatchingAction.disabled ?? false)
+            ) {
+                const {onBeforeOpen} = queueAllMatchingAction
+                // `onBeforeOpen` can gate the picker (e.g. a no-filter confirm).
+                void (async () => {
+                    const proceed = (await onBeforeOpen?.()) ?? true
+                    if (proceed) {
+                        requestAnimationFrame(() => setQueueScope("all-matching"))
+                    }
+                })()
             }
         },
-        [additionalActions, queueAction, testsetAction],
+        [additionalActions, queueAction, queueAllMatchingAction, testsetAction],
     )
 
     const menuItems = useMemo<NonNullable<MenuProps["items"]>>(() => {
@@ -108,14 +131,23 @@ const AddActionsDropdown = ({
         if (queueAction) {
             items.push({
                 key: "queue",
-                label: "Add annotation queue",
+                label: queueAction.label ?? "Add annotation queue",
                 icon: <ListChecks size={14} />,
                 disabled: queueAction.disabled ?? false,
             })
         }
 
+        if (queueAllMatchingAction) {
+            items.push({
+                key: "queue-all",
+                label: queueAllMatchingAction.label,
+                icon: <ListPlus size={14} />,
+                disabled: queueAllMatchingAction.disabled ?? false,
+            })
+        }
+
         return items
-    }, [additionalActions, queueAction, testsetAction])
+    }, [additionalActions, queueAction, queueAllMatchingAction, testsetAction])
 
     if (actions.length === 0) return null
 
@@ -146,16 +178,27 @@ const AddActionsDropdown = ({
         </Dropdown>
     )
 
+    const hasQueuePicker = Boolean(queueAction || queueAllMatchingAction)
+    // When the picker is closed `queueScope` is null — fall back to whichever
+    // action exists so the popover always has a coherent set of props.
+    const effectiveScope: QueueScope = queueScope ?? (queueAction ? "selected" : "all-matching")
+    const isAllMatching = effectiveScope === "all-matching"
+
     return (
         <div className={clsx("inline-flex", className)}>
-            {queueAction ? (
+            {hasQueuePicker ? (
                 <AddToQueuePopover
-                    itemType={queueAction.itemType}
-                    itemIds={queueAction.itemIds}
+                    itemType={isAllMatching ? "traces" : (queueAction?.itemType ?? "traces")}
+                    itemIds={isAllMatching ? undefined : (queueAction?.itemIds ?? [])}
+                    onItemsAdded={isAllMatching ? undefined : queueAction?.onItemsAdded}
+                    onQueueSelected={
+                        isAllMatching ? queueAllMatchingAction?.onQueueSelected : undefined
+                    }
                     disabled={isButtonDisabled}
-                    onItemsAdded={queueAction.onItemsAdded}
-                    open={queuePopoverOpen}
-                    onOpenChange={setQueuePopoverOpen}
+                    open={queueScope !== null}
+                    onOpenChange={(nextOpen) => {
+                        if (!nextOpen) setQueueScope(null)
+                    }}
                     toggleOnTriggerClick={false}
                 >
                     {dropdown}
diff --git a/web/oss/src/components/SharedActions/AddActionsDropdown/types.ts b/web/oss/src/components/SharedActions/AddActionsDropdown/types.ts
index 240a10b6bf..ef65cd9792 100644
--- a/web/oss/src/components/SharedActions/AddActionsDropdown/types.ts
+++ b/web/oss/src/components/SharedActions/AddActionsDropdown/types.ts
@@ -1,5 +1,6 @@
 import type {ReactNode} from "react"
 
+import type {SimpleQueue} from "@agenta/entities/simpleQueue"
 import type {ButtonProps} from "antd"
 
 export interface AddActionsDropdownAction {
@@ -22,10 +23,25 @@ export interface AddActionsDropdownProps {
         onSelect: () => void
     }
     additionalActions?: AddActionsDropdownAction[]
+    /** Selection-scoped queue add — adds a known set of item ids. */
     queueAction?: {
         itemType: "traces" | "testcases"
         itemIds: string[]
+        /** Menu label. Defaults to "Add annotation queue". */
+        label?: string
         disabled?: boolean
         onItemsAdded?: () => void
     }
+    /**
+     * Filter-scoped queue add — the caller owns the add (e.g. a background
+     * scan over a filter). Picking a queue calls `onQueueSelected`;
+     * `onBeforeOpen` can gate the picker (return `false` to abort — e.g. the
+     * user declined a confirm dialog).
+     */
+    queueAllMatchingAction?: {
+        label: string
+        disabled?: boolean
+        onBeforeOpen?: () => boolean | Promise<boolean>
+        onQueueSelected: (queue: SimpleQueue) => void
+    }
 }
diff --git a/web/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsx b/web/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsx
index ae3a150c91..3b4d9db56a 100644
--- a/web/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsx
+++ b/web/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsx
@@ -1,12 +1,15 @@
 import {useCallback, useEffect, useMemo, useRef, useState} from "react"
 
-import {message} from "@agenta/ui/app-message"
+import type {SimpleQueue} from "@agenta/entities/simpleQueue"
+import {exportMatchingTraces} from "@agenta/entities/trace/etl"
+import {message, modal} from "@agenta/ui/app-message"
 import {ArrowsClockwiseIcon, ExportIcon, TrashIcon} from "@phosphor-icons/react"
 import {Button, Input, Radio, RadioChangeEvent, Space, Switch, Tooltip, Typography} from "antd"
 import clsx from "clsx"
 import {useAtomValue, useSetAtom} from "jotai"
 import {queryClientAtom} from "jotai-tanstack-query"
 import dynamic from "next/dynamic"
+import Papa from "papaparse"
 
 import EnhancedButton from "@/oss/components/EnhancedUIs/Button"
 import {SortResult} from "@/oss/components/Filters/Sort"
@@ -14,15 +17,13 @@ import AddActionsDropdown from "@/oss/components/SharedActions/AddActionsDropdow
 import {deleteTraceModalAtom} from "@/oss/components/SharedDrawers/TraceDrawer/components/DeleteTraceModal/store/atom"
 import useLazyEffect from "@/oss/hooks/useLazyEffect"
 import {useProjectPermissions} from "@/oss/hooks/useProjectPermissions"
-import {downloadCsv} from "@/oss/lib/helpers/fileManipulations"
 import {getNodeById} from "@/oss/lib/traces/observability_helpers"
 import {Filter, FilterConditions, KeyValuePair} from "@/oss/lib/Types"
 import {getAppValues} from "@/oss/state/app"
 import {useObservability} from "@/oss/state/newObservability"
-import {
-    buildTraceQueryParams,
-    fetchAllTracesForExport,
-} from "@/oss/state/newObservability/atoms/queryHelpers"
+import {buildTraceQueryParams} from "@/oss/state/newObservability/atoms/queryHelpers"
+import {createAdaptiveTracePageFetcher} from "@/oss/state/newObservability/etl/adaptiveTracePageFetcher"
+import {createExportWriter, PICKER_CANCELLED} from "@/oss/state/newObservability/etl/exportWriter"
 import {getAgData} from "@/oss/state/newObservability/selectors/tracing"
 
 import {createTraceObject, DEFAULT_TRACE_EXPORT_HEADERS} from "../../assets/exportUtils"
@@ -31,6 +32,8 @@ import getFilterColumns from "../../assets/getFilterColumns"
 import {ObservabilityHeaderProps} from "../../assets/types"
 import {AUTO_REFRESH_INTERVAL} from "../../constants"
 
+import {useBatchAddTracesToQueue} from "./useBatchAddTracesToQueue"
+
 const Filters = dynamic(() => import("@/oss/components/Filters/Filters"), {ssr: false})
 const Sort = dynamic(() => import("@/oss/components/Filters/Sort"), {ssr: false})
 
@@ -131,6 +134,7 @@ const ObservabilityHeader = ({
         setAutoRefresh: hookSetAutoRefresh,
     } = useObservability()
     const queryClient = useAtomValue(queryClientAtom)
+    const runBatchAdd = useBatchAddTracesToQueue()
 
     // Use props if provided (sessions), otherwise use hook (traces)
     const autoRefresh = propsAutoRefresh ?? hookAutoRefresh
@@ -275,68 +279,111 @@ const ObservabilityHeader = ({
     const onExport = useCallback(async () => {
         const exportKey = "observability-export"
 
-        try {
-            if (!canExportData) return
-            if (!traces.length) return
-
-            const {currentApp} = getAppValues()
-            const appId = currentApp?.id || ""
-            const filename = `${currentApp?.name ?? currentApp?.slug ?? ""}_observability.csv`
+        if (!canExportData) return
+        if (!traces.length) return
 
-            const {
-                params,
-                hasAnnotationConditions,
-                hasAnnotationOperator,
-                isHasAnnotationSelected,
-            } = buildTraceQueryParams(filters, sort, traceTabs, undefined)
+        const {currentApp} = getAppValues()
+        const appId = currentApp?.id || ""
+        const filename = `${currentApp?.name ?? currentApp?.slug ?? ""}_observability.csv`
 
-            const headers =
-                columns
-                    .map((col) => {
-                        if (col.title === "ID") return "Trace ID"
-                        return typeof col.title === "string" ? col.title : null
-                    })
-                    .filter((header): header is string => Boolean(header)) || []
+        const {params, hasAnnotationConditions, hasAnnotationOperator, isHasAnnotationSelected} =
+            buildTraceQueryParams(filters, sort, traceTabs, undefined)
 
-            const controller = new AbortController()
-            exportAbortRef.current = controller
+        const headers =
+            columns
+                .map((col) => {
+                    if (col.title === "ID") return "Trace ID"
+                    return typeof col.title === "string" ? col.title : null
+                })
+                .filter((header): header is string => Boolean(header)) || []
+        const csvHeaders = headers.length > 0 ? headers : DEFAULT_TRACE_EXPORT_HEADERS
+
+        // Open the native file picker BEFORE starting the scan when the
+        // browser supports `showSaveFilePicker` (Chromium). User-cancel of
+        // the picker bails before any request is fired. On Safari / Firefox
+        // this falls back to the buffered Blob path — same UX as before.
+        const writer = await createExportWriter({filename, headers: csvHeaders})
+        if (writer === PICKER_CANCELLED) return
+
+        const controller = new AbortController()
+        exportAbortRef.current = controller
+
+        setIsExporting(true)
+        message.loading({
+            content: "Preparing export",
+            key: exportKey,
+            duration: 0,
+        })
 
-            setIsExporting(true)
-            message.loading({
-                content: "Preparing export",
-                key: exportKey,
-                duration: 0,
-            })
+        // Last reported row count — kept so a rate-limit pause can keep
+        // showing progress while the scan is paused for backoff.
+        let lastRowCount = 0
+
+        // Shared adaptive fetcher: bucket-aware proactive pacing + 429
+        // retry as the safety net. The queue scan uses the same helper —
+        // both pipelines now pace from the live bucket signal instead of
+        // an arbitrary constant.
+        const fetchPage = createAdaptiveTracePageFetcher({
+            params,
+            appId,
+            isHasAnnotationSelected,
+            hasAnnotationConditions,
+            hasAnnotationOperator,
+            signal: controller.signal,
+            onRateLimitPause: (delayMs) => {
+                message.loading({
+                    content:
+                        `Rate limited — pausing for ${Math.ceil(delayMs / 1000)}s` +
+                        ` (exported ${lastRowCount.toLocaleString()} rows so far)`,
+                    key: exportKey,
+                    duration: 0,
+                })
+            },
+        })
 
-            const {csvParts, rowCount, limitReached} = await fetchAllTracesForExport({
-                params,
-                appId,
-                isHasAnnotationSelected,
-                hasAnnotationConditions,
-                hasAnnotationOperator,
-                formatRow: createTraceObject,
-                headers: headers.length > 0 ? headers : DEFAULT_TRACE_EXPORT_HEADERS,
+        try {
+            const {rowCount, limitReached} = await exportMatchingTraces({
+                fetchPage,
+                flushBatch: async (batch) => {
+                    const rows = batch.map(createTraceObject)
+                    // The writer streams to disk on Chromium and buffers in
+                    // memory on Safari / Firefox — at this seam they look
+                    // identical to the pipeline, so memory stays bounded by
+                    // one batch on supported browsers.
+                    await writer.write(
+                        "\r\n" +
+                            Papa.unparse(
+                                {fields: csvHeaders, data: rows},
+                                {header: false, escapeFormulae: true},
+                            ),
+                    )
+                },
                 signal: controller.signal,
-                onProgress: (count) => {
+                // All pacing is done inside `fetchPage` based on the live
+                // bucket state — disable the source-level fixed delay.
+                pageDelayMs: 0,
+                onProgress: ({rows}) => {
+                    lastRowCount = rows
                     message.loading({
-                        content: `Exporting ${count.toLocaleString()} rows`,
+                        content: `Exporting ${rows.toLocaleString()} rows`,
                         key: exportKey,
                         duration: 0,
                     })
                 },
             })
 
+            // `finalize(0)` aborts the streaming writable so no header-only
+            // file is left on disk, and is a no-op on the buffered path.
+            await writer.finalize(rowCount)
+
             if (!rowCount) {
                 message.info({
                     content: "No traces to export",
                     key: exportKey,
                 })
-
                 return
             }
 
-            downloadCsv(csvParts, filename)
-
             if (limitReached) {
                 message.warning({
                     content: `Export limit reached. Downloaded first ${rowCount.toLocaleString()} rows.`,
@@ -350,6 +397,9 @@ const ObservabilityHeader = ({
                 })
             }
         } catch (error) {
+            // Discard any partial bytes already streamed to disk / buffered.
+            await writer.abort().catch(() => {})
+
             if ((error as Error).name === "AbortError") {
                 message.info({
                     content: "Export cancelled",
@@ -403,6 +453,53 @@ const ObservabilityHeader = ({
         setSelectedRowKeys([])
     }, [setSelectedRowKeys])
 
+    // Filter-scoped queue add — gate the picker. With a filter active, go
+    // straight to it; with no filter, confirm (this queues the whole project).
+    const onAddAllMatchingBeforeOpen = useCallback(async () => {
+        if (filters.length > 0) return true
+        return await new Promise<boolean>((resolve) => {
+            modal.confirm({
+                title: "Add every trace to the queue?",
+                content:
+                    "No filter is active — this will queue every trace in the project. Continue?",
+                okText: "Continue",
+                cancelText: "Cancel",
+                onOk: () => resolve(true),
+                onCancel: () => resolve(false),
+            })
+        })
+    }, [filters])
+
+    // Filter-scoped queue add — the picked queue runs a background scan of
+    // every trace matching the current observability filter. The hook owns
+    // its own rate-limit toast UI, so we hand it the raw scan params and
+    // let it build the adaptive fetcher itself.
+    const onAddAllMatchingQueueSelected = useCallback(
+        (queue: SimpleQueue) => {
+            const {currentApp} = getAppValues()
+            const appId = currentApp?.id || ""
+            const {
+                params,
+                hasAnnotationConditions,
+                hasAnnotationOperator,
+                isHasAnnotationSelected,
+            } = buildTraceQueryParams(filters, sort, traceTabs, undefined)
+            const projectURL = window.location.pathname.match(/^(\/w\/[^/]+\/p\/[^/]+)/)?.[1]
+            runBatchAdd({
+                queue,
+                scanConfig: {
+                    params,
+                    appId,
+                    isHasAnnotationSelected,
+                    hasAnnotationConditions,
+                    hasAnnotationOperator,
+                },
+                viewQueueUrl: projectURL ? `${projectURL}/annotations/${queue.id}` : undefined,
+            })
+        },
+        [filters, sort, traceTabs, runBatchAdd],
+    )
+
     const handleExportClick = useCallback(() => {
         if (isExporting) {
             exportAbortRef.current?.abort()
@@ -513,9 +610,19 @@ const ObservabilityHeader = ({
                                 queueAction={{
                                     itemType: "traces",
                                     itemIds: selectedTraceIds,
+                                    label:
+                                        selectedTraceIds.length > 0
+                                            ? `Add ${selectedTraceIds.length} selected to queue`
+                                            : "Add selected to queue",
                                     disabled: traces.length === 0 || selectedTraceIds.length === 0,
                                     onItemsAdded: handleQueueItemsAdded,
                                 }}
+                                queueAllMatchingAction={{
+                                    label: "Add all traces matching filter to queue",
+                                    disabled: traces.length === 0,
+                                    onBeforeOpen: onAddAllMatchingBeforeOpen,
+                                    onQueueSelected: onAddAllMatchingQueueSelected,
+                                }}
                             />
                         </Space>
                     </div>
diff --git a/web/oss/src/components/pages/observability/components/ObservabilityHeader/useBatchAddTracesToQueue.tsx b/web/oss/src/components/pages/observability/components/ObservabilityHeader/useBatchAddTracesToQueue.tsx
new file mode 100644
index 0000000000..d4b5bfad55
--- /dev/null
+++ b/web/oss/src/components/pages/observability/components/ObservabilityHeader/useBatchAddTracesToQueue.tsx
@@ -0,0 +1,342 @@
+/**
+ * useBatchAddTracesToQueue — drives the filter-scoped "add all matching traces
+ * to a queue" run plus its non-blocking progress notification.
+ *
+ * Picking a queue starts a background ETL scan (`addAllMatchingTracesToQueue`);
+ * the notification shows a live counter + Cancel and resolves to a terminal
+ * state (done / nothing-queued / cancelled / partial-error). A 429 from EE
+ * throttling pauses the scan (`withRateLimitRetry`) rather than killing it —
+ * the notification shows a "rate limited" state during the wait.
+ *
+ * The page-scan fetcher is built via `createAdaptiveTracePageFetcher`, so
+ * outbound request pacing tracks the live `X-RateLimit-Remaining` /
+ * `X-RateLimit-Limit` headers (bucket-aware, not tier-aware — the same
+ * code is the right pace on every plan).
+ *
+ * Copy says "queued", never "added" — `evaluate_batch_traces` is
+ * fire-and-forget, so the counter reflects IDs submitted, not scenarios
+ * materialized.
+ */
+
+import {useCallback, useEffect, useRef, useState} from "react"
+
+import {simpleQueueMolecule, type SimpleQueue} from "@agenta/entities/simpleQueue"
+import {addAllMatchingTracesToQueue, BatchFlushError} from "@agenta/entities/simpleQueue/etl"
+import {notification} from "@agenta/ui/app-message"
+import {Button} from "antd"
+import {useAtomValue} from "jotai"
+
+import {queueMaxItemsAtom} from "@/oss/state/access/atoms"
+import type {Condition} from "@/oss/state/newObservability/atoms/queryHelpers"
+import {createAdaptiveTracePageFetcher} from "@/oss/state/newObservability/etl/adaptiveTracePageFetcher"
+import {withRateLimitRetry} from "@/oss/state/newObservability/etl/withRateLimitRetry"
+/** Auto-dismiss window for the success toast; rendered as a visible progress bar. */
+const SUCCESS_DISMISS_MS = 5_000
+
+/** Brand primary used for the auto-dismiss countdown bar (matches antd's `colorPrimary`). */
+const PRIMARY_COLOR = "#1c2c3d"
+
+/**
+ * Description wrapper that drives the success toast's own auto-dismiss timer:
+ * a CSS-transitioned bar shrinks to zero over `durationMs`, and a single
+ * `setTimeout` fires `onComplete` after the same duration. Driving the bar
+ * via CSS (not React state) keeps the visible end of the countdown and the
+ * toast dismissal closely aligned: no "bar already at 0% but the toast still
+ * sitting there" gap, and no "toast already fading but the bar mid-animation"
+ * either. A short countdown label ("Dismissing in Ns") tells the user what's
+ * about to happen.
+ */
+const AutoDismissDescription = ({
+    text,
+    durationMs,
+    onComplete,
+}: {
+    text: string
+    durationMs: number
+    onComplete: () => void
+}) => {
+    const totalSeconds = Math.max(1, Math.ceil(durationMs / 1000))
+    const [secondsLeft, setSecondsLeft] = useState(totalSeconds)
+    // CSS transitions only animate when the value changes after mount, so
+    // we render at 100% and flip to 0% one animation frame later. Waiting a
+    // frame gives the browser a chance to register the initial 100% width.
+    const [barWidth, setBarWidth] = useState("100%")
+    // Pin the latest onComplete so the dismiss timeout never fires a stale
+    // closure (e.g. when a parent re-renders with a new handler).
+    const onCompleteRef = useRef(onComplete)
+    onCompleteRef.current = onComplete
+
+    useEffect(() => {
+        const raf = window.requestAnimationFrame(() => setBarWidth("0%"))
+        const startedAt = Date.now()
+        const tick = window.setInterval(() => {
+            const remainingMs = Math.max(0, durationMs - (Date.now() - startedAt))
+            // Never display "0s" — the dismissal fires when we'd reach it,
+            // and showing "Dismissing in 0s" reads awkwardly.
+            setSecondsLeft(remainingMs > 0 ? Math.max(1, Math.ceil(remainingMs / 1000)) : 0)
+        }, 250)
+        const dismiss = window.setTimeout(() => onCompleteRef.current(), durationMs)
+        return () => {
+            window.cancelAnimationFrame(raf)
+            window.clearInterval(tick)
+            window.clearTimeout(dismiss)
+        }
+    }, [durationMs])
+
+    return (
+        <div>
+            <div>{text}</div>
+            <div className="mt-2 flex items-center gap-2">
+                <div className="relative flex-1 h-1 bg-gray-200 rounded overflow-hidden">
+                    <div
+                        className="absolute inset-y-0 left-0 rounded"
+                        style={{
+                            width: barWidth,
+                            backgroundColor: PRIMARY_COLOR,
+                            transition: `width ${durationMs}ms linear`,
+                        }}
+                    />
+                </div>
+                <span className="text-xs text-gray-500 whitespace-nowrap">
+                    {secondsLeft > 0 ? `Dismissing in ${secondsLeft}s` : "Dismissing…"}
+                </span>
+            </div>
+        </div>
+    )
+}
+
+/** Trace-scan params the hook needs to build the adaptive page fetcher. */
+export interface BatchAddScanConfig {
+    params: Record<string, any>
+    appId: string
+    isHasAnnotationSelected: number
+    hasAnnotationConditions: Condition[]
+    hasAnnotationOperator?: string
+}
+
+export interface BatchAddRunInput {
+    queue: SimpleQueue
+    /**
+     * Trace-query params describing the live observability filter — the
+     * hook builds an adaptive (bucket-aware) page fetcher from it.
+     */
+    scanConfig: BatchAddScanConfig
+    /** Link target for the "View queue" action in the success notification. */
+    viewQueueUrl?: string
+}
+
+export const useBatchAddTracesToQueue = () => {
+    // Latest in-flight run — a new run or unmount aborts it.
+    const abortRef = useRef<AbortController | null>(null)
+    // Self-reference so the error notifications' Retry can re-run.
+    const runRef = useRef<(input: BatchAddRunInput) => void>(() => {})
+
+    // Tier-aware per-run cap — derived from the current billing plan.
+    // Hobby/free deployments stay at the historical 1k default; pro,
+    // business, enterprise unlock progressively higher caps. The mapping
+    // lives in `@agenta/entities/trace/etl` so it's unit-tested.
+    const maxItems = useAtomValue(queueMaxItemsAtom)
+
+    // Navigating away from observability aborts the run (partial add).
+    useEffect(() => () => abortRef.current?.abort(), [])
+
+    const run = useCallback(
+        (input: BatchAddRunInput) => {
+            const {queue, scanConfig, viewQueueUrl} = input
+            const queueName = queue.name || "queue"
+            const key = `batch-add-queue-${queue.id}`
+            // Capture the cap (and its formatted label) at run-time so a
+            // plan upgrade between runs picks up the new ceiling, but a
+            // single run quotes a consistent number throughout its toast.
+            const maxLabel = maxItems.toLocaleString()
+
+            // One run at a time per hook instance.
+            abortRef.current?.abort()
+            const controller = new AbortController()
+            abortRef.current = controller
+
+            // True only while this run still owns the toast key. A run superseded
+            // by a newer run (same queue → same `key`) must stay silent on its
+            // terminal toast, or it clobbers the live counter the new run owns.
+            const isCurrentRun = () => abortRef.current === controller
+
+            const cancelBtn = (
+                <Button size="small" onClick={() => controller.abort()}>
+                    Cancel
+                </Button>
+            )
+
+            // Last reported queued count — kept so the rate-limited state can
+            // still show progress while the scan is paused.
+            let lastQueued = 0
+
+            const showRunning = (queued: number) => {
+                lastQueued = queued
+                notification.open({
+                    key,
+                    message: `Queuing traces to "${queueName}"`,
+                    description: `Queued ${queued.toLocaleString()} of up to ${maxLabel} traces…`,
+                    duration: 0,
+                    btn: cancelBtn,
+                })
+            }
+
+            const showRateLimited = (delayMs: number) => {
+                notification.open({
+                    key,
+                    message: "Rate limited — pausing",
+                    description:
+                        `Queued ${lastQueued.toLocaleString()} of up to ${maxLabel} traces. ` +
+                        `The server is throttling requests; retrying in ` +
+                        `${Math.ceil(delayMs / 1000)}s…`,
+                    duration: 0,
+                    btn: cancelBtn,
+                })
+            }
+
+            showRunning(0)
+
+            // Shared adaptive fetcher: bucket-aware proactive pacing + 429 retry.
+            // Same helper the bulk CSV export uses — pacing is driven by the live
+            // `X-RateLimit-Remaining` / `X-RateLimit-Limit` headers from EE
+            // throttling, not by an arbitrary constant.
+            const fetchPage = createAdaptiveTracePageFetcher({
+                ...scanConfig,
+                signal: controller.signal,
+                onRateLimitPause: (delayMs) => showRateLimited(delayMs),
+            })
+
+            // `addTraces` hits a different throttle bucket than the trace query,
+            // so bucket-aware pacing for it would need its own readings. The 429
+            // retry remains as the per-call safety net.
+            const addTracesWithRetry = (queueId: string, traceIds: string[]) =>
+                withRateLimitRetry(() => simpleQueueMolecule.set.addTraces(queueId, traceIds), {
+                    signal: controller.signal,
+                    onRetry: (delayMs) => showRateLimited(delayMs),
+                })
+
+            void (async () => {
+                try {
+                    const result = await addAllMatchingTracesToQueue({
+                        fetchPage,
+                        addTraces: addTracesWithRetry,
+                        queueId: queue.id,
+                        signal: controller.signal,
+                        // Pacing happens inside `fetchPage` based on live bucket
+                        // state — disable the source-level fixed delay.
+                        pageDelayMs: 0,
+                        // Tier-aware ceiling (1k hobby → 20k enterprise).
+                        maxItems,
+                        onProgress: ({queued}) => {
+                            if (!controller.signal.aborted) showRunning(queued)
+                        },
+                    })
+
+                    // A superseded run resolves as "cancelled" — bail before it
+                    // can overwrite the toast the newer run now owns.
+                    if (!isCurrentRun()) return
+
+                    // Antd's notification.{success,info,warning} called with an
+                    // existing key UPDATES the entry in place — and when the
+                    // previous entry was `duration: 0` (the running counter),
+                    // the timer doesn't kick in from the new finite duration.
+                    // Destroy first so each terminal toast lands as a fresh
+                    // notification that honours its own duration.
+                    notification.destroy(key)
+
+                    if (result.stoppedBy === "cancelled") {
+                        notification.warning({
+                            key,
+                            message: "Cancelled",
+                            description: `${result.queued.toLocaleString()} traces queued to "${queueName}" before you stopped.`,
+                            duration: 6,
+                        })
+                        return
+                    }
+
+                    if (result.queued === 0) {
+                        notification.info({
+                            key,
+                            message: "Nothing queued",
+                            description: "No traces match this filter — nothing queued.",
+                            duration: 6,
+                        })
+                        return
+                    }
+
+                    const capped = result.stoppedBy === "cap"
+                    const descriptionText = capped
+                        ? `Queued to "${queueName}". Hit the ${maxLabel}-trace limit ` +
+                          `for one run — narrow the filter to queue the rest. ` +
+                          `They'll appear as the queue processes.`
+                        : `Queued to "${queueName}". They'll appear as the queue processes.`
+                    notification.success({
+                        key,
+                        message: `Queued ${result.queued.toLocaleString()} traces`,
+                        description: (
+                            <AutoDismissDescription
+                                text={descriptionText}
+                                durationMs={SUCCESS_DISMISS_MS}
+                                onComplete={() => notification.destroy(key)}
+                            />
+                        ),
+                        // Owned by `AutoDismissDescription` — antd's timer is disabled
+                        // so the visible countdown and the dismissal stay in sync.
+                        duration: 0,
+                        btn: viewQueueUrl ? (
+                            <Button
+                                size="small"
+                                type="primary"
+                                href={viewQueueUrl}
+                                onClick={() => notification.destroy(key)}
+                            >
+                                View queue
+                            </Button>
+                        ) : undefined,
+                    })
+                } catch (err) {
+                    // Stale runs never reach here (an abort resolves, not rejects),
+                    // but guard anyway so a superseded run can't surface an error.
+                    if (!isCurrentRun()) return
+
+                    const retryBtn = (
+                        <Button
+                            size="small"
+                            onClick={() => {
+                                notification.destroy(key)
+                                runRef.current(input)
+                            }}
+                        >
+                            Retry
+                        </Button>
+                    )
+
+                    if (err instanceof BatchFlushError) {
+                        notification.error({
+                            key,
+                            message: "Queue add incomplete",
+                            description: `Queued ${err.flushedCount.toLocaleString()} traces; ${err.failedCount.toLocaleString()} failed to queue.`,
+                            duration: 0,
+                            btn: retryBtn,
+                        })
+                        return
+                    }
+                    notification.error({
+                        key,
+                        message: "Queue add failed",
+                        description: err instanceof Error ? err.message : "Something went wrong.",
+                        duration: 0,
+                        btn: retryBtn,
+                    })
+                } finally {
+                    if (abortRef.current === controller) abortRef.current = null
+                }
+            })()
+        },
+        [maxItems],
+    )
+
+    runRef.current = run
+
+    return run
+}
diff --git a/web/oss/src/lib/api/assets/fetchClient.ts b/web/oss/src/lib/api/assets/fetchClient.ts
index 9a66c5a166..91899db30b 100644
--- a/web/oss/src/lib/api/assets/fetchClient.ts
+++ b/web/oss/src/lib/api/assets/fetchClient.ts
@@ -58,14 +58,31 @@ export async function getAuthToken(): Promise<string | undefined> {
     return safeGetJWT()
 }
 
-export async function fetchJson(url: URL, init: RequestInit = {}): Promise<any> {
+/** Successful return shape from `fetchJsonWithMeta`. */
+export interface FetchJsonResult {
+    /** Parsed JSON body (or raw text for non-JSON responses). */
+    data: any
+    /** Response headers — exposed so callers can read `X-RateLimit-*` etc. */
+    headers: Headers
+}
+
+/**
+ * Variant of `fetchJson` that also returns the response headers. Use this
+ * when you need to read server-driven signals (e.g. `X-RateLimit-Remaining`
+ * for adaptive request pacing). `fetchJson` is a thin wrapper that strips
+ * the headers off this function's result.
+ */
+export async function fetchJsonWithMeta(
+    url: URL,
+    init: RequestInit = {},
+): Promise<FetchJsonResult> {
     const jwt = await getAuthToken()
     try {
         const store = getDefaultStore()
         const authFlow = store.get(authFlowAtom)
         const allowDuringAuthing = (init as any)?._allowDuringAuthing
         if (authFlow === "authing" && !allowDuringAuthing) {
-            return undefined
+            return {data: undefined, headers: new Headers()}
         }
     } catch {
         // ignore store access failures
@@ -172,6 +189,15 @@ export async function fetchJson(url: URL, init: RequestInit = {}): Promise<any>
         throw error
     }
 
-    if (contentType.includes("application/json")) return res.json()
-    return res.text()
+    const data = contentType.includes("application/json") ? await res.json() : await res.text()
+    return {data, headers: res.headers}
+}
+
+/**
+ * Make a JSON request and return the parsed body. Thin wrapper over
+ * `fetchJsonWithMeta` — use this when you don't need the response headers.
+ */
+export async function fetchJson(url: URL, init: RequestInit = {}): Promise<any> {
+    const {data} = await fetchJsonWithMeta(url, init)
+    return data
 }
diff --git a/web/oss/src/services/tracing/api/index.ts b/web/oss/src/services/tracing/api/index.ts
index dfc8da6321..2cfb68924c 100644
--- a/web/oss/src/services/tracing/api/index.ts
+++ b/web/oss/src/services/tracing/api/index.ts
@@ -2,7 +2,13 @@ import dayjs from "dayjs"
 import utc from "dayjs/plugin/utc"
 
 import {SortResult} from "@/oss/components/Filters/Sort"
-import {ensureAppId, ensureProjectId, fetchJson, getBaseUrl} from "@/oss/lib/api/assets/fetchClient"
+import {
+    ensureAppId,
+    ensureProjectId,
+    fetchJson,
+    fetchJsonWithMeta,
+    getBaseUrl,
+} from "@/oss/lib/api/assets/fetchClient"
 import {getProjectValues} from "@/oss/state/project"
 
 import {calculateIntervalFromDuration, tracingToGeneration} from "../lib/helpers"
@@ -48,6 +54,82 @@ export const fetchAllPreviewTraces = async (
     })
 }
 
+/**
+ * Bucket state for adaptive pacing. `null` when the backend didn't return
+ * the corresponding header (OSS deployments without EE throttling, errors
+ * before headers, etc.).
+ */
+export interface PreviewTracesRateLimit {
+    /** `X-RateLimit-Remaining` — tokens left in the throttle bucket. */
+    remaining: number | null
+    /** `X-RateLimit-Limit` — bucket capacity. Only set on 429 responses. */
+    limit: number | null
+}
+
+/** Successful return shape from `fetchAllPreviewTracesWithMeta`. */
+export interface PreviewTracesWithMetaResult {
+    data: any
+    rateLimit: PreviewTracesRateLimit
+}
+
+const parseRateLimitHeader = (headers: Headers, name: string): number | null => {
+    const raw = headers.get(name)
+    if (!raw) return null
+    const n = Number.parseInt(raw, 10)
+    return Number.isFinite(n) ? n : null
+}
+
+/**
+ * Variant of `fetchAllPreviewTraces` that also returns the throttling bucket
+ * state read from the `X-RateLimit-*` headers. The scan pipelines (bulk export,
+ * batch add-to-queue) use it to pace requests adaptively without knowing the
+ * user's plan tier: the server reports the headroom left on every successful response.
+ */
+export const fetchAllPreviewTracesWithMeta = async (
+    params: Record<string, any> = {},
+    appId: string,
+    signal?: AbortSignal,
+): Promise<PreviewTracesWithMetaResult> => {
+    const base = getBaseUrl()
+    const projectId = ensureProjectId()
+    const applicationId = ensureAppId(appId)
+
+    const url = new URL(`${base}/tracing/spans/query`)
+    if (projectId) url.searchParams.set("project_id", projectId)
+    if (applicationId) url.searchParams.set("application_id", applicationId)
+
+    const payload: Record<string, any> = {}
+    Object.entries(params).forEach(([key, value]) => {
+        if (value === undefined || value === null) return
+        if (key === "size") {
+            payload.limit = Number(value)
+        } else if (key === "filter" && typeof value === "string") {
+            try {
+                payload.filter = JSON.parse(value)
+            } catch {
+                payload.filter = value
+            }
+        } else {
+            payload[key] = value
+        }
+    })
+
+    const {data, headers} = await fetchJsonWithMeta(url, {
+        method: "POST",
+        headers: {"Content-Type": "application/json"},
+        body: JSON.stringify(payload),
+        signal,
+    })
+
+    return {
+        data,
+        rateLimit: {
+            remaining: parseRateLimitHeader(headers, "x-ratelimit-remaining"),
+            limit: parseRateLimitHeader(headers, "x-ratelimit-limit"),
+        },
+    }
+}
+
 export const fetchPreviewTrace = async (traceId: string) => {
     const base = getBaseUrl()
     const projectId = ensureProjectId()
diff --git a/web/oss/src/state/access/atoms.ts b/web/oss/src/state/access/atoms.ts
index 3f417130b0..62ab9d5f04 100644
--- a/web/oss/src/state/access/atoms.ts
+++ b/web/oss/src/state/access/atoms.ts
@@ -1,3 +1,4 @@
+import {inferQueueMaxFromPlan} from "@agenta/entities/trace/etl"
 import {atom} from "jotai"
 import {atomWithQuery} from "jotai-tanstack-query"
 
@@ -200,6 +201,17 @@ export const currentCatalogEntryAtom = atom((get): CatalogEntry | null => {
     return catalog.find((entry) => entry.plan === plan) ?? null
 })
 
+// Derived: per-tier cap on how many traces a single "add all matching to
+// annotation queue" run will batch. The mapping lives in @agenta/entities
+// so it can be unit-tested without React/store coupling; this atom is just
+// the binding to the live subscription plan. Falls back to the hobby cap
+// when billing isn't on or the subscription hasn't loaded yet — matches
+// the historical 1k default.
+export const queueMaxItemsAtom = atom((get): number => {
+    const plan = get(currentSubscriptionQueryAtom).data?.plan
+    return inferQueueMaxFromPlan(plan)
+})
+
 export const rolesQueryAtom = atomWithQuery((get) => {
     const sessionExists = get(sessionExistsAtom)
     return {
diff --git a/web/oss/src/state/newObservability/atoms/queryHelpers.ts b/web/oss/src/state/newObservability/atoms/queryHelpers.ts
index e06f3ff271..30d99c82ba 100644
--- a/web/oss/src/state/newObservability/atoms/queryHelpers.ts
+++ b/web/oss/src/state/newObservability/atoms/queryHelpers.ts
@@ -4,14 +4,16 @@ import {
     transformTracesResponseToTree,
     transformTracingResponse,
 } from "@agenta/entities/trace"
-import Papa from "papaparse"
 
 import {
     normalizeReferenceValue,
     parseReferenceKey,
 } from "@/oss/components/pages/observability/assets/filters/referenceUtils"
-import {fetchAllPreviewTraces} from "@/oss/services/tracing/api"
-import {TraceSpan, TraceSpanNode} from "@/oss/services/tracing/types"
+import {
+    fetchAllPreviewTracesWithMeta,
+    type PreviewTracesRateLimit,
+} from "@/oss/services/tracing/api"
+import {TraceSpanNode} from "@/oss/services/tracing/types"
 
 export interface Condition {
     field: string
@@ -20,31 +22,6 @@ export interface Condition {
     key?: string
 }
 
-type ExportRow = Record<string, string | number | null | undefined>
-
-const EXPORT_PROGRESS_THROTTLE_MS = 300
-const DEFAULT_EXPORT_PAGE_SIZE = 500
-const MAX_EMPTY_PAGES = 3
-const MAX_EXPORT_ROWS = 20_000
-const EXPORT_PAGE_DELAY_MS = 100
-const CSV_UNPARSE_OPTIONS = {header: false, escapeFormulae: true}
-
-const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))
-
-const throwIfAborted = (signal?: AbortSignal) => {
-    if (signal?.aborted) {
-        throw new DOMException("Aborted", "AbortError")
-    }
-}
-
-const getTraceExportKey = (trace: TraceSpan | TraceSpanNode): string => {
-    if (trace.trace_id && trace.span_id) {
-        return `${trace.trace_id}:${trace.span_id}`
-    }
-
-    return `${trace.trace_id || ""}:${trace.parent_id || ""}:${trace.start_time || ""}`
-}
-
 export const buildAnnotationConditions = (value: any, operator: string): Condition[] => {
     const v = Array.isArray(value) ? value[0] : value || {}
     const out: Condition[] = []
@@ -285,6 +262,16 @@ export const executeTraceQuery = async ({
     let data: any = []
     let annotationPageSize: number | undefined
     let nextCursorFromStep1: string | undefined
+    // Latest server-side rate-limit headers, propagated to the caller so the
+    // scan pipelines' adaptive pacing can throttle proactively. Reset on each
+    // call, so even a long-running scan sees fresh bucket state on every page.
+    let lastRateLimit: PreviewTracesRateLimit = {remaining: null, limit: null}
+
+    const fetchPage = async (pageParams: Record<string, any>) => {
+        const result = await fetchAllPreviewTracesWithMeta(pageParams, appId, signal)
+        lastRateLimit = result.rateLimit
+        return result.data
+    }
 
     if (isHasAnnotationSelected !== -1) {
         const {originalFilter, annotationOnlyFilter} = buildFiltersForHasAnnotation(
@@ -299,7 +286,7 @@ export const executeTraceQuery = async ({
         firstParams.filter = annotationOnlyFilter
         if (pageParam?.newest) firstParams.newest = pageParam.newest
 
-        const data1 = await fetchAllPreviewTraces(firstParams, appId, signal)
+        const data1 = await fetchPage(firstParams)
 
         // page size for pagination decision
         const countEntries = (container: unknown) => {
@@ -331,12 +318,13 @@ export const executeTraceQuery = async ({
                 traceCount: 0,
                 nextCursor: nextCursorFromStep1,
                 annotationPageSize,
+                rateLimit: lastRateLimit,
             }
         }
 
         if (shouldExcludeAnnotations && traceIds.length === 0 && spanIds.length === 0) {
             if (pageParam?.newest) windowParams.newest = pageParam.newest
-            data = await fetchAllPreviewTraces(windowParams, appId, signal)
+            data = await fetchPage(windowParams)
         } else {
             // STEP 2: not paginated
             const extraConditions: Condition[] = shouldExcludeAnnotations
@@ -361,12 +349,12 @@ export const executeTraceQuery = async ({
             }
             secondParams.filter = mergeConditions(originalFilter, extraConditions)
 
-            data = await fetchAllPreviewTraces(secondParams, appId, signal)
+            data = await fetchPage(secondParams)
         }
     } else {
         // normal flow
         if (pageParam?.newest) windowParams.newest = pageParam.newest
-        data = await fetchAllPreviewTraces(windowParams, appId, signal)
+        data = await fetchPage(windowParams)
     }
 
     // transform to tree
@@ -395,8 +383,9 @@ export const executeTraceQuery = async ({
             const minVal = times.reduce((min, cur) => (cur < min ? cur : min))
             // Bump by +1ms so the backend's strict-less-than filter
             // (`start_time < newest`) still includes traces at this exact
-            // timestamp. The dedup set in fetchAllTracesForExport handles
-            // any resulting overlap.
+            // timestamp. Downstream dedup (the export pipeline's dedup
+            // transform, the queue scan's dedup-key transform) handles any
+            // resulting overlap.
             const cursorDate = new Date(minVal + 1)
             const lowerBound =
                 params.oldest && typeof params.oldest === "string"
@@ -418,118 +407,6 @@ export const executeTraceQuery = async ({
         traceCount: (data as any)?.count ?? 0,
         nextCursor,
         annotationPageSize,
-    }
-}
-
-export const fetchAllTracesForExport = async ({
-    params,
-    appId,
-    isHasAnnotationSelected,
-    hasAnnotationConditions,
-    hasAnnotationOperator,
-    formatRow,
-    headers,
-    onProgress,
-    signal,
-    pageSize = DEFAULT_EXPORT_PAGE_SIZE,
-}: {
-    params: Record<string, any>
-    appId: string
-    isHasAnnotationSelected: number
-    hasAnnotationConditions: Condition[]
-    hasAnnotationOperator?: string
-    formatRow: (trace: TraceSpan | TraceSpanNode) => ExportRow
-    headers: string[]
-    onProgress?: (rowCount: number) => void
-    signal?: AbortSignal
-    pageSize?: number
-}): Promise<{csvParts: BlobPart[]; rowCount: number; limitReached: boolean}> => {
-    const exportParams = {...params, size: pageSize}
-
-    if (!exportParams.newest) {
-        exportParams.newest = new Date().toISOString()
-    }
-
-    const csvHeader = Papa.unparse({fields: headers, data: []})
-    const csvParts: BlobPart[] = [csvHeader]
-    const seen = new Set<string>()
-
-    let rowCount = 0
-    let cursor: string | undefined
-    let lastProgressUpdate = 0
-    let emptyPageCount = 0
-    let limitReached = false
-
-    while (true) {
-        throwIfAborted(signal)
-
-        const result = await executeTraceQuery({
-            params: exportParams,
-            pageParam: cursor ? {newest: cursor} : undefined,
-            appId,
-            isHasAnnotationSelected,
-            hasAnnotationConditions,
-            hasAnnotationOperator,
-            signal,
-        })
-
-        const rows: ExportRow[] = []
-
-        const collectSpans = (node: TraceSpan | TraceSpanNode) => {
-            const key = getTraceExportKey(node)
-            if (seen.has(key)) return
-            // Stop collecting if we've hit the limit
-            if (rowCount + rows.length >= MAX_EXPORT_ROWS) return
-            rows.push(formatRow(node))
-            seen.add(key)
-
-            const children = (node as TraceSpanNode).children
-            if (children && Array.isArray(children)) {
-                for (const child of children) {
-                    collectSpans(child)
-                }
-            }
-        }
-
-        for (const trace of result.traces) {
-            collectSpans(trace)
-        }
-
-        if (rows.length > 0) {
-            csvParts.push("\r\n", Papa.unparse({fields: headers, data: rows}, CSV_UNPARSE_OPTIONS))
-            rowCount += rows.length
-            emptyPageCount = 0
-        } else {
-            emptyPageCount++
-        }
-
-        const now = Date.now()
-        if (now - lastProgressUpdate >= EXPORT_PROGRESS_THROTTLE_MS) {
-            onProgress?.(rowCount)
-            lastProgressUpdate = now
-        }
-
-        // Check if we hit the row limit
-        if (rowCount >= MAX_EXPORT_ROWS) {
-            limitReached = true
-            break
-        }
-
-        if (!result.nextCursor) break
-        // Safety: if cursor isn't advancing (all rows already seen), stop
-        if (emptyPageCount >= MAX_EMPTY_PAGES) break
-
-        // Throttle requests to reduce API load
-        await delay(EXPORT_PAGE_DELAY_MS)
-
-        cursor = result.nextCursor
-    }
-
-    onProgress?.(rowCount)
-
-    return {
-        csvParts,
-        rowCount,
-        limitReached,
+        rateLimit: lastRateLimit,
     }
 }
diff --git a/web/oss/src/state/newObservability/etl/adaptiveExportPacing.ts b/web/oss/src/state/newObservability/etl/adaptiveExportPacing.ts
new file mode 100644
index 0000000000..0684048391
--- /dev/null
+++ b/web/oss/src/state/newObservability/etl/adaptiveExportPacing.ts
@@ -0,0 +1,28 @@
+/**
+ * Adaptive request pacing — DOM-dependent sleep helper for the bulk-trace
+ * export. The pure pacing math lives in `@agenta/entities/trace/etl` so it
+ * can be unit-tested; this file only adds the abortable `setTimeout`
+ * wrapper the call site needs.
+ */
+
+/** Abortable sleep — rejects with `AbortError` if the signal fires first. */
+export const adaptiveSleep = (ms: number, signal?: AbortSignal): Promise<void> =>
+    new Promise<void>((resolve, reject) => {
+        if (ms <= 0) {
+            resolve()
+            return
+        }
+        if (signal?.aborted) {
+            reject(new DOMException("Aborted", "AbortError"))
+            return
+        }
+        const onAbort = () => {
+            clearTimeout(timer)
+            reject(new DOMException("Aborted", "AbortError"))
+        }
+        const timer = setTimeout(() => {
+            signal?.removeEventListener("abort", onAbort)
+            resolve()
+        }, ms)
+        signal?.addEventListener("abort", onAbort, {once: true})
+    })
diff --git a/web/oss/src/state/newObservability/etl/adaptiveTracePageFetcher.ts b/web/oss/src/state/newObservability/etl/adaptiveTracePageFetcher.ts
new file mode 100644
index 0000000000..b4e7d68a0c
--- /dev/null
+++ b/web/oss/src/state/newObservability/etl/adaptiveTracePageFetcher.ts
@@ -0,0 +1,124 @@
+/**
+ * createAdaptiveTracePageFetcher — build a per-page trace fetcher for the
+ * scan pipelines (bulk CSV export, batch-add-to-annotation-queue) with
+ * two layers of throttle awareness baked in:
+ *
+ *   1. **Proactive pacing**. Before each fetch (after the first), sleep
+ *      for a duration computed from the previous response's
+ *      `X-RateLimit-Remaining` / `X-RateLimit-Limit` headers. Floor at
+ *      full bucket, ramp toward the sustained refill rate as the bucket
+ *      drains. See `computeAdaptivePageDelayMs` for the math.
+ *
+ *   2. **Reactive retry**. Each fetch is wrapped in `withRateLimitRetry`,
+ *      so any 429 that slips past the proactive pacing (parallel org
+ *      traffic, missing headers, …) pauses and resumes with the server's
+ *      `Retry-After` backoff instead of killing the scan. An optional
+ *      callback lets the consumer surface the pause in its UI.
+ *
+ * Both consumers used to either duplicate this wiring (the export's
+ * inline closure) or pace with an arbitrary constant (the queue scan's
+ * `SCAN_PAGE_DELAY_MS = 300`). They now share one fetcher, and the
+ * "right" pace per tier falls out of the live bucket signal.
+ */
+
+import {computeAdaptivePageDelayMs} from "@agenta/entities/trace/etl"
+
+import type {TraceSpanNode} from "@/oss/services/tracing/types"
+import {executeTraceQuery, type Condition} from "@/oss/state/newObservability/atoms/queryHelpers"
+
+import {adaptiveSleep} from "./adaptiveExportPacing"
+import {withRateLimitRetry} from "./withRateLimitRetry"
+
+/** Default API page size (rows per request). */
+export const DEFAULT_TRACE_PAGE_SIZE = 500
+
+export interface AdaptiveTracePageFetcherConfig {
+    /** Trace-query params already shaped by `buildTraceQueryParams`. */
+    params: Record<string, any>
+    appId: string
+    isHasAnnotationSelected: number
+    hasAnnotationConditions: Condition[]
+    hasAnnotationOperator?: string
+    /** Abort signal — used for adaptive sleep and retry-backoff sleep. */
+    signal: AbortSignal
+    /** Rows per request (default 500). */
+    pageSize?: number
+    /**
+     * Fired before each 429-induced sleep so the consumer can surface the
+     * pause in its UI. `delayMs` is the server-honored backoff duration.
+     */
+    onRateLimitPause?: (delayMs: number) => void
+}
+
+/** One page of matching traces; `nextCursor: null` ⇒ end of stream. */
+export interface AdaptiveTracePage {
+    rows: TraceSpanNode[]
+    nextCursor: string | null
+}
+
+/**
+ * Cursor-paged trace fetcher. `cursor: null` = first page. The `signal`
+ * argument the source passes per-page is forwarded to the inner fetch as
+ * its abort; the adaptive sleep / retry sleep use the signal captured at
+ * construction time (in practice they're the same controller).
+ */
+export type AdaptiveTracePageFetcher = (
+    cursor: string | null,
+    signal: AbortSignal,
+) => Promise<AdaptiveTracePage>
+
+export const createAdaptiveTracePageFetcher = ({
+    params,
+    appId,
+    isHasAnnotationSelected,
+    hasAnnotationConditions,
+    hasAnnotationOperator,
+    signal,
+    pageSize = DEFAULT_TRACE_PAGE_SIZE,
+    onRateLimitPause,
+}: AdaptiveTracePageFetcherConfig): AdaptiveTracePageFetcher => {
+    const scanParams: Record<string, any> = {...params, size: pageSize}
+    if (!scanParams.newest) {
+        scanParams.newest = new Date().toISOString()
+    }
+
+    // Live bucket state from the latest successful response. The first
+    // call has no reading — adaptive sleep is skipped, and any 429 falls
+    // to the retry wrapper to recover.
+    let lastRateLimit: {remaining: number | null; limit: number | null} = {
+        remaining: null,
+        limit: null,
+    }
+    let isFirstFetch = true
+
+    return async (cursor, callerSignal) => {
+        if (!isFirstFetch) {
+            const delayMs = computeAdaptivePageDelayMs(lastRateLimit)
+            await adaptiveSleep(delayMs, signal)
+        }
+        isFirstFetch = false
+
+        return withRateLimitRetry(
+            async () => {
+                const result = await executeTraceQuery({
+                    params: scanParams,
+                    pageParam: cursor ? {newest: cursor} : undefined,
+                    appId,
+                    isHasAnnotationSelected,
+                    hasAnnotationConditions,
+                    hasAnnotationOperator,
+                    signal: callerSignal,
+                })
+                lastRateLimit = result.rateLimit
+                return {
+                    rows: result.traces,
+                    nextCursor: result.nextCursor ?? null,
+                }
+            },
+            {
+                signal,
+                onRetry: (delayMs) => onRateLimitPause?.(delayMs),
+            },
+        )
+    }
+}
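Review note: the pacing math itself (`computeAdaptivePageDelayMs`) lives in `@agenta/entities/trace/etl` and is not shown in this file. The block below is only a minimal sketch of the behavior the header comment describes ("floor at full bucket, ramp as the bucket drains"), assuming a linear ramp. `sketchPageDelayMs`, `FLOOR_MS`, and `CEILING_MS` are hypothetical names and values chosen for illustration, not the real implementation or its tuning.

```typescript
// Hypothetical sketch; NOT the real computeAdaptivePageDelayMs.
interface RateLimitReading {
    remaining: number | null
    limit: number | null
}

// Assumed bounds, chosen for illustration only.
const FLOOR_MS = 50 // delay when the bucket is full
const CEILING_MS = 1_000 // delay when the bucket is empty

const sketchPageDelayMs = ({remaining, limit}: RateLimitReading): number => {
    // No usable reading (e.g. missing headers): use the floor and leave any
    // throttling to the reactive 429 retry layer.
    if (remaining === null || limit === null || limit <= 0) return FLOOR_MS
    // Fraction of the bucket already consumed, clamped to [0, 1].
    const drained = Math.min(1, Math.max(0, 1 - remaining / limit))
    // Linear ramp from FLOOR_MS (full bucket) to CEILING_MS (empty bucket).
    return Math.round(FLOOR_MS + drained * (CEILING_MS - FLOOR_MS))
}
```

Under these assumed bounds, a half-drained bucket paces at roughly the midpoint between floor and ceiling. The real function may use a different curve or derive its ceiling from the sustained refill rate.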
diff --git a/web/oss/src/state/newObservability/etl/exportWriter.ts b/web/oss/src/state/newObservability/etl/exportWriter.ts
new file mode 100644
index 0000000000..a0bdd939fe
--- /dev/null
+++ b/web/oss/src/state/newObservability/etl/exportWriter.ts
@@ -0,0 +1,179 @@
+/**
+ * createExportWriter — choose the lowest-memory CSV writer the browser
+ * supports.
+ *
+ *   - Chromium browsers expose the File System Access API
+ *     (`window.showSaveFilePicker`). We open a writable file stream and
+ *     hand each batch directly to disk — memory stays bounded by one batch.
+ *   - Safari / Firefox fall back to the legacy `BlobPart[]` accumulator
+ *     + anchor-click download, exactly matching the previous behavior.
+ *
+ * The caller doesn't need to know which path it got — it writes encoded
+ * chunks, then calls `finalize(rowCount)` on success or `abort()` on
+ * failure / cancel. `finalize(0)` is a clean no-op in the buffered path
+ * and aborts the streaming writable so a header-only file isn't left on
+ * disk.
+ */
+
+import Papa from "papaparse"
+
+import {downloadCsv} from "@/oss/lib/helpers/fileManipulations"
+
+/**
+ * Subset of `FileSystemWritableFileStream` we actually call. Declared
+ * locally to keep the runtime feature-detect honest — the global type
+ * isn't available in every browser's typings even when the runtime supports
+ * it (and the buffered path doesn't need any of this).
+ */
+interface WritableFileStreamLike {
+    write(chunk: BlobPart): Promise<void>
+    close(): Promise<void>
+    abort(reason?: unknown): Promise<void>
+}
+
+/** Subset of `FileSystemFileHandle` we use. */
+interface FileSystemFileHandleLike {
+    createWritable(): Promise<WritableFileStreamLike>
+}
+
+/** Subset of `window.showSaveFilePicker` (and its options) we rely on. */
+interface ShowSaveFilePickerOptions {
+    suggestedName?: string
+    types?: {
+        description?: string
+        accept: Record<string, string[]>
+    }[]
+}
+type ShowSaveFilePicker = (options?: ShowSaveFilePickerOptions) => Promise<FileSystemFileHandleLike>
+
+/** Sentinel returned when the user cancels the native file picker. */
+export const PICKER_CANCELLED = "cancelled" as const
+
+/** The writer protocol consumed by `ObservabilityHeader.onExport`. */
+export interface ExportWriter {
+    /** Streaming (disk) or buffered (in-memory)? Surfaced for telemetry / debug. */
+    kind: "stream" | "buffer"
+    /** Append one pre-encoded chunk. */
+    write(chunk: BlobPart): Promise<void>
+    /**
+     * Commit the export. `rowCount === 0` discards the file (or skips the
+     * download) so no empty / header-only artifact is produced.
+     */
+    finalize(rowCount: number): Promise<void>
+    /** Discard partial data on cancel / error. Safe to call multiple times. */
+    abort(): Promise<void>
+}
+
+const getShowSaveFilePicker = (): ShowSaveFilePicker | undefined => {
+    if (typeof window === "undefined") return undefined
+    const picker = (window as unknown as {showSaveFilePicker?: ShowSaveFilePicker})
+        .showSaveFilePicker
+    return typeof picker === "function" ? picker : undefined
+}
+
+const CSV_FILE_TYPE = {
+    description: "CSV file",
+    accept: {"text/csv": [".csv"]},
+}
+
+/**
+ * Build a writer for the current browser, prompting the user for a save
+ * location upfront on Chromium-class browsers. Returns `PICKER_CANCELLED`
+ * if the user dismisses the native file picker — the call site should bail
+ * before starting the scan.
+ */
+export const createExportWriter = async ({
+    filename,
+    headers,
+}: {
+    filename: string
+    headers: string[]
+}): Promise<ExportWriter | typeof PICKER_CANCELLED> => {
+    const csvHeader = Papa.unparse({fields: headers, data: []})
+
+    // ─── Streaming path ──────────────────────────────────────────────────
+    // The File System Access API materializes the file at `createWritable`
+    // time. Header is written eagerly so the writer is in a known state
+    // before the first scan page lands.
+    const showSaveFilePicker = getShowSaveFilePicker()
+    if (showSaveFilePicker) {
+        let handle: FileSystemFileHandleLike
+        try {
+            handle = await showSaveFilePicker({
+                suggestedName: filename,
+                types: [CSV_FILE_TYPE],
+            })
+        } catch (err) {
+            // User dismissed the picker — bail entirely, the scan never starts.
+            if ((err as Error)?.name === "AbortError") return PICKER_CANCELLED
+            // Permissions / other failures: fall through to the buffered path
+            // so the export still works.
+
+            console.warn("[export] showSaveFilePicker failed, falling back to Blob:", err)
+            handle = null as unknown as FileSystemFileHandleLike
+        }
+
+        if (handle) {
+            try {
+                const writable = await handle.createWritable()
+                await writable.write(csvHeader)
+                let aborted = false
+                return {
+                    kind: "stream",
+                    write: (chunk) => writable.write(chunk),
+                    finalize: async (rowCount) => {
+                        if (aborted) return
+                        if (rowCount > 0) {
+                            await writable.close()
+                        } else {
+                            // No matching rows — discard the header-only file
+                            // so the user doesn't end up with an empty artifact
+                            // they didn't intend to keep.
+                            aborted = true
+                            await writable.abort()
+                        }
+                    },
+                    abort: async () => {
+                        if (aborted) return
+                        aborted = true
+                        try {
+                            await writable.abort()
+                        } catch {
+                            // Best-effort cleanup: aborting a stream that
+                            // has already closed or errored can throw, and
+                            // by then there is nothing left to discard.
+                        }
+                    },
+                }
+            } catch (err) {
+                // createWritable() can fail with permissions errors — fall
+                // through to the buffered path so the export still works.
+
+                console.warn("[export] createWritable failed, falling back to Blob:", err)
+            }
+        }
+    }
+
+    // ─── Buffered fallback ───────────────────────────────────────────────
+    // Same shape as the previous `fetchAllTracesForExport` accumulator: hold
+    // all CSV chunks in JS heap, assemble a Blob, anchor-download at the end.
+    const csvParts: BlobPart[] = [csvHeader]
+    let aborted = false
+    return {
+        kind: "buffer",
+        write: async (chunk) => {
+            if (aborted) return
+            csvParts.push(chunk)
+        },
+        finalize: async (rowCount) => {
+            if (aborted) return
+            if (rowCount > 0) downloadCsv(csvParts, filename)
+            // 0 rows: matches the legacy "No traces to export" path — no
+            // download, no file artifact, the caller surfaces the toast.
+        },
+        abort: async () => {
+            aborted = true
+            csvParts.length = 0
+        },
+    }
+}
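Review note: a minimal sketch of how a call site might drive the writer protocol described above (write chunks, then `finalize(rowCount)` on success or `abort()` on failure). `WriterLike`, `makeMemoryWriter`, and `drainToWriter` are hypothetical stand-ins for illustration. The real consumer is `ObservabilityHeader.onExport`, which is not part of this file, and the real `ExportWriter` accepts any `BlobPart`, not only strings.

```typescript
// Hypothetical stand-in for the ExportWriter protocol (string chunks only).
interface WriterLike {
    write(chunk: string): Promise<void>
    finalize(rowCount: number): Promise<void>
    abort(): Promise<void>
}

// In-memory writer used only to exercise the protocol in this sketch.
const makeMemoryWriter = () => {
    const parts: string[] = []
    const state = {finalizedWith: -1, aborted: false}
    const writer: WriterLike = {
        write: async (chunk) => {
            if (!state.aborted) parts.push(chunk)
        },
        finalize: async (rowCount) => {
            if (!state.aborted) state.finalizedWith = rowCount
        },
        abort: async () => {
            state.aborted = true
            parts.length = 0
        },
    }
    return {writer, parts, state}
}

// Write every non-empty page, then finalize on success or abort on failure.
const drainToWriter = async (
    writer: WriterLike,
    pages: AsyncIterable<string[]>,
): Promise<number> => {
    let rowCount = 0
    try {
        for await (const page of pages) {
            if (page.length === 0) continue
            await writer.write(page.join("\r\n") + "\r\n")
            rowCount += page.length
        }
        await writer.finalize(rowCount)
        return rowCount
    } catch (err) {
        // Discard partial data, then surface the original error.
        await writer.abort()
        throw err
    }
}
```

The point of the shape is that the call site never branches on `kind`: the same try/finalize/abort flow works for both the streaming and the buffered writer.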
diff --git a/web/oss/src/state/newObservability/etl/withRateLimitRetry.ts b/web/oss/src/state/newObservability/etl/withRateLimitRetry.ts
new file mode 100644
index 0000000000..c8852710b9
--- /dev/null
+++ b/web/oss/src/state/newObservability/etl/withRateLimitRetry.ts
@@ -0,0 +1,109 @@
+/**
+ * withRateLimitRetry — wraps an async transport call so an HTTP 429 from EE
+ * throttling pauses and retries instead of killing the scan (batch add or CSV export).
+ *
+ * EE's throttling service returns 429 with a `Retry-After` header (seconds)
+ * and a "...retry after N seconds." detail message. This honors the
+ * `Retry-After`, retries up to `maxRetries`, and aborts cleanly mid-wait.
+ * Once the cap is hit the original error rethrows — the run then ends with a
+ * partial add, which the UI surfaces with a Retry action.
+ */
+
+const DEFAULT_MAX_RETRIES = 6
+const DEFAULT_DELAY_MS = 10_000
+/** Cap so a bad `Retry-After` can't stall the run for minutes. */
+const MAX_DELAY_MS = 60_000
+
+export interface RateLimitRetryOptions {
+    /** Max retry attempts for this call. Default 6. */
+    maxRetries?: number
+    /** Wait used when a 429 carries no `Retry-After`, ms. Default 10 000. */
+    defaultDelayMs?: number
+    /** Cancels the wait between retries. */
+    signal?: AbortSignal
+    /** Fired before each wait — `delayMs` is the wait, `attempt` is 1-based. */
+    onRetry?: (delayMs: number, attempt: number) => void
+}
+
+/** True when `err` looks like an HTTP 429, across the error shapes we see. */
+const isRateLimitError = (err: unknown): boolean => {
+    const e = err as {status?: number; response?: {status?: number}; message?: string} | null
+    if (e?.status === 429 || e?.response?.status === 429) return true
+    const message = (e?.message ?? "").toLowerCase()
+    return message.includes("rate limit") || message.includes("too many requests")
+}
+
+/**
+ * Retry delay (ms) for a 429 — the `Retry-After` header, else the seconds
+ * parsed from the detail message, else the fallback. Capped at `MAX_DELAY_MS`.
+ */
+const getRetryDelayMs = (err: unknown, fallbackMs: number): number => {
+    const e = err as {response?: {headers?: Record<string, unknown>}; message?: string} | null
+
+    const headers = e?.response?.headers
+    const headerValue = headers?.["retry-after"] ?? headers?.["Retry-After"]
+    const headerSeconds =
+        typeof headerValue === "number"
+            ? headerValue
+            : typeof headerValue === "string"
+              ? Number.parseInt(headerValue, 10)
+              : NaN
+
+    let delayMs = fallbackMs
+    if (Number.isFinite(headerSeconds) && headerSeconds > 0) {
+        delayMs = headerSeconds * 1000
+    } else {
+        const match = (e?.message ?? "").match(/retry after (\d+)\s*second/i)
+        if (match) {
+            const seconds = Number.parseInt(match[1], 10)
+            if (Number.isFinite(seconds) && seconds > 0) delayMs = seconds * 1000
+        }
+    }
+    return Math.min(delayMs, MAX_DELAY_MS)
+}
+
+/** Abortable sleep — rejects with `AbortError` if the signal fires first. */
+const sleep = (ms: number, signal?: AbortSignal): Promise<void> =>
+    new Promise<void>((resolve, reject) => {
+        if (signal?.aborted) {
+            reject(new DOMException("Aborted", "AbortError"))
+            return
+        }
+        let timer: ReturnType<typeof setTimeout> | undefined
+        const onAbort = () => {
+            if (timer !== undefined) clearTimeout(timer)
+            reject(new DOMException("Aborted", "AbortError"))
+        }
+        timer = setTimeout(() => {
+            signal?.removeEventListener("abort", onAbort)
+            resolve()
+        }, ms)
+        signal?.addEventListener("abort", onAbort, {once: true})
+    })
+
+/**
+ * Run `fn`, retrying on HTTP 429 with the server's `Retry-After` backoff.
+ * Non-rate-limit errors, and rate-limit errors past `maxRetries`, rethrow.
+ */
+export const withRateLimitRetry = async <T>(
+    fn: () => Promise<T>,
+    {
+        maxRetries = DEFAULT_MAX_RETRIES,
+        defaultDelayMs = DEFAULT_DELAY_MS,
+        signal,
+        onRetry,
+    }: RateLimitRetryOptions = {},
+): Promise<T> => {
+    let attempt = 0
+    while (true) {
+        try {
+            return await fn()
+        } catch (err) {
+            if (!isRateLimitError(err) || attempt >= maxRetries) throw err
+            attempt += 1
+            const delayMs = getRetryDelayMs(err, defaultDelayMs)
+            onRetry?.(delayMs, attempt)
+            await sleep(delayMs, signal)
+        }
+    }
+}
diff --git a/web/packages/agenta-annotation-ui/src/components/AddToQueuePopover/index.tsx b/web/packages/agenta-annotation-ui/src/components/AddToQueuePopover/index.tsx
index 89f4909640..c1ec817ee3 100644
--- a/web/packages/agenta-annotation-ui/src/components/AddToQueuePopover/index.tsx
+++ b/web/packages/agenta-annotation-ui/src/components/AddToQueuePopover/index.tsx
@@ -41,25 +41,36 @@ function mergeQueuesById(...queueLists: SimpleQueue[][]): SimpleQueue[] {
 
 interface AddToQueuePopoverProps {
     itemType: "traces" | "testcases"
-    itemIds: string[]
+    /** Item ids for the default add path. Omit when `onQueueSelected` is set. */
+    itemIds?: string[]
     children: React.ReactNode
     disabled?: boolean
     onItemsAdded?: () => void
     open?: boolean
     onOpenChange?: (open: boolean) => void
     toggleOnTriggerClick?: boolean
+    /**
+     * Picker mode: when set, selecting a queue calls this instead of the
+     * built-in `addTraces` / `addTestcases` path — the caller owns the add
+     * (e.g. a filter-scoped background scan). The "New queue" button stays
+     * available so a user with no matching queues can create one, then reopen
+     * the picker.
+     */
+    onQueueSelected?: (queue: SimpleQueue) => void
 }
 
 const QueueListContent = ({
     itemType,
-    itemIds,
+    itemIds = [],
     onClose,
     onItemsAdded,
+    onQueueSelected,
 }: {
     itemType: "traces" | "testcases"
-    itemIds: string[]
+    itemIds?: string[]
     onClose: () => void
     onItemsAdded?: () => void
+    onQueueSelected?: (queue: SimpleQueue) => void
 }) => {
     const navigation = useAnnotationNavigationSafe()
     const navigateToQueue = navigation?.navigateToQueue
@@ -97,6 +108,12 @@ const QueueListContent = ({
 
     const handleSelect = useCallback(
         async (queue: SimpleQueue) => {
+            // Picker mode — the caller owns the add (e.g. a background scan).
+            if (onQueueSelected) {
+                onQueueSelected(queue)
+                onClose()
+                return
+            }
             if (itemIds.length === 0) return
             setSubmittingId(queue.id)
             try {
@@ -132,7 +149,16 @@ const QueueListContent = ({
                 setSubmittingId(null)
             }
         },
-        [itemIds, itemType, addTraces, addTestcases, onClose, onItemsAdded, navigateToQueue],
+        [
+            itemIds,
+            itemType,
+            addTraces,
+            addTestcases,
+            onClose,
+            onItemsAdded,
+            navigateToQueue,
+            onQueueSelected,
+        ],
     )
 
     const handleNewQueue = useCallback(
@@ -236,6 +262,7 @@ const AddToQueuePopover = ({
     open: controlledOpen,
     onOpenChange,
     toggleOnTriggerClick = true,
+    onQueueSelected,
 }: AddToQueuePopoverProps) => {
     const [uncontrolledOpen, setUncontrolledOpen] = useState(false)
     const isControlled = controlledOpen !== undefined
@@ -270,6 +297,7 @@ const AddToQueuePopover = ({
                         itemIds={itemIds}
                         onClose={handleClose}
                         onItemsAdded={onItemsAdded}
+                        onQueueSelected={onQueueSelected}
                     />
                 ) : null
             }
diff --git a/web/packages/agenta-entities/package.json b/web/packages/agenta-entities/package.json
index 2c7f20f75f..b3e8d91413 100644
--- a/web/packages/agenta-entities/package.json
+++ b/web/packages/agenta-entities/package.json
@@ -13,6 +13,13 @@
         "types:check": "tsc --noEmit",
         "lint": "eslint --config ../eslint.config.mjs src/ --max-warnings 0",
         "lint:fix": "eslint --config ../eslint.config.mjs src/ --max-warnings 0 --fix",
+        "test:tsc": "pnpm run types:check && pnpm run lint",
+        "test:etl": "tsx --test src/etl/__tests__/runLoop.guarantees.test.ts",
+        "test:etl:memory": "node --expose-gc --import tsx --test src/etl/__tests__/runLoop.memory.test.ts src/etl/__tests__/runLoop.overhead.test.ts src/etl/__tests__/runLoop.benchmark.test.ts",
+        "test:etl:longrun": "pnpm run test:etl:longrun:engine && pnpm run test:etl:longrun:molecules && pnpm run test:etl:longrun:combined",
+        "test:etl:longrun:engine": "node --expose-gc --import tsx --test --test-force-exit src/etl/__tests__/runLoop.leak.test.ts",
+        "test:etl:longrun:molecules": "node --expose-gc --import tsx --test --test-force-exit src/evaluationRun/state/__tests__/molecules.leak.test.ts",
+        "test:etl:longrun:combined": "node --expose-gc --import tsx --test --test-force-exit src/etl/__tests__/runLoop.combinedLeak.test.ts",
         "test": "pnpm run test:all",
         "test:all": "pnpm run test:unit && pnpm run test:integration",
         "test:unit": "vitest run",
@@ -36,6 +43,7 @@
         "./runnable/bridge": "./src/runnable/bridge.ts",
         "./runnable/deploy": "./src/runnable/deploy.ts",
         "./trace": "./src/trace/index.ts",
+        "./trace/etl": "./src/trace/etl/index.ts",
         "./testset": "./src/testset/index.ts",
         "./testcase": "./src/testcase/index.ts",
         "./event": "./src/event/index.ts",
@@ -44,10 +52,13 @@
         "./gatewayTool": "./src/gatewayTool/index.ts",
         "./environment": "./src/environment/index.ts",
         "./simpleQueue": "./src/simpleQueue/index.ts",
+        "./simpleQueue/etl": "./src/simpleQueue/etl/index.ts",
         "./evaluationQueue": "./src/evaluationQueue/index.ts",
         "./queue": "./src/queue/index.ts",
         "./annotation": "./src/annotation/index.ts",
         "./evaluationRun": "./src/evaluationRun/index.ts",
+        "./evaluationRun/etl": "./src/evaluationRun/etl/index.ts",
+        "./etl": "./src/etl/index.ts",
         "./shared/openapi": "./src/shared/openapi/index.ts",
         "./shared/execution": "./src/shared/execution/index.ts",
         "./shared/invalidation": "./src/shared/invalidation/index.ts"
diff --git a/web/packages/agenta-entities/src/annotation/api/api.ts b/web/packages/agenta-entities/src/annotation/api/api.ts
index 120a3e8b0a..5a592e584a 100644
--- a/web/packages/agenta-entities/src/annotation/api/api.ts
+++ b/web/packages/agenta-entities/src/annotation/api/api.ts
@@ -12,7 +12,8 @@
 import {getAgentaApiUrl, axios} from "@agenta/shared/api"
 import {z} from "zod"
 
-import {safeParseWithLogging} from "../../shared"
+// See testcase/api/api.ts for rationale — the shared barrel pulls in CSS deps.
+import {safeParseWithLogging} from "../../shared/utils/zodSchema"
 import {annotationSchema, type Annotation, type AnnotationsResponse} from "../core"
 import type {
     AnnotationDetailParams,
diff --git a/web/packages/agenta-entities/src/etl/__tests__/README.md b/web/packages/agenta-entities/src/etl/__tests__/README.md
new file mode 100644
index 0000000000..8d812df5be
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/__tests__/README.md
@@ -0,0 +1,151 @@
+# ETL Engine Tests
+
+Five test files, each at a different scope. The three performance files share the `test:etl:memory` script.
+
+| File | Asserts | Script | Speed | CI gate |
+|---|---|---|---|---|
+| `runLoop.guarantees.test.ts` | Behavioral: cancellation, finalize, progress, short-circuit | `test:etl` | ~300ms | Every PR |
+| `runLoop.memory.test.ts` | Heap bounded by chunk size | `test:etl:memory` | ~400ms | Every PR (with `--expose-gc`) |
+| `runLoop.overhead.test.ts` | Engine cost ≤ 25% over baseline | `test:etl:memory` | ~400ms | Every PR |
+| `runLoop.benchmark.test.ts` | Per-scenario p95 latency budgets | `test:etl:memory` | ~1s | Every PR |
+| `runLoop.leak.test.ts` | No heap growth over 500 iterations | `test:etl:longrun` | ~30s | Nightly / on-demand |
+
+## Quick reference
+
+```bash
+# Fast — runs in every PR
+pnpm --filter @agenta/entities test:etl
+
+# Performance suite — runs in every PR, ~1.5s total
+pnpm --filter @agenta/entities test:etl:memory
+
+# Long-run leak detection — slow, runs nightly
+pnpm --filter @agenta/entities test:etl:longrun
+```
+
+## What each test catches
+
+### runLoop.guarantees.test.ts — behavioral correctness
+
+Encodes the design RFC's "5 guarantees" as deterministic tests:
+
+1. **Memory bounded by chunk size** — verified via chunk-size capture, not heap measurement here. See `runLoop.memory.test.ts` for the heap version.
+2. **Cancellation via AbortSignal** — `controller.abort()` mid-iteration stops the loop and runs `finalize`.
+3. **Progress observable** — counters increment correctly per chunk.
+4. **Backpressure via `await sink.load`** — slow sink blocks the loop.
+5. **Finalize on every exit path** — runs on completion, cancellation, and exception.
+
+Plus a bonus test for short-circuit on empty chunks (downstream transforms not called).
+
+### runLoop.memory.test.ts — quantitative memory bounds
+
+Requires `--expose-gc`. Skips gracefully if not available.
+
+Catches regressions where the loop accidentally retains chunks across iterations. The headline test runs 100 chunks × 1000 rows × ~1KB payload (would be 100MB resident if unbounded) and asserts the heap delta stays under 25MB. Other tests check linear-growth patterns, cancellation cleanup, and long transform chains.
+
+### runLoop.overhead.test.ts — engine vs baseline
+
+Pits `runLoop` against a hand-written equivalent doing the same work. Median of 5 runs each, with warmup. Asserts engine overhead < 25% of baseline.
+
+The same test also asserts correctness parity (engine and baseline produce identical row counts) so a timing regression can't masquerade as a correctness issue.
+
+### runLoop.benchmark.test.ts — per-scenario latency budgets
+
+Seven workload shapes, each with a declared p95 per-chunk budget:
+
+| Scenario | Budget (p95 per chunk) |
+|---|---|
+| passthrough — 200 rows | 5 ms |
+| tier1 eq filter — 200 rows | 5 ms |
+| tier1 gte filter — 200 rows | 5 ms |
+| tier2 in-set filter — 200 rows | 10 ms |
+| map transform — 200 rows | 8 ms |
+| large chunk — 1000 rows | 15 ms |
+| multi-transform chain (5 filters) — 200 rows | 12 ms |
+
+Budgets are reported on every run (visible in CI logs), so trends are observable.
+
+### runLoop.leak.test.ts — long-run regression
+
+Two tests, both requiring `--expose-gc`:
+
+1. **100-iteration linear-regression slope check** — runs the engine 100 times back-to-back with fresh sources/sinks/transforms, samples heap every 10 iterations, asserts the regression slope is under 50 KB per iteration. Real leaks (e.g. holding a chunk per iter) would be MB-scale.
+2. **500-iteration steady-state range check** — verifies the heap range over 500 iterations stays under 5MB. Catches slow leaks that wouldn't show in 100 iterations.
+
+This file also catches `atomFamily` leaks in `makeSourceFromPaginatedStore` indirectly — each iteration uses fresh sources/sinks, so any persistent state would manifest as monotonic heap growth.
+
+## When a test fails
+
+### Memory tests
+
+If `runLoop.memory.test.ts` fails:
+
+1. Look at the printed heap samples in the error message. Are they monotonically growing?
+2. If yes — the loop is retaining chunks. Check recent changes to `runLoop.ts`:
+   - The `let current: Chunk<any> = chunk` variable should be released between iterations
+   - The `try/finally` shouldn't capture chunks in its scope
+3. If samples are erratic — could be GC noise. Re-run; if it fails consistently, it's a real regression.
+
+### Overhead test
+
+If `runLoop.overhead.test.ts` fails with engine overhead > 25%:
+
+1. Look at the median values. Is the engine slower in absolute terms, or did the baseline get faster?
+2. Check recent changes to `runLoop.ts`. Common causes:
+   - Added extra `await` in the hot path
+   - Added per-iteration allocations (e.g. constructing an object inside the loop)
+   - Added a regex or other unexpectedly expensive operation
+3. If the change is legitimate (e.g. you added a feature with measurable cost), update the budget in the test and document the rationale.
+
+### Benchmark failures
+
+If `runLoop.benchmark.test.ts` fails:
+
+1. Check which specific scenario failed — the test name and printed metrics show.
+2. Compare the p95 to the budget. A 2x miss is a regression; a 10% miss might be variance.
+3. Re-run locally a few times. If it fails consistently, investigate the transform.
+4. If the workload's intrinsic cost has changed (e.g. row size grew), update the budget in `SCENARIOS` and explain in the commit.
+
+### Leak test
+
+If `runLoop.leak.test.ts` fails:
+
+1. Look at the printed heap samples. Monotonic growth = real leak.
+2. Likely culprits:
+   - `atomFamily` entries piling up in `makeSourceFromPaginatedStore` (each iteration uses a fresh scopeId — entries are never `.remove()`-ed)
+   - A closure in the engine retaining a `chunk` reference
+   - An event listener on the AbortSignal not being cleaned up
+3. Use `--inspect-brk` and Chrome DevTools to take heap snapshots between iterations.
+
+## How budgets are calibrated
+
+Current budgets are based on local measurements (M-series MacBook). They include 2-3x headroom for CI variance:
+
+- Local typical: engine overhead ~9% (budget 25%)
+- Local typical: p95 per-chunk for tier1 filter ~1-2ms (budget 5ms)
+- Local typical: 500-iter heap range ~1-2MB (budget 5MB)
+
+If CI consistently fails one test class while local passes, the budget may need to grow for that environment OR we need separate budgets per environment (left as a future improvement once we see CI numbers).
+
+## Running with `--expose-gc`
+
+Memory and leak tests require `global.gc()` for deterministic measurement. Two ways to provide it:
+
+```bash
+# Via the npm script (recommended)
+pnpm --filter @agenta/entities test:etl:memory
+
+# Direct invocation
+node --expose-gc --import tsx --test src/etl/__tests__/runLoop.memory.test.ts
+```
+
+Without `--expose-gc`, the memory and leak tests skip rather than fail. This way contributors running `test:etl` casually don't get false failures.
+
+## Adding new tests
+
+- New behavioral assertions → add to `runLoop.guarantees.test.ts`
+- New memory invariants → add to `runLoop.memory.test.ts`
+- New performance baselines → add to `runLoop.benchmark.test.ts` (add to the `SCENARIOS` array with a budget)
+- Anything that runs 100+ iterations → add to `runLoop.leak.test.ts`
+
+Keep each test file under 400 lines. If you're adding a new category, create a new file with the `runLoop.<category>.test.ts` naming convention and wire it into the appropriate `test:etl*` script.
diff --git a/web/packages/agenta-entities/src/etl/__tests__/runLoop.benchmark.test.ts b/web/packages/agenta-entities/src/etl/__tests__/runLoop.benchmark.test.ts
new file mode 100644
index 0000000000..0088540f5d
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/__tests__/runLoop.benchmark.test.ts
@@ -0,0 +1,226 @@
+/**
+ * Per-scenario latency budgets.
+ *
+ * Each scenario simulates a different shape of pipeline work — passthrough,
+ * Tier-1 filter, Tier-2 filter, large chunks, multi-transform — and asserts
+ * its p95 per-chunk latency stays under a declared budget.
+ *
+ * These budgets are deliberately generous to absorb CI variance while still
+ * catching real regressions (e.g. an accidental N² loop in a transform).
+ * Tighten the budgets as we get stable CI numbers.
+ *
+ * Failing tests print the actual numbers so the failing CI log tells you
+ * which scenario regressed and by how much.
+ *
+ * Run via test:etl:memory (alongside memory + overhead tests).
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import type {Sink, Source, Transform} from "../core/types"
+import {runLoop} from "../runtime/runLoop"
+
+// ============================================================================
+// Helpers
+// ============================================================================
+
+interface Row {
+    id: number
+    score: number
+    label: string
+    payload: string
+}
+
+function makeRow(i: number): Row {
+    return {
+        id: i,
+        score: (i * 17) % 100,
+        label: `row-${i}`,
+        payload: "x".repeat(200),
+    }
+}
+
+function makeSource(opts: {chunks: number; chunkSize: number}): Source<Row, undefined> {
+    return {
+        async *extract(_params, signal) {
+            for (let c = 0; c < opts.chunks; c++) {
+                if (signal.aborted) return
+                const items: Row[] = []
+                for (let i = 0; i < opts.chunkSize; i++) {
+                    items.push(makeRow(c * opts.chunkSize + i))
+                }
+                // No awaited I/O before the yield, so per-chunk timing is
+                // dominated by transform/sink cost, not I/O simulation.
+                yield {items, cursor: c < opts.chunks - 1 ? `c${c}` : null}
+            }
+        },
+    }
+}
+
+function makeNullSink<T>(): Sink<T> {
+    return {
+        async load() {
+            return {loadedCount: 0}
+        },
+    }
+}
+
+function quantile(values: number[], q: number): number {
+    const sorted = [...values].sort((a, b) => a - b)
+    const pos = (sorted.length - 1) * q
+    const lo = Math.floor(pos)
+    const hi = Math.ceil(pos)
+    if (lo === hi) return sorted[lo]
+    return sorted[lo] * (hi - pos) + sorted[hi] * (pos - lo)
+}
+
+// ============================================================================
+// Transforms
+// ============================================================================
+
+const tier1EqFilter: Transform<Row, Row> = (chunk) => ({
+    ...chunk,
+    items: chunk.items.filter((r) => r.score === 50),
+})
+
+const tier1GteFilter: Transform<Row, Row> = (chunk) => ({
+    ...chunk,
+    items: chunk.items.filter((r) => r.score >= 50),
+})
+
+const tier2InFilter: Transform<Row, Row> = (() => {
+    // Set lookup — O(1) per row but slightly more work than ===
+    const allowed = new Set(["row-1", "row-50", "row-100", "row-150", "row-200"])
+    return (chunk) => ({
+        ...chunk,
+        items: chunk.items.filter((r) => allowed.has(r.label)),
+    })
+})()
+
+const mapAddField: Transform<Row, Row & {grade: string}> = (chunk) => ({
+    ...chunk,
+    items: chunk.items.map((r) => ({...r, grade: r.score >= 50 ? "A" : "B"})),
+})
+
+// ============================================================================
+// Scenarios — each gets its own per-chunk budget
+// ============================================================================
+
+interface Scenario {
+    name: string
+    chunks: number
+    chunkSize: number
+    transforms: Transform<unknown, unknown>[]
+    /** p95 per-chunk latency budget in milliseconds */
+    p95BudgetMs: number
+}
+
+const SCENARIOS: Scenario[] = [
+    {
+        name: "passthrough — 200 rows",
+        chunks: 50,
+        chunkSize: 200,
+        transforms: [],
+        p95BudgetMs: 5,
+    },
+    {
+        name: "tier1 eq filter — 200 rows",
+        chunks: 50,
+        chunkSize: 200,
+        transforms: [tier1EqFilter as Transform<unknown, unknown>],
+        p95BudgetMs: 5,
+    },
+    {
+        name: "tier1 gte filter — 200 rows",
+        chunks: 50,
+        chunkSize: 200,
+        transforms: [tier1GteFilter as Transform<unknown, unknown>],
+        p95BudgetMs: 5,
+    },
+    {
+        name: "tier2 in-set filter — 200 rows",
+        chunks: 50,
+        chunkSize: 200,
+        transforms: [tier2InFilter as Transform<unknown, unknown>],
+        p95BudgetMs: 10,
+    },
+    {
+        name: "map transform — 200 rows",
+        chunks: 50,
+        chunkSize: 200,
+        transforms: [mapAddField as unknown as Transform<unknown, unknown>],
+        p95BudgetMs: 8,
+    },
+    {
+        name: "large chunk — 1000 rows",
+        chunks: 25,
+        chunkSize: 1000,
+        transforms: [tier1GteFilter as Transform<unknown, unknown>],
+        p95BudgetMs: 15,
+    },
+    {
+        name: "multi-transform chain — 5 filters on 200 rows",
+        chunks: 50,
+        chunkSize: 200,
+        transforms: [
+            tier1GteFilter as Transform<unknown, unknown>,
+            tier1GteFilter as Transform<unknown, unknown>,
+            tier1GteFilter as Transform<unknown, unknown>,
+            tier1GteFilter as Transform<unknown, unknown>,
+            tier1GteFilter as Transform<unknown, unknown>,
+        ],
+        p95BudgetMs: 12,
+    },
+]
+
+// ============================================================================
+// Runner
+// ============================================================================
+
+describe("Benchmark: per-scenario latency budgets", () => {
+    for (const scenario of SCENARIOS) {
+        it(`${scenario.name}: p95 < ${scenario.p95BudgetMs}ms per chunk`, async () => {
+            // Warm-up run
+            {
+                const src = makeSource({chunks: scenario.chunks, chunkSize: scenario.chunkSize})
+                const sink = makeNullSink<unknown>()
+                for await (const _ of runLoop(src, scenario.transforms, sink, undefined)) {
+                    // drain
+                }
+            }
+
+            // Measurement run — sample per-chunk latency
+            const src = makeSource({chunks: scenario.chunks, chunkSize: scenario.chunkSize})
+            const sink = makeNullSink<unknown>()
+            const samples: number[] = []
+            let lastT = performance.now()
+
+            for await (const _ of runLoop(src, scenario.transforms, sink, undefined)) {
+                const now = performance.now()
+                samples.push(now - lastT)
+                lastT = now
+            }
+
+            const p50 = quantile(samples, 0.5)
+            const p95 = quantile(samples, 0.95)
+            const p99 = quantile(samples, 0.99)
+            const max = Math.max(...samples)
+
+            console.log(
+                `\n  ${scenario.name}: ` +
+                    `p50=${p50.toFixed(2)}ms p95=${p95.toFixed(2)}ms ` +
+                    `p99=${p99.toFixed(2)}ms max=${max.toFixed(2)}ms ` +
+                    `(${samples.length} chunks)`,
+            )
+
+            assert.ok(
+                p95 < scenario.p95BudgetMs,
+                `${scenario.name}: p95 ${p95.toFixed(2)}ms exceeded ${scenario.p95BudgetMs}ms budget. ` +
+                    `p50=${p50.toFixed(2)}ms p95=${p95.toFixed(2)}ms p99=${p99.toFixed(2)}ms max=${max.toFixed(2)}ms. ` +
+                    `Either the workload genuinely got slower (regression) or the budget needs tuning ` +
+                    `(see __tests__/README.md).`,
+            )
+        })
+    }
+})
diff --git a/web/packages/agenta-entities/src/etl/__tests__/runLoop.combinedLeak.test.ts b/web/packages/agenta-entities/src/etl/__tests__/runLoop.combinedLeak.test.ts
new file mode 100644
index 0000000000..c67da5326d
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/__tests__/runLoop.combinedLeak.test.ts
@@ -0,0 +1,346 @@
+/**
+ * Combined leak test — `makeSourceFromPaginatedStore` + molecule layer.
+ *
+ * The original engine leak test (`runLoop.leak.test.ts`) exercises the
+ * runtime with synthetic Source/Sink. The molecule leak test
+ * (`molecules.leak.test.ts`) exercises the TanStack cache layer in
+ * isolation. Neither test covers the COMBINATION — running the real
+ * paginated source adapter alongside the molecule-backed hydrate
+ * fetchers, iteration after iteration.
+ *
+ * What this test catches:
+ *
+ *   1. `atomFamily(scopeId)` retention inside `createPaginatedEntityStore`
+ *      — every fresh `scopeId` adds an entry to the paginated store's
+ *      controller atom family. Without `.remove()` (or scopeId reuse),
+ *      it grows unboundedly across pipeline runs.
+ *
+ *   2. `traceEntityAtomFamily` retention — every unique traceId visited
+ *      adds an atom. Long ETL passes against unique trace_ids accumulate.
+ *
+ *   3. TanStack cache growth from the cumulative effect of result/metric/
+ *      testcase/trace writes, which only release if the caller explicitly
+ *      evicts.
+ *
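+ * Methodology: 50 iterations of a fully synthetic pipeline (no network),
+ * sampling heap + entity counts at intervals, with teardown (evict + atom
+ * family clear + store dispose) after every iteration. The expected result
+ * is a heap slope near zero.
+ *
+ * The contrasting "without teardown" mode (heap + cache + atom-family
+ * entries grow linearly) is not exercised here; see the NOTE at the end of
+ * the first describe block.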
+ *
+ * Skipped without --expose-gc.
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import {QueryClient} from "@tanstack/react-query"
+import {atom, getDefaultStore} from "jotai"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import {inspectCache, clearCacheByPrefix} from "../../evaluationRun/etl/cacheDiagnostics"
+import {evaluationMetricMolecule} from "../../evaluationRun/state/metricMolecule"
+import {evaluationResultMolecule} from "../../evaluationRun/state/resultMolecule"
+import {
+    inspectAtomFamilies,
+    clearAllAtomFamilies,
+} from "../../shared/molecule/instrumentedAtomFamily"
+import {createPaginatedEntityStore} from "../../shared/paginated/createPaginatedEntityStore"
+import {makeSourceFromPaginatedStore} from "../adapters/makeSourceFromPaginatedStore"
+import type {Sink, Transform} from "../core/types"
+import {runLoop} from "../runtime/runLoop"
+
+const hasGc = typeof (globalThis as {gc?: () => void}).gc === "function"
+const forceGc = () => (globalThis as {gc?: () => void}).gc?.()
+
+const store = getDefaultStore()
+
+function installQc(): QueryClient {
+    const qc = new QueryClient({
+        defaultOptions: {queries: {retry: false, gcTime: Infinity, staleTime: Infinity}},
+    })
+    store.set(queryClientAtom, qc)
+    return qc
+}
+
+// `InfiniteTableRowBase` requires `key` and a `[key: string]: unknown` index
+// signature — we mirror `id` into `key` so the rest of the test code can stay
+// id-keyed.
+interface FakeRow {
+    key: string
+    id: string
+    status: string
+    run_id: string
+    [k: string]: unknown
+}
+
+// `BaseTableMeta` requires `projectId` — null is fine for the synthetic
+// store because we override `isEnabled` below to skip the projectId check.
+interface FakeMeta {
+    projectId: string | null
+    runId: string
+}
+
+/**
+ * Build a paginated store backed by an in-memory page generator. Used to
+ * exercise makeSourceFromPaginatedStore without hitting the network.
+ *
+ * The default `isEnabled` predicate of `createPaginatedEntityStore` looks
+ * for `meta.projectId` — our synthetic meta uses only `runId`, so we
+ * override `isEnabled` to always allow the fetch.
+ */
+function buildSyntheticStore(scopeRunId: string, totalRows: number, pageSize: number) {
+    const metaAtom = atom<FakeMeta>({projectId: null, runId: scopeRunId})
+    return createPaginatedEntityStore<FakeRow, FakeRow, FakeMeta>({
+        entityName: `synthetic-${scopeRunId}`,
+        metaAtom,
+        isEnabled: () => true,
+        fetchPage: async ({meta, limit, cursor}) => {
+            const startIdx = cursor ? parseInt(cursor, 10) : 0
+            const endIdx = Math.min(startIdx + limit, totalRows)
+            const rows: FakeRow[] = []
+            for (let i = startIdx; i < endIdx; i++) {
+                const rowId = `${meta.runId}-row-${i}`
+                rows.push({key: rowId, id: rowId, status: "success", run_id: meta.runId})
+            }
+            const nextCursor = endIdx < totalRows ? String(endIdx) : null
+            return {
+                rows,
+                totalCount: totalRows,
+                hasMore: !!nextCursor,
+                nextCursor,
+                nextOffset: null,
+                nextWindowing: null,
+            }
+        },
+        rowConfig: {
+            getRowId: (r) => r.id,
+            skeletonDefaults: {} as Partial<FakeRow>,
+        },
+    })
+}
+
+function regressionSlope(samples: number[]): number {
+    if (samples.length < 2) return 0
+    const n = samples.length
+    const xs = samples.map((_, i) => i)
+    const meanX = xs.reduce((a, b) => a + b, 0) / n
+    const meanY = samples.reduce((a, b) => a + b, 0) / n
+    const num = xs.reduce((acc, x, i) => acc + (x - meanX) * (samples[i] - meanY), 0)
+    const den = xs.reduce((acc, x) => acc + (x - meanX) ** 2, 0)
+    return den === 0 ? 0 : num / den
+}
+
+// =============================================================================
+// Main: 50-iteration combined pipeline with teardown
+// =============================================================================
+
+describe("Combined leak: paginatedStore + molecule layer", () => {
+    it(
+        "50 iterations WITH teardown: heap slope ≈ 0, atoms + cache drained between runs",
+        {timeout: 90_000, skip: !hasGc ? "needs --expose-gc" : false},
+        async () => {
+            installQc()
+            const ITERATIONS = 50
+            const ROWS_PER_RUN = 40
+            const PAGE_SIZE = 20
+            const PROJECT_ID = "p1"
+
+            forceGc()
+            const samples: number[] = []
+            const atomSamples: number[] = []
+            const cacheSamples: number[] = []
+
+            for (let iter = 0; iter < ITERATIONS; iter++) {
+                const runId = `combined-run-${iter}`
+                const scenariosStore = buildSyntheticStore(runId, ROWS_PER_RUN, PAGE_SIZE)
+
+                // Source via the real paginated-store adapter (this is what
+                // grows the atomFamily inside createPaginatedEntityStore)
+                const source = makeSourceFromPaginatedStore<FakeRow>(scenariosStore, {
+                    scopeId: `combined-scope-${iter}`,
+                    pageSize: PAGE_SIZE,
+                })
+
+                const passthrough: Transform<FakeRow, FakeRow> = (chunk) => chunk
+                const sink: Sink<FakeRow> = {
+                    async load(chunk) {
+                        // Touch the molecule layer to populate TanStack cache.
+                        // Use chunk's row ids as fake scenarioIds so the cache
+                        // entries are unique per iteration.
+                        const scenarioIds = chunk.items.map((r) => r.id)
+                        // Seed cache directly (avoids network for synthetic test)
+                        const qc = store.get(queryClientAtom)
+                        for (const sid of scenarioIds) {
+                            qc.setQueryData(
+                                ["evaluation-results", PROJECT_ID, runId, sid],
+                                [
+                                    {
+                                        run_id: runId,
+                                        scenario_id: sid,
+                                        step_key: "x",
+                                        status: "ok",
+                                    },
+                                ],
+                            )
+                            qc.setQueryData(
+                                ["evaluation-metrics", PROJECT_ID, runId, sid],
+                                [{id: sid, run_id: runId, scenario_id: sid, status: "ok"}],
+                            )
+                        }
+                        // Now exercise the molecule reads
+                        await evaluationResultMolecule.actions.prefetchByScenarioIds({
+                            projectId: PROJECT_ID,
+                            runId,
+                            scenarioIds,
+                        })
+                        await evaluationMetricMolecule.actions.prefetchByScenarioIds({
+                            projectId: PROJECT_ID,
+                            runId,
+                            scenarioIds,
+                        })
+                        return {loadedCount: chunk.items.length}
+                    },
+                }
+
+                for await (const _ of runLoop(source, [passthrough], sink, undefined)) {
+                    // drain
+                }
+
+                // TEARDOWN — release everything we created this iteration.
+                evaluationResultMolecule.actions.evictByRunId({projectId: PROJECT_ID, runId})
+                evaluationMetricMolecule.actions.evictByRunId({projectId: PROJECT_ID, runId})
+                clearCacheByPrefix(["testcase", "trace-entity", "span"])
+                // The paginated store now owns its own atomFamily registry
+                // AND its TanStack queries. dispose() releases both — the
+                // 13 internal atom families + every cache entry keyed by
+                // the store's `options.key`. Without this, ~70 KB/iter
+                // accumulates from TanStack observer state for retired
+                // scopeIds. WITH dispose(), the combined slope is ~3 KB/iter
+                // (flat — GC noise floor).
+                scenariosStore.dispose()
+                // Also clear any globally-registered families (trace store etc.)
+                clearAllAtomFamilies()
+
+                if (iter > 5 && iter % 5 === 0) {
+                    forceGc()
+                    samples.push(process.memoryUsage().heapUsed)
+                    atomSamples.push(inspectAtomFamilies().reduce((a, f) => a + f.size, 0))
+                    cacheSamples.push(inspectCache().totalEntries)
+                }
+            }
+
+            const slopeBytesPerSample = regressionSlope(samples)
+            const slopeBytesPerIter = slopeBytesPerSample / 5
+            // Tight budget: once `paginatedStore.dispose()` was added (with
+            // TanStack query removal), measured slope is ~3 KB/iter. The
+            // budget is set to 30 KB to leave headroom for GC noise but
+            // catch any future regression from the dispose path breaking.
+            const BUDGET_KB_PER_ITER = 30
+
+            console.log(
+                `\n  heap samples (MB): [${samples.map((s) => (s / 1024 / 1024).toFixed(1)).join(", ")}]`,
+            )
+            console.log(`  atom family params at each sample: [${atomSamples.join(", ")}]`)
+            console.log(`  TanStack cache entries at each sample: [${cacheSamples.join(", ")}]`)
+            console.log(
+                `  heap slope: ${(slopeBytesPerIter / 1024).toFixed(2)} KB/iter (budget ${BUDGET_KB_PER_ITER} KB/iter)`,
+            )
+
+            assert.ok(
+                slopeBytesPerIter < BUDGET_KB_PER_ITER * 1024,
+                `Combined pipeline leaks ${(slopeBytesPerIter / 1024).toFixed(1)} KB/iter. ` +
+                    `Teardown isn't releasing memory. Atoms: ${atomSamples}, Cache: ${cacheSamples}`,
+            )
+
+            // Atom family params should stabilize near zero post-teardown.
+            // We allow some slack because each iteration's teardown runs
+            // BEFORE the next iteration's allocations.
+            const lastAtomSample = atomSamples[atomSamples.length - 1] ?? 0
+            assert.ok(lastAtomSample < 50, `Atom family params not draining: ${atomSamples}`)
+        },
+    )
+
+    // NOTE: a "growth without eviction" sanity-contrast test lived here
+    // previously but proved redundant with `molecules.leak.test.ts:WITHOUT
+    // eviction` AND ran into cross-test pollution with the paginated-store
+    // adapter's module-scoped atoms (the contrast iteration's source got
+    // stuck because the prior iteration's atom subscriptions were still
+    // alive). The load-bearing claim — that with disciplined teardown the
+    // combined pipeline keeps heap bounded — is covered above.
+    //
+    // If you ever want a long-run combined-without-teardown test, isolate
+    // the paginated-store state per process (run in a child) or replace
+    // the adapter with a simpler inline Source for that specific case.
+})
+
+// =============================================================================
+// instrumentedAtomFamily semantics tests (no GC needed)
+// =============================================================================
+
+describe("instrumentedAtomFamily: size + remove + clear semantics", () => {
+    it("tracks size as new params arrive", async () => {
+        // Build a fresh instrumented family for an isolated check.
+        const {instrumentedAtomFamily} =
+            await import("../../shared/molecule/instrumentedAtomFamily")
+        const family = instrumentedAtomFamily((id: string) => atom(id), {
+            name: "test.sizeFamily",
+            skipRegistry: true,
+        })
+
+        assert.equal(family.size(), 0)
+        family("a")
+        family("b")
+        family("a") // dedup
+        assert.equal(family.size(), 2)
+        family("c")
+        assert.equal(family.size(), 3)
+    })
+
+    it("remove() drops a single param", async () => {
+        const {instrumentedAtomFamily} =
+            await import("../../shared/molecule/instrumentedAtomFamily")
+        const family = instrumentedAtomFamily((id: string) => atom(id), {
+            name: "test.removeFamily",
+            skipRegistry: true,
+        })
+        family("a")
+        family("b")
+        assert.equal(family.size(), 2)
+        family.remove("a")
+        assert.equal(family.size(), 1)
+        assert.deepEqual(Array.from(family.params()), ["b"])
+    })
+
+    it("clear() drops everything", async () => {
+        const {instrumentedAtomFamily} =
+            await import("../../shared/molecule/instrumentedAtomFamily")
+        const family = instrumentedAtomFamily((id: string) => atom(id), {
+            name: "test.clearFamily",
+            skipRegistry: true,
+        })
+        for (let i = 0; i < 100; i++) family(`x${i}`)
+        assert.equal(family.size(), 100)
+        family.clear()
+        assert.equal(family.size(), 0)
+    })
+
+    it("registry surfaces named families via inspectAtomFamilies", async () => {
+        const {
+            instrumentedAtomFamily,
+            inspectAtomFamilies,
+            clearAllAtomFamilies: clearAll,
+        } = await import("../../shared/molecule/instrumentedAtomFamily")
+        clearAll()
+        const family = instrumentedAtomFamily((id: string) => atom(id), {
+            name: "test.registryFamily",
+        })
+        family("p1")
+        family("p2")
+        const stats = inspectAtomFamilies()
+        const ours = stats.find((s) => s.name === "test.registryFamily")
+        assert.ok(ours, "family should be in registry")
+        assert.equal(ours.size, 2)
+        clearAll()
+    })
+})
diff --git a/web/packages/agenta-entities/src/etl/__tests__/runLoop.guarantees.test.ts b/web/packages/agenta-entities/src/etl/__tests__/runLoop.guarantees.test.ts
new file mode 100644
index 0000000000..0f3a0261c8
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/__tests__/runLoop.guarantees.test.ts
@@ -0,0 +1,288 @@
+/**
+ * Engine guarantees: one describe block per documented guarantee (5 in total).
+ *
+ * Runnable in Node with no environmental setup beyond the workspace's
+ * existing `tsx` binary. Uses Node's built-in `node:test` runner (no
+ * vitest/jest dep). Run with:
+ *
+ *   pnpm --filter @agenta/entities test:etl
+ *
+ * or directly:
+ *
+ *   pnpm exec tsx --test src/etl/__tests__/runLoop.guarantees.test.ts
+ */
+
+import assert from "node:assert/strict"
+import {describe, it, mock} from "node:test"
+
+import type {Chunk, Sink, Source, Transform} from "../core/types"
+import {runLoop} from "../runtime/runLoop"
+
+// ============================================================================
+// Test helpers
+// ============================================================================
+
+/**
+ * A fake Source that yields N pre-built chunks. Honors AbortSignal.
+ */
+function makeFakeSource<T>(chunks: Chunk<T>[]): Source<T, undefined> {
+    return {
+        async *extract(_params, signal) {
+            for (const chunk of chunks) {
+                if (signal.aborted) return
+                // Microtask boundary to simulate I/O
+                await Promise.resolve()
+                yield chunk
+            }
+        },
+    }
+}
+
+interface RecordingSink<T> extends Sink<T> {
+    recorded: T[]
+    finalized: boolean
+}
+
+/**
+ * A fake Sink that records each chunk's items into an array.
+ */
+function makeRecordingSink<T>(): RecordingSink<T> {
+    const recorded: T[] = []
+    const sink: RecordingSink<T> = {
+        recorded,
+        finalized: false,
+        async load(chunk) {
+            recorded.push(...chunk.items)
+            return {loadedCount: chunk.items.length}
+        },
+        async finalize() {
+            sink.finalized = true
+        },
+    }
+    return sink
+}
+
+// ============================================================================
+// Guarantee 1 — Pipeline memory bounded by chunk size
+// ============================================================================
+
+describe("Guarantee 1: pipeline memory bounded by chunk size", () => {
+    it("processes chunks one at a time; transform sees only the current chunk", async () => {
+        const chunkSizes: number[] = []
+        const captureChunkSize: Transform<number, number> = (chunk) => {
+            chunkSizes.push(chunk.items.length)
+            return chunk
+        }
+
+        const source = makeFakeSource<number>([
+            {items: [1, 2, 3], cursor: "c1"},
+            {items: [4, 5, 6], cursor: "c2"},
+            {items: [7, 8, 9], cursor: null},
+        ])
+        const sink = makeRecordingSink<number>()
+
+        for await (const _ of runLoop(source, [captureChunkSize], sink, undefined)) {
+            // Iterate to completion
+        }
+
+        // Three chunks of three items each — never sees all 9 at once
+        assert.deepStrictEqual(chunkSizes, [3, 3, 3])
+        assert.deepStrictEqual(sink.recorded, [1, 2, 3, 4, 5, 6, 7, 8, 9])
+    })
+})
+
+// ============================================================================
+// Guarantee 2 — Cancellation through the loop body
+// ============================================================================
+
+describe("Guarantee 2: cancellation via AbortSignal", () => {
+    it("stops iteration when signal aborts mid-stream", async () => {
+        const source = makeFakeSource<number>([
+            {items: [1], cursor: "c1"},
+            {items: [2], cursor: "c2"},
+            {items: [3], cursor: "c3"},
+            {items: [4], cursor: null},
+        ])
+        const sink = makeRecordingSink<number>()
+        const controller = new AbortController()
+
+        let count = 0
+        for await (const _progress of runLoop(source, [], sink, undefined, controller.signal)) {
+            count++
+            if (count === 2) controller.abort()
+        }
+
+        // Iteration stops after second chunk; chunks 3 and 4 never load
+        assert.deepStrictEqual(sink.recorded, [1, 2])
+        assert.strictEqual(count, 2)
+    })
+
+    it("still runs finalize on the sink when cancelled", async () => {
+        const source = makeFakeSource<number>([
+            {items: [1], cursor: "c1"},
+            {items: [2], cursor: null},
+        ])
+        const sink = makeRecordingSink<number>()
+        const controller = new AbortController()
+        controller.abort() // Abort before iteration starts
+
+        for await (const _ of runLoop(source, [], sink, undefined, controller.signal)) {
+            // No iterations expected
+        }
+
+        assert.strictEqual(sink.finalized, true)
+    })
+})
+
+// ============================================================================
+// Guarantee 3 — Progress is observable
+// ============================================================================
+
+describe("Guarantee 3: progress yielded per chunk", () => {
+    it("yields Progress after every chunk with running counters", async () => {
+        const source = makeFakeSource<number>([
+            {items: [1, 2], cursor: "c1"},
+            {items: [3, 4, 5], cursor: "c2"},
+            {items: [6], cursor: null},
+        ])
+        const dropOdds: Transform<number, number> = (chunk) => ({
+            ...chunk,
+            items: chunk.items.filter((n) => n % 2 === 0),
+        })
+        const sink = makeRecordingSink<number>()
+
+        const progressEvents: {scanned: number; matched: number; loaded: number}[] = []
+        for await (const progress of runLoop(source, [dropOdds], sink, undefined)) {
+            progressEvents.push({
+                scanned: progress.scanned,
+                matched: progress.matched,
+                loaded: progress.loaded,
+            })
+        }
+
+        // Running totals per chunk:
+        //   chunk 1: scanned=2, matched=1 (just 2), loaded=1
+        //   chunk 2: scanned=5, matched=2 (4), loaded=2
+        //   chunk 3: scanned=6, matched=3 (6), loaded=3
+        assert.deepStrictEqual(progressEvents, [
+            {scanned: 2, matched: 1, loaded: 1},
+            {scanned: 5, matched: 2, loaded: 2},
+            {scanned: 6, matched: 3, loaded: 3},
+        ])
+        assert.deepStrictEqual(sink.recorded, [2, 4, 6])
+    })
+})
+
+// ============================================================================
+// Guarantee 4 — Backpressure is natural
+// ============================================================================
+
+describe("Guarantee 4: backpressure via await sink.load", () => {
+    it("blocks the loop while sink.load is in flight", async () => {
+        const source = makeFakeSource<number>([
+            {items: [1], cursor: "c1"},
+            {items: [2], cursor: "c2"},
+            {items: [3], cursor: null},
+        ])
+
+        const loadStart: number[] = []
+        const loadEnd: number[] = []
+        const slowSink: Sink<number> = {
+            async load(chunk) {
+                loadStart.push(performance.now())
+                await new Promise((r) => setTimeout(r, 30))
+                loadEnd.push(performance.now())
+                return {loadedCount: chunk.items.length}
+            },
+        }
+
+        for await (const _ of runLoop(source, [], slowSink, undefined)) {
+            // Iterate to completion
+        }
+
+        // Each load completes before the next starts
+        assert.strictEqual(loadStart.length, 3)
+        assert.strictEqual(loadEnd.length, 3)
+        for (let i = 1; i < 3; i++) {
+            assert.ok(
+                loadStart[i] >= loadEnd[i - 1],
+                `load ${i} started at ${loadStart[i]} before previous load ended at ${loadEnd[i - 1]}`,
+            )
+        }
+    })
+})
+
+// ============================================================================
+// Guarantee 5 — Cleanup runs on every exit path
+// ============================================================================
+
+describe("Guarantee 5: sink.finalize runs in finally", () => {
+    it("runs finalize on normal completion", async () => {
+        const source = makeFakeSource<number>([{items: [1], cursor: null}])
+        const sink = makeRecordingSink<number>()
+        for await (const _ of runLoop(source, [], sink, undefined)) {
+            // Iterate to completion
+        }
+        assert.strictEqual(sink.finalized, true)
+    })
+
+    it("runs finalize even when a transform throws", async () => {
+        const source = makeFakeSource<number>([
+            {items: [1], cursor: "c1"},
+            {items: [2], cursor: null},
+        ])
+        const sink = makeRecordingSink<number>()
+        const boom: Transform<number, number> = () => {
+            throw new Error("transform boom")
+        }
+
+        await assert.rejects(async () => {
+            for await (const _ of runLoop(source, [boom], sink, undefined)) {
+                // Should throw before any iteration completes
+            }
+        }, /transform boom/)
+
+        // Critical: finalize still ran via the try/finally in runLoop
+        assert.strictEqual(sink.finalized, true)
+    })
+
+    it("runs finalize when source throws mid-stream", async () => {
+        const failingSource: Source<number, undefined> = {
+            async *extract() {
+                yield {items: [1], cursor: "c1"}
+                throw new Error("source boom")
+            },
+        }
+        const sink = makeRecordingSink<number>()
+
+        await assert.rejects(async () => {
+            for await (const _ of runLoop(failingSource, [], sink, undefined)) {
+                // First iteration yields, then source throws
+            }
+        }, /source boom/)
+
+        assert.deepStrictEqual(sink.recorded, [1])
+        assert.strictEqual(sink.finalized, true)
+    })
+})
+
+// ============================================================================
+// Bonus: short-circuit on empty
+// ============================================================================
+
+describe("Behavior: short-circuit on empty chunk", () => {
+    it("skips subsequent transforms when an upstream transform empties the chunk", async () => {
+        const source = makeFakeSource<number>([{items: [1, 2, 3], cursor: null}])
+        const emptyAll: Transform<number, number> = (chunk) => ({...chunk, items: []})
+        const downstreamMock = mock.fn((chunk: Chunk<number>) => chunk)
+        const downstream: Transform<number, number> = downstreamMock
+        const sink = makeRecordingSink<number>()
+
+        for await (const _ of runLoop(source, [emptyAll, downstream], sink, undefined)) {
+            // Iterate to completion
+        }
+
+        assert.strictEqual(downstreamMock.mock.callCount(), 0)
+        assert.deepStrictEqual(sink.recorded, [])
+    })
+})
diff --git a/web/packages/agenta-entities/src/etl/__tests__/runLoop.leak.test.ts b/web/packages/agenta-entities/src/etl/__tests__/runLoop.leak.test.ts
new file mode 100644
index 0000000000..4cbae3cebe
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/__tests__/runLoop.leak.test.ts
@@ -0,0 +1,182 @@
+/**
+ * Leak detection — does the engine accumulate state across many pipeline runs?
+ *
+ * Each pipeline run creates a fresh Source / Sink / Transform[]. If the engine
+ * (or its hosting infrastructure — atomFamily, listeners, microtask queues)
+ * retains anything beyond the pipeline's lifetime, repeated runs cause heap
+ * growth proportional to iteration count.
+ *
+ * The first test runs the engine 100 times, samples heap after every 10th
+ * iteration, and asserts that the linear-regression slope of heap over
+ * iteration stays under budget. The second runs 500 times and bounds the range.
+ *
+ * Caveat: the tests below use synthetic in-memory sources only. They do not
+ * exercise makeSourceFromPaginatedStore, whose Jotai `atomFamily` retains
+ * atoms until `.remove()` is called, so a leak in that adapter (one atom per
+ * fresh scopeId) would need its own test and is not caught here.
+ *
+ * Run via test:etl:longrun (slow — ~10-30s). Not part of regular CI.
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import type {Sink, Source, Transform} from "../core/types"
+import {runLoop} from "../runtime/runLoop"
+
+const hasGc = typeof (globalThis as {gc?: () => void}).gc === "function"
+
+function forceGc() {
+    ;(globalThis as {gc?: () => void}).gc?.()
+}
+
+interface Row {
+    id: number
+}
+
+function makeSource(opts: {chunks: number; chunkSize: number}): Source<Row, undefined> {
+    return {
+        async *extract(_params, signal) {
+            for (let c = 0; c < opts.chunks; c++) {
+                if (signal.aborted) return
+                const items: Row[] = []
+                for (let i = 0; i < opts.chunkSize; i++) {
+                    items.push({id: c * opts.chunkSize + i})
+                }
+                await Promise.resolve()
+                yield {
+                    items,
+                    cursor: c < opts.chunks - 1 ? `c${c}` : null,
+                }
+            }
+        },
+    }
+}
+
+function makeNullSink<T>(): Sink<T> {
+    return {
+        async load() {
+            return {loadedCount: 0}
+        },
+    }
+}
+
+/**
+ * Least-squares slope over sample index (bytes per sample). Returns 0 if < 2 samples.
+ */
+function regressionSlope(samples: number[]): number {
+    if (samples.length < 2) return 0
+    const n = samples.length
+    const xs = samples.map((_, i) => i)
+    const meanX = xs.reduce((a, b) => a + b, 0) / n
+    const meanY = samples.reduce((a, b) => a + b, 0) / n
+    const num = xs.reduce((acc, x, i) => acc + (x - meanX) * (samples[i] - meanY), 0)
+    const den = xs.reduce((acc, x) => acc + (x - meanX) ** 2, 0)
+    return den === 0 ? 0 : num / den
+}
+
+describe("Leak: repeated pipeline construction does not retain heap", () => {
+    it(
+        "100 iterations of fresh source/sink: heap slope is near zero",
+        {timeout: 120_000, skip: !hasGc ? "needs --expose-gc" : false},
+        async () => {
+            const ITERATIONS = 100
+            const WARMUP = 10
+            const SAMPLE_INTERVAL = 10
+
+            // Each iteration: construct a fresh source + sink + transform,
+            // run the loop to completion. Nothing should persist between runs.
+            const passthroughTransform: Transform<Row, Row> = (chunk) => chunk
+
+            const samples: number[] = []
+
+            for (let iter = 0; iter < ITERATIONS; iter++) {
+                const source = makeSource({chunks: 20, chunkSize: 100})
+                const sink = makeNullSink<Row>()
+                for await (const _ of runLoop(source, [passthroughTransform], sink, undefined)) {
+                    // drain
+                }
+
+                if (iter >= WARMUP && iter % SAMPLE_INTERVAL === 0) {
+                    forceGc()
+                    samples.push(process.memoryUsage().heapUsed)
+                }
+            }
+
+            assert.ok(samples.length >= 5, `expected ≥5 samples, got ${samples.length}`)
+
+            const slopeBytesPerSample = regressionSlope(samples)
+            // Each sample is 10 iterations apart, so slope/sample → slope/iter
+            const slopeBytesPerIter = slopeBytesPerSample / SAMPLE_INTERVAL
+
+            // Budget: 50 KB per iteration, loose enough to tolerate GC noise.
+            // Rows here are tiny ({id} only), so retaining a full run is likely
+            // tens of KB, not MB; fat-row retention is in runLoop.memory.test.ts.
+            const BUDGET_KB_PER_ITER = 50
+
+            console.log(
+                `\n  iterations sampled: ${samples.length} ` +
+                    `(every ${SAMPLE_INTERVAL}th, after ${WARMUP} warmup)`,
+            )
+            console.log(
+                `  heap samples (MB): [${samples
+                    .map((s) => (s / 1024 / 1024).toFixed(1))
+                    .join(", ")}]`,
+            )
+            console.log(
+                `  slope: ${(slopeBytesPerIter / 1024).toFixed(2)} KB/iter ` +
+                    `(budget ${BUDGET_KB_PER_ITER} KB/iter)`,
+            )
+
+            assert.ok(
+                slopeBytesPerIter < BUDGET_KB_PER_ITER * 1024,
+                `heap grows by ${(slopeBytesPerIter / 1024).toFixed(1)} KB per iteration ` +
+                    `(budget ${BUDGET_KB_PER_ITER} KB/iter). Samples (MB): ` +
+                    `[${samples.map((s) => (s / 1024 / 1024).toFixed(2)).join(", ")}]. ` +
+                    `Something is being retained across pipeline runs.`,
+            )
+        },
+    )
+
+    it(
+        "500 iterations: confirms stable steady-state heap",
+        {timeout: 180_000, skip: !hasGc ? "needs --expose-gc" : false},
+        async () => {
+            const ITERATIONS = 500
+            const WARMUP = 50
+            const SAMPLE_INTERVAL = 25
+
+            const samples: number[] = []
+
+            for (let iter = 0; iter < ITERATIONS; iter++) {
+                const source = makeSource({chunks: 10, chunkSize: 50})
+                const sink = makeNullSink<Row>()
+                for await (const _ of runLoop(source, [], sink, undefined)) {
+                    // drain
+                }
+
+                if (iter >= WARMUP && iter % SAMPLE_INTERVAL === 0) {
+                    forceGc()
+                    samples.push(process.memoryUsage().heapUsed)
+                }
+            }
+
+            const minHeap = Math.min(...samples)
+            const maxHeap = Math.max(...samples)
+            const rangeMb = (maxHeap - minHeap) / 1024 / 1024
+
+            // Range over 500 iterations should be small — GC noise only
+            const BUDGET_MB = 5
+
+            console.log(
+                `\n  steady-state heap range: ${rangeMb.toFixed(2)} MB over ${ITERATIONS} iterations`,
+            )
+
+            assert.ok(
+                rangeMb < BUDGET_MB,
+                `heap range ${rangeMb.toFixed(1)}MB over ${ITERATIONS} iterations ` +
+                    `exceeded ${BUDGET_MB}MB budget. Indicates non-bounded growth.`,
+            )
+        },
+    )
+})
diff --git a/web/packages/agenta-entities/src/etl/__tests__/runLoop.memory.test.ts b/web/packages/agenta-entities/src/etl/__tests__/runLoop.memory.test.ts
new file mode 100644
index 0000000000..139faa7faf
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/__tests__/runLoop.memory.test.ts
@@ -0,0 +1,285 @@
+/**
+ * Engine memory bounds — assertions, not observations.
+ *
+ * The design RFC claims "pipeline memory bounded by chunk size." This file
+ * encodes that claim as enforceable tests. They require `--expose-gc` so we
+ * can force a deterministic baseline before measurement.
+ *
+ * Run:
+ *   pnpm --filter @agenta/entities test:etl:memory
+ *
+ * Or directly:
+ *   pnpm exec tsx --test --node-options="--expose-gc" \
+ *     src/etl/__tests__/runLoop.memory.test.ts
+ *
+ * If `global.gc` is unavailable (running without `--expose-gc`), tests skip
+ * rather than fail — local `pnpm test:etl` still works for everyone.
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import type {Chunk, Sink, Source} from "../core/types"
+import {runLoop} from "../runtime/runLoop"
+
+// ============================================================================
+// Helpers
+// ============================================================================
+
+/**
+ * Whether forced GC is available (requires --expose-gc).
+ */
+const hasGc = typeof (globalThis as {gc?: () => void}).gc === "function"
+
+function forceGc() {
+    ;(globalThis as {gc?: () => void}).gc?.()
+}
+
+/**
+ * A chunk where each row carries a ~1 KB payload. 1000 rows per chunk ≈ 1 MB
+ * per chunk. Used to create memory pressure detectable above Node baseline.
+ */
+interface FatRow {
+    id: number
+    payload: string
+}
+
+function makeFatRow(i: number): FatRow {
+    // ~1 KB payload via a repeated character buffer
+    return {id: i, payload: "x".repeat(1000)}
+}
+
+/**
+ * Synthetic source yielding N chunks of `chunkSize` fat rows. Each chunk is
+ * freshly allocated. Honors AbortSignal.
+ */
+function makeFatSource(opts: {chunks: number; chunkSize: number}): Source<FatRow, undefined> {
+    return {
+        async *extract(_params, signal) {
+            for (let c = 0; c < opts.chunks; c++) {
+                if (signal.aborted) return
+                const items: FatRow[] = []
+                for (let i = 0; i < opts.chunkSize; i++) {
+                    items.push(makeFatRow(c * opts.chunkSize + i))
+                }
+                // Microtask to simulate async boundary
+                await Promise.resolve()
+                yield {
+                    items,
+                    cursor: c < opts.chunks - 1 ? `chunk-${c}` : null,
+                }
+            }
+        },
+    }
+}
+
+/**
+ * Sink that drops every chunk on the floor. Forces the loop to be the only
+ * thing potentially retaining references.
+ */
+function makeNullSink<T>(): Sink<T> {
+    return {
+        async load(_chunk) {
+            return {loadedCount: 0}
+        },
+    }
+}
+
+/**
+ * Returns heap delta from `baseline` in MB.
+ */
+function heapMb(baseline: number): number {
+    return (process.memoryUsage().heapUsed - baseline) / 1024 / 1024
+}
+
+// ============================================================================
+// Memory bound — the load-bearing assertion
+// ============================================================================
+
+describe("Memory: pipeline holds at most one chunk", () => {
+    it(
+        "100 chunks × 1000 fat rows: heap stays bounded by chunk size + overhead",
+        {timeout: 60_000, skip: !hasGc ? "needs --expose-gc" : false},
+        async () => {
+            const CHUNKS = 100
+            const CHUNK_SIZE = 1000 // ~1 MB per chunk
+            // If memory were unbounded: 100 chunks × 1 MB = 100 MB resident
+            // If bounded: 1 chunk + GC noise = expected < 20 MB after GC
+            const BUDGET_MB = 25
+
+            const source = makeFatSource({chunks: CHUNKS, chunkSize: CHUNK_SIZE})
+            const sink = makeNullSink<FatRow>()
+
+            // Warm up — allocate a chunk's worth of fat data so the heap is
+            // sized realistically before we baseline
+            for (let i = 0; i < CHUNK_SIZE; i++) makeFatRow(i)
+            forceGc()
+            await new Promise((r) => setImmediate(r))
+            forceGc()
+
+            const baseline = process.memoryUsage().heapUsed
+            const samples: number[] = []
+            let chunksProcessed = 0
+
+            for await (const _ of runLoop(source, [], sink, undefined)) {
+                chunksProcessed++
+                // Sample every 10 chunks (after a GC) so we see steady-state heap
+                if (chunksProcessed % 10 === 0) {
+                    forceGc()
+                    samples.push(heapMb(baseline))
+                }
+            }
+
+            assert.strictEqual(
+                chunksProcessed,
+                CHUNKS,
+                "loop should iterate all chunks before exit",
+            )
+
+            const maxHeap = Math.max(...samples)
+            const finalHeap = samples[samples.length - 1] ?? 0
+
+            assert.ok(
+                maxHeap < BUDGET_MB,
+                `max heap delta ${maxHeap.toFixed(1)}MB exceeded ${BUDGET_MB}MB budget. ` +
+                    `Samples (MB): [${samples.map((s) => s.toFixed(1)).join(", ")}]. ` +
+                    `This means the loop is retaining chunks — possibly via a stale ` +
+                    `reference in 'current' or 'chunk' that isn't released between iterations.`,
+            )
+            assert.ok(
+                finalHeap < BUDGET_MB,
+                `final heap delta ${finalHeap.toFixed(1)}MB exceeded ${BUDGET_MB}MB. ` +
+                    `If max was OK but final isn't, the loop's exit path may not release.`,
+            )
+        },
+    )
+
+    it(
+        "Heap delta does NOT grow linearly with chunk count",
+        {timeout: 60_000, skip: !hasGc ? "needs --expose-gc" : false},
+        async () => {
+            // Run 100 chunks. Sample heap delta at quartile points and confirm
+            // the last sample is within budget of the first (no linear growth).
+            const CHUNKS = 100
+            const CHUNK_SIZE = 1000
+
+            const source = makeFatSource({chunks: CHUNKS, chunkSize: CHUNK_SIZE})
+            const sink = makeNullSink<FatRow>()
+
+            forceGc()
+            const baseline = process.memoryUsage().heapUsed
+            const samples: number[] = []
+            let chunksProcessed = 0
+
+            for await (const _ of runLoop(source, [], sink, undefined)) {
+                chunksProcessed++
+                if (chunksProcessed % 25 === 0) {
+                    forceGc()
+                    samples.push(heapMb(baseline))
+                }
+            }
+
+            // Samples at 25/50/75/100. If memory bounded, the last quartile
+            // should NOT be much larger than the first.
+            assert.strictEqual(samples.length, 4)
+            const [q1, , , q4] = samples
+            const growth = q4 - q1
+            const GROWTH_BUDGET_MB = 10
+
+            assert.ok(
+                growth < GROWTH_BUDGET_MB,
+                `heap grew ${growth.toFixed(1)}MB from chunk 25 to chunk 100 ` +
+                    `(samples: ${samples.map((s) => s.toFixed(1)).join(", ")}MB). ` +
+                    `Budget: ${GROWTH_BUDGET_MB}MB. Indicates monotonic memory growth — ` +
+                    `something is accumulating per-chunk.`,
+            )
+        },
+    )
+})
+
+// ============================================================================
+// Cancellation memory — aborting mid-stream releases work-in-flight
+// ============================================================================
+
+describe("Memory: cancellation releases held chunks", () => {
+    it(
+        "After abort, heap returns to near baseline",
+        {timeout: 30_000, skip: !hasGc ? "needs --expose-gc" : false},
+        async () => {
+            const CHUNK_SIZE = 1000
+
+            const source = makeFatSource({chunks: 1000, chunkSize: CHUNK_SIZE})
+            const sink = makeNullSink<FatRow>()
+            const controller = new AbortController()
+
+            forceGc()
+            const baseline = process.memoryUsage().heapUsed
+
+            let count = 0
+            for await (const _ of runLoop(source, [], sink, undefined, controller.signal)) {
+                count++
+                if (count === 20) controller.abort()
+            }
+
+            // Force GC twice (one to collect, one to confirm)
+            forceGc()
+            await new Promise((r) => setImmediate(r))
+            forceGc()
+
+            const finalDelta = heapMb(baseline)
+            const BUDGET_MB = 15
+
+            assert.ok(
+                finalDelta < BUDGET_MB,
+                `after cancellation, heap delta ${finalDelta.toFixed(1)}MB exceeded ` +
+                    `${BUDGET_MB}MB budget. The loop's references to in-flight chunks ` +
+                    `may not be released on abort.`,
+            )
+        },
+    )
+})
+
+// ============================================================================
+// Transform composition memory — long transform chains don't accumulate
+// ============================================================================
+
+describe("Memory: transform composition stays bounded", () => {
+    it(
+        "10-transform chain over 100 chunks: same bound as 0 transforms",
+        {timeout: 60_000, skip: !hasGc ? "needs --expose-gc" : false},
+        async () => {
+            const CHUNKS = 100
+            const CHUNK_SIZE = 500
+
+            // Identity transforms: each returns a new chunk object with a copied
+            // items array, forcing a fresh allocation per transform per chunk.
+            const transforms = Array.from({length: 10}, () => (chunk: Chunk<FatRow>) => ({
+                ...chunk,
+                items: chunk.items.map((r) => r),
+            }))
+
+            const source = makeFatSource({chunks: CHUNKS, chunkSize: CHUNK_SIZE})
+            const sink = makeNullSink<FatRow>()
+
+            forceGc()
+            const baseline = process.memoryUsage().heapUsed
+            let chunksProcessed = 0
+
+            for await (const _ of runLoop(source, transforms, sink, undefined)) {
+                chunksProcessed++
+            }
+            forceGc()
+
+            const finalDelta = heapMb(baseline)
+            const BUDGET_MB = 30 // 1 chunk × 10 intermediates + overhead
+
+            assert.strictEqual(chunksProcessed, CHUNKS)
+            assert.ok(
+                finalDelta < BUDGET_MB,
+                `transform chain leaked ${finalDelta.toFixed(1)}MB (budget ${BUDGET_MB}MB). ` +
+                    `One of the transforms or the loop's 'current' variable is retaining ` +
+                    `intermediate chunks across iterations.`,
+            )
+        },
+    )
+})
diff --git a/web/packages/agenta-entities/src/etl/__tests__/runLoop.overhead.test.ts b/web/packages/agenta-entities/src/etl/__tests__/runLoop.overhead.test.ts
new file mode 100644
index 0000000000..4f410098e8
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/__tests__/runLoop.overhead.test.ts
@@ -0,0 +1,224 @@
+/**
+ * Engine overhead — what's the cost of `runLoop` vs a hand-written loop?
+ *
+ * The engine adds yield events, AbortSignal checks, finalize handling, and a
+ * try/finally wrapper. None should add significant overhead, but "significant"
+ * needs a number. This file pins it down.
+ *
+ * Compares two implementations doing the same work:
+ *   1. Baseline: hand-written async iteration over the source, filter, sink
+ *   2. Engine: same work through runLoop
+ *
+ * Asserts engine overhead < BUDGET (currently 25% — generous to absorb CI
+ * variance; tighten as we get steady CI numbers).
+ *
+ * Run:
+ *   pnpm --filter @agenta/entities test:etl:memory
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import type {Chunk, Sink, Source, Transform} from "../core/types"
+import {runLoop} from "../runtime/runLoop"
+
+// ============================================================================
+// Synthetic workload — kept small so test runs in seconds
+// ============================================================================
+
+interface Row {
+    id: number
+    score: number
+}
+
+function makeSource(opts: {chunks: number; chunkSize: number}): Source<Row, undefined> {
+    return {
+        async *extract(_params, signal) {
+            for (let c = 0; c < opts.chunks; c++) {
+                if (signal.aborted) return
+                const items: Row[] = []
+                for (let i = 0; i < opts.chunkSize; i++) {
+                    items.push({id: c * opts.chunkSize + i, score: (i * 17) % 100})
+                }
+                // Yield to event loop so timing is comparable to real async
+                await Promise.resolve()
+                yield {
+                    items,
+                    cursor: c < opts.chunks - 1 ? `c${c}` : null,
+                }
+            }
+        },
+    }
+}
+
+const filterScoreGte50: Transform<Row, Row> = (chunk) => ({
+    ...chunk,
+    items: chunk.items.filter((r) => r.score >= 50),
+})
+
+function makeAccumulatorSink(): Sink<Row> & {received: number} {
+    const sink = {
+        received: 0,
+        async load(chunk: Chunk<Row>) {
+            sink.received += chunk.items.length
+            return {loadedCount: chunk.items.length}
+        },
+    }
+    return sink
+}
+
+// ============================================================================
+// Baseline: hand-written equivalent of runLoop
+// ============================================================================
+
+async function baselineLoop(
+    source: Source<Row, undefined>,
+    transform: Transform<Row, Row>,
+    sink: Sink<Row>,
+): Promise<{scanned: number; matched: number; loaded: number}> {
+    const signal = new AbortController().signal
+    let scanned = 0
+    let matched = 0
+    let loaded = 0
+
+    for await (const chunk of source.extract(undefined, signal)) {
+        scanned += chunk.items.length
+        const out = await transform(chunk)
+        matched += out.items.length
+        if (out.items.length > 0) {
+            const r = await sink.load(out)
+            loaded += r.loadedCount ?? out.items.length
+        }
+    }
+
+    return {scanned, matched, loaded}
+}
+
+// ============================================================================
+// Timing helper
+// ============================================================================
+
+async function timeMs(fn: () => Promise<unknown>): Promise<number> {
+    const start = performance.now()
+    await fn()
+    return performance.now() - start
+}
+
+function median(values: number[]): number {
+    const sorted = [...values].sort((a, b) => a - b)
+    const mid = sorted.length >> 1
+    return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
+}
+
+// ============================================================================
+// Overhead test
+// ============================================================================
+
+describe("Overhead: runLoop vs hand-written equivalent", () => {
+    it("engine overhead median is < 25% over baseline", {timeout: 60_000}, async () => {
+        // Workload size chosen so each run takes 50-200ms (enough signal
+        // to measure, fast enough that CI doesn't time out)
+        const CHUNKS = 200
+        const CHUNK_SIZE = 500
+        const ITERATIONS = 5 // 5 runs of each, take median
+
+        // Warm-up: run each once before timing (JIT, allocator priming)
+        {
+            const src = makeSource({chunks: CHUNKS, chunkSize: CHUNK_SIZE})
+            await baselineLoop(src, filterScoreGte50, makeAccumulatorSink())
+        }
+        {
+            const src = makeSource({chunks: CHUNKS, chunkSize: CHUNK_SIZE})
+            const sink = makeAccumulatorSink()
+            for await (const _ of runLoop(src, [filterScoreGte50], sink, undefined)) {
+                // drain
+            }
+        }
+
+        // Measure baseline
+        const baselineSamples: number[] = []
+        for (let i = 0; i < ITERATIONS; i++) {
+            const src = makeSource({chunks: CHUNKS, chunkSize: CHUNK_SIZE})
+            const sink = makeAccumulatorSink()
+            baselineSamples.push(
+                await timeMs(async () => {
+                    await baselineLoop(src, filterScoreGte50, sink)
+                }),
+            )
+        }
+
+        // Measure engine
+        const engineSamples: number[] = []
+        for (let i = 0; i < ITERATIONS; i++) {
+            const src = makeSource({chunks: CHUNKS, chunkSize: CHUNK_SIZE})
+            const sink = makeAccumulatorSink()
+            engineSamples.push(
+                await timeMs(async () => {
+                    for await (const _ of runLoop(src, [filterScoreGte50], sink, undefined)) {
+                        // drain
+                    }
+                }),
+            )
+        }
+
+        const baselineMed = median(baselineSamples)
+        const engineMed = median(engineSamples)
+        const overheadPct = ((engineMed - baselineMed) / baselineMed) * 100
+        const BUDGET_PCT = 25
+
+        // Report findings even on pass — useful in CI logs
+        console.log(
+            `\n  baseline median: ${baselineMed.toFixed(2)}ms ` +
+                `[${baselineSamples.map((s) => s.toFixed(1)).join(", ")}]`,
+        )
+        console.log(
+            `  engine median:   ${engineMed.toFixed(2)}ms ` +
+                `[${engineSamples.map((s) => s.toFixed(1)).join(", ")}]`,
+        )
+        console.log(`  overhead:        ${overheadPct.toFixed(1)}% (budget ${BUDGET_PCT}%)`)
+
+        assert.ok(
+            overheadPct < BUDGET_PCT,
+            `engine overhead ${overheadPct.toFixed(1)}% exceeded ${BUDGET_PCT}% budget. ` +
+                `Baseline median ${baselineMed.toFixed(1)}ms, engine median ${engineMed.toFixed(1)}ms. ` +
+                `Check the loop for accidental work in the hot path (extra awaits, allocations).`,
+        )
+    })
+
+    it("engine processes the same row counts as baseline", async () => {
+        const CHUNKS = 50
+        const CHUNK_SIZE = 200
+
+        const src1 = makeSource({chunks: CHUNKS, chunkSize: CHUNK_SIZE})
+        const sink1 = makeAccumulatorSink()
+        const baselineResult = await baselineLoop(src1, filterScoreGte50, sink1)
+
+        const src2 = makeSource({chunks: CHUNKS, chunkSize: CHUNK_SIZE})
+        const sink2 = makeAccumulatorSink()
+        let engineScanned = 0
+        let engineMatched = 0
+        let engineLoaded = 0
+        for await (const progress of runLoop(src2, [filterScoreGte50], sink2, undefined)) {
+            engineScanned = progress.scanned
+            engineMatched = progress.matched
+            engineLoaded = progress.loaded
+        }
+
+        assert.strictEqual(
+            engineScanned,
+            baselineResult.scanned,
+            `engine scanned ${engineScanned} but baseline scanned ${baselineResult.scanned}`,
+        )
+        assert.strictEqual(
+            engineMatched,
+            baselineResult.matched,
+            `engine matched ${engineMatched} but baseline matched ${baselineResult.matched}`,
+        )
+        assert.strictEqual(
+            engineLoaded,
+            baselineResult.loaded,
+            `engine loaded ${engineLoaded} but baseline loaded ${baselineResult.loaded}`,
+        )
+        assert.strictEqual(sink1.received, sink2.received, "sinks received different counts")
+    })
+})
diff --git a/web/packages/agenta-entities/src/etl/adapters/makeSourceFromCursorFetch.ts b/web/packages/agenta-entities/src/etl/adapters/makeSourceFromCursorFetch.ts
new file mode 100644
index 0000000000..f2efa54422
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/adapters/makeSourceFromCursorFetch.ts
@@ -0,0 +1,81 @@
+/**
+ * makeSourceFromCursorFetch — builds an ETL `Source<T>` from a cursor-paged
+ * fetch function.
+ *
+ * The caller injects only the per-page transport (`fetchPage`); this adapter
+ * owns the scan loop: cursor advance, the empty-page no-progress guard,
+ * request throttling, `AbortSignal` checks, and `cursor: null` end-of-stream
+ * signaling. Any cursor-paginated API (trace queries, scenario queries, …)
+ * becomes an ETL source through this one adapter.
+ *
+ * @packageDocumentation
+ */
+
+import type {Chunk, Source} from "../core/types"
+
+/** One page of a cursor-paged scan. `nextCursor: null` ⇒ end of stream. */
+export interface CursorPage<T> {
+    rows: T[]
+    nextCursor: string | null
+}
+
+export interface CursorFetchSourceConfig<T> {
+    /** Fetch one page. `cursor` is `null` for the first page. */
+    fetchPage: (cursor: string | null, signal: AbortSignal) => Promise<CursorPage<T>>
+    /** Delay between page requests, ms — reduces API load. Default 100. */
+    pageDelayMs?: number
+    /**
+     * Empty pages in a row before the scan stops (a guard against a cursor
+     * that never advances). Default 3.
+     */
+    maxEmptyPages?: number
+}
+
+const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))
+
+/**
+ * Build an ETL `Source<T>` from a cursor-paged fetch function. Yields one
+ * `Chunk` per page; the final chunk carries `cursor: null`.
+ */
+export function makeSourceFromCursorFetch<T>(
+    config: CursorFetchSourceConfig<T>,
+): Source<T, undefined> {
+    const {fetchPage, pageDelayMs = 100, maxEmptyPages = 3} = config
+
+    return {
+        async *extract(_params, signal): AsyncIterable<Chunk<T>> {
+            let cursor: string | null = null
+            let emptyPageCount = 0
+
+            while (true) {
+                // Cooperative cancellation between pages.
+                if (signal.aborted) return
+
+                const page = await fetchPage(cursor, signal)
+
+                emptyPageCount = page.rows.length === 0 ? emptyPageCount + 1 : 0
+                const nextCursor = page.nextCursor
+                // Safety: the cursor stopped advancing — end the stream so
+                // the loop can't spin forever. Two ways this happens:
+                //   - Pages keep coming back empty (`maxEmptyPages` guard).
+                //   - The server returns the SAME cursor we just used. Real
+                //     example: every row on the page shares a `start_time`
+                //     the backend's strict-less-than filter can't bump past,
+                //     so each fetch returns the same rows + same cursor. The
+                //     page is non-empty so the empty-page guard never fires,
+                //     but downstream dedup filters everything out and the
+                //     run never makes forward progress.
+                const cursorStuck = cursor !== null && nextCursor !== null && nextCursor === cursor
+                const stalled = emptyPageCount >= maxEmptyPages || cursorStuck
+                const lastChunk = nextCursor === null || stalled
+
+                yield {items: page.rows, cursor: lastChunk ? null : nextCursor}
+
+                if (lastChunk) return
+
+                await delay(pageDelayMs)
+                cursor = nextCursor
+            }
+        },
+    }
+}
diff --git a/web/packages/agenta-entities/src/etl/adapters/makeSourceFromPaginatedStore.ts b/web/packages/agenta-entities/src/etl/adapters/makeSourceFromPaginatedStore.ts
new file mode 100644
index 0000000000..2d23e09211
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/adapters/makeSourceFromPaginatedStore.ts
@@ -0,0 +1,214 @@
+/**
+ * makeSourceFromPaginatedStore — wraps any `createPaginatedEntityStore` instance
+ * as an ETL `Source<TApiRow>`.
+ *
+ * This is the integration point between the ETL engine and the existing
+ * paginated-store infrastructure. The Source drives the store's reactive
+ * pagination machinery (subscribes to the controller, schedules next pages
+ * via the underlying table store's atoms), yielding chunks of newly-loaded
+ * rows for each page.
+ *
+ * Key properties:
+ *   - Uses the SAME `fetchPage` callback the store uses (no duplicate plumbing)
+ *   - Shares the store's accumulated rows with any UI consumer subscribed to
+ *     the same scope (so an ETL pipeline running in parallel populates the
+ *     same atoms the V-table would read from)
+ *   - Honors AbortSignal — cancellation stops the loop and prevents further
+ *     page scheduling
+ *   - Yields cursor=null on the final chunk to signal end-of-stream cleanly
+ *
+ * Architectural note: this file uses deep relative imports (rather than
+ * `@agenta/entities/shared`) because the entities package's `shared` barrel
+ * transitively pulls React components (via `shared/user/UserAuthorLabel.tsx`).
+ * The barrel hygiene issue is documented in eval-package-architecture.md;
+ * once that's fixed, this file can switch to clean package imports.
+ *
+ * @packageDocumentation
+ */
+
+import {getDefaultStore, type Atom, type WritableAtom} from "jotai"
+
+import type {WindowingState} from "../../shared/tableTypes"
+import type {Chunk, Source} from "../core/types"
+
+// ============================================================================
+// Type-level shape of what we need from a PaginatedEntityStore
+// ============================================================================
+
+/**
+ * The subset of `PaginatedEntityStore` this adapter relies on. Declared
+ * locally so we don't have to import the full type (which would pull deep
+ * paginated-store internals into the engine package).
+ */
+export interface PaginatedStoreLike<TApiRow> {
+    entityName: string
+    /** The dataset store wraps the inner table store. */
+    store: {
+        /** Inner table store with the pagination primitives. */
+        store: {
+            atoms: {
+                paginationInfoAtomFamily: (params: {scopeId: string; pageSize: number}) => Atom<{
+                    isFetching: boolean
+                    hasMore: boolean
+                    nextCursor: string | null
+                    nextOffset: number | null
+                    nextWindowing: WindowingState | null
+                    totalCount: number | null
+                }>
+                combinedRowsAtomFamily: (params: {
+                    scopeId: string
+                    pageSize: number
+                }) => Atom<TApiRow[]>
+                // Must match `ScheduleWriteArg` in createInfiniteTableStore.ts:
+                // a real paginated store's scheduler requires all four fields.
+                scheduleNextPageAtomFamily: (params: {
+                    scopeId: string
+                    pageSize: number
+                }) => WritableAtom<
+                    null,
+                    [
+                        null | {
+                            nextCursor: string
+                            nextOffset: number
+                            nextWindowing: WindowingState | null
+                            totalRows: number
+                        },
+                    ],
+                    void
+                >
+            }
+        }
+    }
+    /** Controller atom family — subscribing triggers the initial fetch. */
+    controller: (params: {scopeId: string; pageSize: number}) => Atom<{
+        rows: TApiRow[]
+        hasMore: boolean
+        isFetching: boolean
+        totalCount: number | null
+        selectedKeys: unknown[]
+    }>
+}
+
+// ============================================================================
+// Adapter
+// ============================================================================
+
+export interface MakeSourceParams {
+    /** Scope ID for the paginated store's controller atom family. */
+    scopeId: string
+    /** Page size passed to the store's pagination machinery. Defaults to 200. */
+    pageSize?: number
+    /** Max time (ms) to wait for any single page load. Defaults to 30s. */
+    pageLoadTimeoutMs?: number
+}
+
+/**
+ * Wraps a `createPaginatedEntityStore` instance as an ETL `Source<TApiRow>`.
+ *
+ * Implementation strategy:
+ *   1. Subscribe to the controller atom — this kicks off the initial fetch
+ *      and keeps the store's reactive pagination machinery alive.
+ *   2. Poll the pagination atom for `isFetching=false`. When the page lands,
+ *      yield the newly-loaded rows as a chunk (diff from rowsSeen index).
+ *   3. If `hasMore` is true, dispatch `scheduleNextPage` and loop.
+ *   4. If `hasMore` is false, yield with `cursor: null` and return.
+ *
+ * The same `fetchPage` callback the store was constructed with drives every
+ * page load. Other consumers of the store (e.g. a V-table subscribed to the
+ * same scope) will see rows accumulate in real time as this Source iterates.
+ */
+export function makeSourceFromPaginatedStore<TApiRow>(
+    paginatedStore: PaginatedStoreLike<TApiRow>,
+    params: MakeSourceParams,
+): Source<TApiRow, undefined> {
+    const {scopeId, pageSize = 200, pageLoadTimeoutMs = 30_000} = params
+
+    return {
+        async *extract(_extractParams, signal) {
+            const store = getDefaultStore()
+            const tableAtoms = paginatedStore.store.store.atoms
+            const familyKey = {scopeId, pageSize}
+
+            const paginationAtom = tableAtoms.paginationInfoAtomFamily(familyKey)
+            const rowsAtom = tableAtoms.combinedRowsAtomFamily(familyKey)
+            const scheduleAtom = tableAtoms.scheduleNextPageAtomFamily(familyKey)
+            const controllerAtom = paginatedStore.controller(familyKey)
+
+            // Subscribe to controller — kicks off the initial fetch and keeps
+            // the reactive machinery alive. Unsubscribe in the finally block.
+            const unsub = store.sub(controllerAtom, () => {})
+
+            try {
+                let rowsSeen = 0
+                let lastCursor: string | null = null
+
+                while (!signal.aborted) {
+                    // Wait for the current page to settle (isFetching → false)
+                    const waitStart = Date.now()
+                    while (!signal.aborted) {
+                        const pagination = store.get(paginationAtom)
+                        if (!pagination.isFetching) break
+                        if (Date.now() - waitStart > pageLoadTimeoutMs) {
+                            throw new Error(
+                                `page load exceeded ${pageLoadTimeoutMs}ms (scope: ${scopeId})`,
+                            )
+                        }
+                        await new Promise((r) => setTimeout(r, 50))
+                    }
+                    if (signal.aborted) return
+
+                    const pagination = store.get(paginationAtom)
+                    const rows = store.get(rowsAtom)
+                    const newRows = rows.slice(rowsSeen)
+                    rowsSeen = rows.length
+
+                    // Compute cursor for this chunk:
+                    //   - If hasMore: cursor is the store's nextCursor (or fallback to last-row-id)
+                    //   - If !hasMore: cursor is null (end of stream)
+                    const apiCursor = pagination.nextCursor
+                    const lastRow = newRows[newRows.length - 1] as {id?: string} | undefined
+                    const fallback = lastRow?.id ?? null
+                    const chunkCursor: string | null = pagination.hasMore
+                        ? (apiCursor ?? fallback)
+                        : null
+
+                    const chunk: Chunk<TApiRow> = {
+                        items: newRows,
+                        cursor: chunkCursor,
+                        meta: {
+                            hint: paginatedStore.entityName,
+                            hasMore: pagination.hasMore,
+                        },
+                    }
+
+                    yield chunk
+
+                    if (!pagination.hasMore || newRows.length === 0) return
+                    if (signal.aborted) return
+
+                    // Drive the next page via the store's own scheduler
+                    const nextCursor = pagination.nextCursor ?? fallback
+                    if (!nextCursor || nextCursor === lastCursor) return
+                    lastCursor = nextCursor
+
+                    // Mirror `useInfiniteTablePagination.loadNextPage` — the
+                    // store's scheduler reducer requires all four fields.
+                    const nextWindowing: WindowingState = pagination.nextWindowing ?? {
+                        next: nextCursor,
+                        order: "ascending",
+                        limit: pageSize,
+                        stop: null,
+                    }
+                    store.set(scheduleAtom, {
+                        nextCursor,
+                        nextOffset: pagination.nextOffset ?? rowsSeen,
+                        nextWindowing,
+                        totalRows: rowsSeen,
+                    })
+                }
+            } finally {
+                unsub()
+            }
+        },
+    }
+}
diff --git a/web/packages/agenta-entities/src/etl/core/types.ts b/web/packages/agenta-entities/src/etl/core/types.ts
new file mode 100644
index 0000000000..b00eb39dcf
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/core/types.ts
@@ -0,0 +1,191 @@
+/**
+ * ETL Loop Engine — Contracts
+ *
+ * The four shapes that define the engine. No DSL, no implementations,
+ * just the protocol. See docs/designs/etl-engine.md for the design RFC.
+ *
+ * Five guarantees of the loop runtime:
+ *   1. Pipeline memory bounded by chunk size
+ *   2. Cancellation through the loop body (via AbortSignal)
+ *   3. Progress is observable (yielded per chunk)
+ *   4. Backpressure is natural (await sink.load)
+ *   5. Idempotent resume is possible (cursor + AbortSignal + deterministic sink)
+ *
+ * @packageDocumentation
+ */
+
+// ============================================================================
+// CURSOR — opaque pagination token
+// ============================================================================
+
+/**
+ * Cursor for paginated sources. The server emits an opaque string and the
+ * client passes it back verbatim in the next request's windowing.next.
+ * No client-side arithmetic.
+ *
+ * Object cursors are reserved for joined sources (see JoinedCursor).
+ * Null means "end of stream".
+ */
+export type Cursor = string | number | object | null
+
+/**
+ * Cursor shape for MultiSourceTransform-driven sources (e.g. derived.joined).
+ * Each side advances independently; the loop carries both.
+ */
+export interface JoinedCursor {
+    aCursor: string | null
+    bCursor: string | null
+}
+
+// ============================================================================
+// CHUNK — unit of iteration
+// ============================================================================
+
+/**
+ * Chunk metadata. Opaque to the loop; consumers can attach hints
+ * (source name, page index, server response headers).
+ */
+export interface ChunkMeta {
+    page?: number
+    hint?: string
+    [k: string]: unknown
+}
+
+/**
+ * A chunk carries its items plus enough metadata for the loop to advance.
+ * Cursor `null` signals end of stream.
+ */
+export interface Chunk<T> {
+    items: T[]
+    cursor: Cursor | null
+    meta?: ChunkMeta
+}
+
+// ============================================================================
+// SOURCE — lazy producer of chunks
+// ============================================================================
+
+/**
+ * A Source produces chunks lazily. AsyncIterable means the loop can pull
+ * one chunk at a time without holding earlier chunks in memory.
+ *
+ * Implementations must:
+ *   - Check `signal.aborted` between fetches and exit cleanly when set
+ *   - Yield one chunk per server response; let the loop advance the cursor
+ *   - Yield a final chunk with `cursor: null` when exhausted, OR simply return
+ *     without yielding again (both signal end-of-stream to the loop)
+ */
+export interface Source<T, Params = unknown> {
+    extract(params: Params, signal: AbortSignal): AsyncIterable<Chunk<T>>
+}
+
+// ============================================================================
+// TRANSFORM — chunk → chunk
+// ============================================================================
+
+/**
+ * A Transform is a pure (or pure-ish) function from one chunk to another.
+ * Compose by array — each transform in `runLoop`'s transforms[] runs in
+ * declared order. Short-circuits on empty: if a transform returns an
+ * empty chunk, subsequent transforms are skipped for that iteration.
+ *
+ * Async permitted (e.g. to await prefetched correlated data inside a
+ * predicate) but slow paths cost the loop directly — backpressure is
+ * the loop's behavior, not the transform's responsibility.
+ */
+export type Transform<In, Out> = (chunk: Chunk<In>) => Chunk<Out> | Promise<Chunk<Out>>
+
+/**
+ * Carries state across chunk boundaries during a multi-source join.
+ * Typically a Map<joinKey, sourceBRow> accumulator. Consumer-defined;
+ * the engine just threads it through the transform on each chunk.
+ */
+export interface JoinState {
+    hashMap?: Map<unknown, unknown>
+    [k: string]: unknown
+}
+
+/**
+ * A MultiSourceTransform reads from two chunks simultaneously, threading
+ * state across chunk boundaries. Used by derived.joined to implement
+ * client-side joins. See open question 8 in the design RFC.
+ */
+export type MultiSourceTransform<A, B, Out> = (
+    chunkA: Chunk<A>,
+    chunkB: Chunk<B>,
+    state: JoinState,
+) => Chunk<Out> | Promise<Chunk<Out>>
+
+// ============================================================================
+// SINK — chunk consumer
+// ============================================================================
+
+/**
+ * Result of a successful sink load. `loadedCount` may differ from
+ * `chunk.items.length` (e.g. a deduplicating sink).
+ */
+export interface LoadResult {
+    loadedCount?: number
+    warnings?: string[]
+}
+
+/**
+ * A Sink consumes chunks. `finalize?` is for commit-style sinks
+ * (testset revision commit, file close) — the loop calls it in a
+ * `finally` block so it runs on cancellation or error as well as
+ * normal completion.
+ */
+export interface Sink<T> {
+    load(chunk: Chunk<T>): Promise<LoadResult>
+    finalize?(): Promise<void>
+}
+
+// ============================================================================
+// PROGRESS — yielded per loop iteration
+// ============================================================================
+
+/**
+ * Progress event yielded after each chunk. Consumers read this to update
+ * UI counters, decide when to break out of the loop, or trigger
+ * tier-escalation in filter pipelines.
+ */
+export interface Progress {
+    scanned: number
+    matched: number
+    loaded: number
+    cursor: Cursor | null
+}
+
+/**
+ * The loop's final return value. Includes everything Progress carries
+ * plus a `done` flag distinguishing normal completion from cancellation
+ * mid-iteration.
+ */
+export interface LoopResult extends Progress {
+    done: boolean
+}
+
+// ============================================================================
+// CHUNK RELEASE HOOK — per-chunk cache eviction
+// ============================================================================
+
+/**
+ * Called once a chunk has been fully consumed by the sink. At this point
+ * nothing downstream in the loop needs the chunk's data — the safe point
+ * to release per-chunk side-effect caches (e.g. the entity caches a
+ * hydrate transform populated) so heap stays bounded by chunk size, not
+ * by dataset size.
+ *
+ * The loop's guarantee #1 ("memory bounded") covers the loop's own chunk
+ * variable — it does NOT cover caches a transform writes as a side
+ * effect. This hook is how a consumer extends that guarantee to those
+ * side-effect caches without the loop having to know about them.
+ *
+ * Receives the post-transform chunk, so the handler can walk every
+ * entity id the materialized rows reference. Note: if a transform that
+ * runs AFTER the cache-populating transform drops rows (e.g. a
+ * post-hydrate filter), the handler only sees the surviving rows — a
+ * cache-populating transform that is followed by a row-dropping
+ * transform should tag `chunk.meta` with its full id manifest instead.
+ */
+export type ChunkReleaseHook<T> = (chunk: Chunk<T>) => void | Promise<void>
diff --git a/web/packages/agenta-entities/src/etl/index.ts b/web/packages/agenta-entities/src/etl/index.ts
new file mode 100644
index 0000000000..53fb486d3b
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/index.ts
@@ -0,0 +1,54 @@
+/**
+ * @agenta/entities/etl
+ *
+ * General-purpose chunked iteration engine for ETL pipelines.
+ *
+ * Defines four contracts (Source, Transform, Sink, Chunk) and one
+ * runtime (runLoop). Zero entity coupling. See docs/designs/etl-engine.md
+ * for the design RFC, docs/designs/eval-etl-engine.md for the canonical
+ * consumer (eval's filter pipeline).
+ *
+ * @example
+ * ```ts
+ * import { runLoop, type Sink, type Source, type Transform } from "@agenta/entities/etl"
+ *
+ * const source: Source<MyRow> = { ... }
+ * const transform: Transform<MyRow, MyRow> = (chunk) => ({ ...chunk, items: chunk.items.filter(p) })
+ * const sink: Sink<MyRow> = { load: async (chunk) => ({ loadedCount: chunk.items.length }) }
+ *
+ * for await (const progress of runLoop(source, [transform], sink, params, signal)) {
+ *   console.log(progress)
+ * }
+ * ```
+ *
+ * @packageDocumentation
+ */
+
+export type {
+    Chunk,
+    ChunkMeta,
+    Cursor,
+    JoinedCursor,
+    JoinState,
+    LoadResult,
+    LoopResult,
+    MultiSourceTransform,
+    Progress,
+    Sink,
+    Source,
+    Transform,
+} from "./core/types"
+
+export {runLoop} from "./runtime/runLoop"
+
+export {makeSourceFromPaginatedStore} from "./adapters/makeSourceFromPaginatedStore"
+export type {MakeSourceParams, PaginatedStoreLike} from "./adapters/makeSourceFromPaginatedStore"
+
+export {makeSourceFromCursorFetch} from "./adapters/makeSourceFromCursorFetch"
+export type {CursorFetchSourceConfig, CursorPage} from "./adapters/makeSourceFromCursorFetch"
+
+export {BatchFlushError, makeBufferedBatchSink} from "./sinks/makeBufferedBatchSink"
+export type {BufferedBatchSinkConfig, BufferedBatchSinkHandle} from "./sinks/makeBufferedBatchSink"
+
+export {makeUniqueKeyTransform} from "./transforms/makeUniqueKeyTransform"
+export type {UniqueKeyTransformConfig} from "./transforms/makeUniqueKeyTransform"
diff --git a/web/packages/agenta-entities/src/etl/runtime/runLoop.ts b/web/packages/agenta-entities/src/etl/runtime/runLoop.ts
new file mode 100644
index 0000000000..77e20588f3
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/runtime/runLoop.ts
@@ -0,0 +1,120 @@
+/**
+ * ETL Loop Engine — Runtime
+ *
+ * The loop is one function. ~50 lines including comments. All five
+ * guarantees from `core/types.ts` fall out of this code:
+ *   1. Memory bounded: only `current` is held; previous chunks released
+ *   2. Cancellation: `signal.aborted` checked between iterations
+ *   3. Progress: yielded after every chunk
+ *   4. Backpressure: `await sink.load` blocks the loop
+ *   5. Resume: every Progress carries the last cursor; `finally` runs `sink.finalize?()` on any exit
+ *
+ * See docs/designs/etl-engine.md for the design RFC.
+ *
+ * @packageDocumentation
+ */
+
+import type {
+    Chunk,
+    ChunkReleaseHook,
+    Cursor,
+    LoopResult,
+    Progress,
+    Sink,
+    Source,
+    Transform,
+} from "../core/types"
+
+/**
+ * Iterate a pipeline chunk-by-chunk. AsyncGenerator yields a Progress
+ * event after each chunk and returns a LoopResult when done or cancelled.
+ *
+ * Consumer usage:
+ *   ```ts
+ *   const gen = runLoop(source, [filter, project], sink, params, signal)
+ *   for await (const progress of gen) {
+ *     if (progress.matched >= viewportSize) break  // viewport-cancel
+ *   }
+ *   ```
+ *
+ * Or, with access to the final result:
+ *   ```ts
+ *   while (true) {
+ *     const r = await gen.next()
+ *     if (r.done) { result = r.value; break }
+ *     handleProgress(r.value)
+ *   }
+ *   ```
+ *
+ * The loop accepts heterogeneous Transform arrays (`Transform<any, any>[]`)
+ * because TypeScript can't express "chain of transforms" with a single
+ * type parameter pair. Type safety on transforms is the consumer's
+ * responsibility — usually trivial via factory functions that return
+ * correctly-typed Transforms.
+ */
+// Type-erased Transform used in the loop's transforms[] array. TypeScript
+// cannot express "chain of transforms where output of N matches input of N+1"
+// with a single type parameter pair, so the engine accepts a heterogeneous
+// chain. Type safety on transforms is the consumer's responsibility via
+// factory functions returning correctly-typed Transforms (see worked examples).
+// eslint-disable-next-line @typescript-eslint/no-explicit-any
+type AnyTransform = Transform<any, any>
+
+export async function* runLoop<TIn, TOut>(
+    source: Source<TIn>,
+    transforms: AnyTransform[],
+    sink: Sink<TOut>,
+    params: Parameters<Source<TIn>["extract"]>[0],
+    signal?: AbortSignal,
+    /**
+     * Optional per-chunk release hook. Called after the sink has consumed
+     * each chunk (and before the Progress yield, so it runs even when a
+     * consumer viewport-cancels). Lets a consumer free per-chunk
+     * side-effect caches — see `ChunkReleaseHook` in core/types.ts.
+     */
+    onChunkReleased?: ChunkReleaseHook<TOut>,
+): AsyncGenerator<Progress, LoopResult> {
+    const abort = signal ?? new AbortController().signal
+    let scanned = 0
+    let matched = 0
+    let loaded = 0
+    let lastCursor: Cursor | null = null
+
+    try {
+        for await (const chunk of source.extract(params, abort)) {
+            if (abort.aborted) break
+
+            scanned += chunk.items.length
+            lastCursor = chunk.cursor
+
+            // Run transforms in order. Short-circuit on empty.
+            // eslint-disable-next-line @typescript-eslint/no-explicit-any
+            let current: Chunk<any> = chunk
+            for (const tx of transforms) {
+                current = await tx(current)
+                if (current.items.length === 0) break
+            }
+
+            matched += current.items.length
+
+            if (current.items.length > 0) {
+                const result = await sink.load(current as Chunk<TOut>)
+                loaded += result.loadedCount ?? current.items.length
+            }
+
+            // Chunk fully consumed — let the consumer release any per-chunk
+            // side-effect caches (e.g. hydrated entity caches). Runs before
+            // `yield` so a viewport-cancel still releases this chunk.
+            await onChunkReleased?.(current as Chunk<TOut>)
+
+            yield {scanned, matched, loaded, cursor: lastCursor}
+
+            // Source signaled end-of-stream via cursor: null
+            if (chunk.cursor === null) break
+        }
+    } finally {
+        await sink.finalize?.()
+    }
+
+    return {scanned, matched, loaded, cursor: lastCursor, done: !abort.aborted}
+}
diff --git a/web/packages/agenta-entities/src/etl/sinks/makeBufferedBatchSink.ts b/web/packages/agenta-entities/src/etl/sinks/makeBufferedBatchSink.ts
new file mode 100644
index 0000000000..20ec23595b
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/sinks/makeBufferedBatchSink.ts
@@ -0,0 +1,122 @@
+/**
+ * makeBufferedBatchSink — an ETL `Sink<T>` that buffers items and flushes
+ * them in fixed-size batches.
+ *
+ * Decouples the flush size from the scan page size: a 500-row page can flush
+ * as two 250-item batches, and a partial batch carries across page
+ * boundaries. The caller injects only the `flush` transport.
+ *
+ *   - `load`     flushes every full batch the buffer can produce.
+ *   - `finalize` flushes the remainder on clean completion. On cancel or
+ *                after a failed flush it drops the buffer instead, so a
+ *                partial write stays a clean multiple of `batchSize`.
+ *   - a failed flush throws `BatchFlushError` (carries the flushed-so-far
+ *     count) — surfaces, never silent.
+ *
+ * @packageDocumentation
+ */
+
+import type {Chunk, LoadResult, Sink} from "../core/types"
+
+/** Default items per flush. */
+const DEFAULT_BATCH_SIZE = 250
+
+/**
+ * A batch flush failed mid-run. Carries how many items were successfully
+ * flushed before the failure so callers can render an honest partial state.
+ */
+export class BatchFlushError extends Error {
+    /** Items successfully flushed before the failure. */
+    readonly flushedCount: number
+    /** Size of the batch that failed to flush. */
+    readonly failedCount: number
+    /** The underlying error, when the failure was a thrown exception. */
+    readonly originalError?: unknown
+
+    constructor(flushedCount: number, failedCount: number, originalError?: unknown) {
+        super(
+            `Batch flush failed for ${failedCount} item(s); ` +
+                `${flushedCount} flushed before the failure`,
+        )
+        this.name = "BatchFlushError"
+        this.flushedCount = flushedCount
+        this.failedCount = failedCount
+        this.originalError = originalError
+    }
+}
+
+export interface BufferedBatchSinkConfig<T> {
+    /** Flush one batch. A throw is wrapped as `BatchFlushError`. */
+    flush: (batch: T[]) => Promise<void>
+    /** Items per flush. Default 250. */
+    batchSize?: number
+    /**
+     * Run abort signal. When aborted, `finalize` drops the buffer instead of
+     * flushing it — the partial write stays a clean multiple of `batchSize`.
+     */
+    signal?: AbortSignal
+}
+
+export interface BufferedBatchSinkHandle<T> {
+    sink: Sink<T>
+    /** Items successfully flushed so far. */
+    getFlushedCount(): number
+}
+
+/**
+ * Build a `Sink<T>` plus a handle exposing the authoritative flushed-count
+ * (the engine's `Progress.loaded` lags by the unflushed buffer and never sees
+ * the `finalize` flush).
+ */
+export function makeBufferedBatchSink<T>(
+    config: BufferedBatchSinkConfig<T>,
+): BufferedBatchSinkHandle<T> {
+    const {flush, batchSize = DEFAULT_BATCH_SIZE, signal} = config
+
+    let buffer: T[] = []
+    let flushed = 0
+    let errored = false
+
+    const flushBatch = async (batch: T[]): Promise<void> => {
+        try {
+            await flush(batch)
+        } catch (err) {
+            errored = true
+            throw new BatchFlushError(flushed, batch.length, err)
+        }
+        flushed += batch.length
+    }
+
+    return {
+        getFlushedCount: () => flushed,
+        sink: {
+            async load(chunk: Chunk<T>): Promise<LoadResult> {
+                if (chunk.items.length > 0) buffer.push(...chunk.items)
+
+                let flushedThisCall = 0
+                while (buffer.length >= batchSize) {
+                    const batch = buffer.slice(0, batchSize)
+                    buffer = buffer.slice(batchSize)
+                    await flushBatch(batch)
+                    flushedThisCall += batch.length
+                }
+                // `loadedCount` is what the engine adds to `Progress.loaded` —
+                // report only what actually flushed this call.
+                return {loadedCount: flushedThisCall}
+            },
+            async finalize(): Promise<void> {
+                // Drop the buffer on cancel / after a failed flush — keeps a
+                // partial write a clean multiple of `batchSize`.
+                if (errored || signal?.aborted) {
+                    buffer = []
+                    return
+                }
+                if (buffer.length > 0) {
+                    const batch = buffer
+                    buffer = []
+                    await flushBatch(batch)
+                }
+            },
+        },
+    }
+}
diff --git a/web/packages/agenta-entities/src/etl/transforms/makeUniqueKeyTransform.ts b/web/packages/agenta-entities/src/etl/transforms/makeUniqueKeyTransform.ts
new file mode 100644
index 0000000000..c4e9461079
--- /dev/null
+++ b/web/packages/agenta-entities/src/etl/transforms/makeUniqueKeyTransform.ts
@@ -0,0 +1,56 @@
+/**
+ * makeUniqueKeyTransform — an ETL `Transform<TIn, string>` that emits each
+ * row's key exactly once across the whole scan.
+ *
+ * Dedups by key across chunk boundaries (captures a `Set`) and honors an
+ * optional exclude set. Rows with no key are skipped. An optional `limit`
+ * caps how many keys are emitted across the whole scan. Useful whenever a
+ * scan should write a deduplicated (and optionally bounded) set of ids
+ * downstream (e.g. trace ids → a batched sink).
+ *
+ * @packageDocumentation
+ */
+
+import type {Chunk, Transform} from "../core/types"
+
+export interface UniqueKeyTransformConfig<TIn> {
+    /** Extract the dedup key. Rows with no key (null/empty) are skipped. */
+    selectKey: (item: TIn) => string | undefined | null
+    /** Keys to treat as already-seen: never emitted and never counted toward `limit`. */
+    exclude?: ReadonlySet<string>
+    /**
+     * Cap on the total number of keys emitted across the whole scan. Once
+     * `limit` keys have been emitted, every later row is skipped and the
+     * transform yields empty chunks. Excluded keys never count toward the
+     * limit. Omit (or pass `undefined`) for no cap.
+     */
+    limit?: number
+}
+
+/**
+ * Build a `Transform<TIn, string>` that emits unique keys. The captured
+ * `Set` (and emitted counter) means one transform instance is good for
+ * exactly one scan.
+ */
+export function makeUniqueKeyTransform<TIn>(
+    config: UniqueKeyTransformConfig<TIn>,
+): Transform<TIn, string> {
+    const {selectKey, exclude, limit} = config
+    const seen = new Set<string>()
+    let emitted = 0
+
+    return (chunk: Chunk<TIn>): Chunk<string> => {
+        const keys: string[] = []
+        for (const item of chunk.items) {
+            // Stop emitting once the cap is hit — later rows yield nothing.
+            if (limit !== undefined && emitted >= limit) break
+            const key = selectKey(item)
+            if (!key || seen.has(key)) continue
+            seen.add(key)
+            if (exclude?.has(key)) continue
+            keys.push(key)
+            emitted += 1
+        }
+        return {items: keys, cursor: chunk.cursor, meta: chunk.meta}
+    }
+}
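The dedup, exclude, and limit rules described above can be sketched on their own. This is a minimal illustration, not the package code: `Chunk`, `Row`, and `makeUniqueKeys` are simplified local stand-ins (assumptions made so the snippet runs without the ETL core types), and the loop body restates the one in `makeUniqueKeyTransform`.

```typescript
// Simplified stand-ins for the ETL core types (assumptions for this sketch).
interface Chunk<T> {
    items: T[]
    cursor?: string | null
}

interface Row {
    traceId?: string | null
}

// Local restatement of the dedup loop so the sketch is self-contained.
function makeUniqueKeys(
    selectKey: (row: Row) => string | undefined | null,
    exclude?: ReadonlySet<string>,
    limit?: number,
): (chunk: Chunk<Row>) => Chunk<string> {
    const seen = new Set<string>()
    let emitted = 0
    return (chunk) => {
        const keys: string[] = []
        for (const item of chunk.items) {
            // Once the cap is reached, the rest of the scan yields nothing.
            if (limit !== undefined && emitted >= limit) break
            const key = selectKey(item)
            if (!key || seen.has(key)) continue
            seen.add(key)
            // Excluded keys are remembered but neither emitted nor counted.
            if (exclude?.has(key)) continue
            keys.push(key)
            emitted += 1
        }
        return {items: keys, cursor: chunk.cursor}
    }
}

// One instance per scan: the captured Set and counter span both chunks.
const uniqueKeys = makeUniqueKeys((r) => r.traceId, new Set(["t2"]), 3)
const firstChunk = uniqueKeys({
    items: [{traceId: "t1"}, {traceId: "t2"}, {traceId: "t1"}, {traceId: null}],
})
const secondChunk = uniqueKeys({
    items: [{traceId: "t1"}, {traceId: "t3"}, {traceId: "t4"}, {traceId: "t5"}],
})
// firstChunk.items is ["t1"]; secondChunk.items is ["t3", "t4"]
```

In the second chunk, `t1` is dropped because it was seen in the first chunk, and `t5` is never emitted because the limit of 3 was reached at `t4`. The excluded `t2` did not use up any of that limit.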
diff --git a/web/packages/agenta-entities/src/evaluationRun/api/api.ts b/web/packages/agenta-entities/src/evaluationRun/api/api.ts
index f831cb181e..9bbfe58432 100644
--- a/web/packages/agenta-entities/src/evaluationRun/api/api.ts
+++ b/web/packages/agenta-entities/src/evaluationRun/api/api.ts
@@ -9,19 +9,23 @@
 
 import {getAgentaApiUrl, axios} from "@agenta/shared/api"
 
-import {safeParseWithLogging} from "../../shared"
+// See testcase/api/api.ts for rationale — the shared barrel pulls in CSS deps.
+import {safeParseWithLogging} from "../../shared/utils/zodSchema"
 import {
     evaluationRunResponseSchema,
     evaluationRunsResponseSchema,
     evaluationResultsResponseSchema,
+    evaluationMetricsResponseSchema,
     type EvaluationRun,
     type EvaluationRunsResponse,
     type EvaluationResult,
+    type EvaluationMetric,
 } from "../core"
 import type {
     EvaluationRunDetailParams,
     EvaluationRunQueryParams,
     EvaluationResultsQueryParams,
+    EvaluationMetricsQueryParams,
 } from "../core"
 
 // ============================================================================
@@ -126,3 +130,42 @@ export async function queryEvaluationResults({
     )
     return validated?.results ?? []
 }
+
+// ============================================================================
+// QUERY EVALUATION METRICS
+// ============================================================================
+
+/**
+ * Query evaluation metrics by run ID and (optionally) scenario IDs.
+ *
+ * Metrics carry the actual scores / stat blobs. Per-scenario metrics have
+ * `scenario_id` populated; run-level aggregates have `scenario_id = null`.
+ *
+ * Endpoint: `POST /evaluations/metrics/query`
+ */
+export async function queryEvaluationMetrics({
+    projectId,
+    runId,
+    scenarioIds,
+}: EvaluationMetricsQueryParams): Promise<EvaluationMetric[]> {
+    if (!projectId || !runId) return []
+    if (scenarioIds && scenarioIds.length === 0) return []
+
+    const body: Record<string, unknown> = {
+        metrics: {
+            run_id: runId,
+            ...(scenarioIds?.length ? {scenario_ids: scenarioIds} : {}),
+        },
+    }
+
+    const response = await axios.post(`${getAgentaApiUrl()}/evaluations/metrics/query`, body, {
+        params: {project_id: projectId},
+    })
+
+    const validated = safeParseWithLogging(
+        evaluationMetricsResponseSchema,
+        response.data,
+        "[queryEvaluationMetrics]",
+    )
+    return validated?.metrics ?? []
+}
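The difference between omitting `scenarioIds` and passing an empty array is easy to miss, so here is a small sketch of the request planning only. `planMetricsQuery` is a hypothetical helper introduced for illustration (it does not exist in the package); it mirrors the guards and body construction of `queryEvaluationMetrics` without issuing any HTTP call.

```typescript
// Hypothetical helper: returns the request body, or null when the real
// function would short-circuit and return [] without calling the API.
function planMetricsQuery(
    projectId: string,
    runId: string,
    scenarioIds?: string[],
): Record<string, unknown> | null {
    if (!projectId || !runId) return null
    // An explicit empty list means "no scenarios", not "all metrics".
    if (scenarioIds && scenarioIds.length === 0) return null
    return {
        metrics: {
            run_id: runId,
            ...(scenarioIds?.length ? {scenario_ids: scenarioIds} : {}),
        },
    }
}

const allRunMetrics = planMetricsQuery("p1", "r1")
const noRequest = planMetricsQuery("p1", "r1", [])
const scoped = planMetricsQuery("p1", "r1", ["s1", "s2"])
```

Under these assumptions, `allRunMetrics` carries only `run_id`, `noRequest` is `null`, and `scoped` carries `scenario_ids` alongside `run_id`.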
diff --git a/web/packages/agenta-entities/src/evaluationRun/api/index.ts b/web/packages/agenta-entities/src/evaluationRun/api/index.ts
index 3f130ead27..c36695c9c4 100644
--- a/web/packages/agenta-entities/src/evaluationRun/api/index.ts
+++ b/web/packages/agenta-entities/src/evaluationRun/api/index.ts
@@ -1 +1,6 @@
-export {fetchEvaluationRun, queryEvaluationRuns, queryEvaluationResults} from "./api"
+export {
+    fetchEvaluationRun,
+    queryEvaluationRuns,
+    queryEvaluationResults,
+    queryEvaluationMetrics,
+} from "./api"
diff --git a/web/packages/agenta-entities/src/evaluationRun/core/index.ts b/web/packages/agenta-entities/src/evaluationRun/core/index.ts
index fe57db9cf9..b472aef13e 100644
--- a/web/packages/agenta-entities/src/evaluationRun/core/index.ts
+++ b/web/packages/agenta-entities/src/evaluationRun/core/index.ts
@@ -28,10 +28,16 @@ export {
     type EvaluationResult,
     evaluationResultsResponseSchema,
     type EvaluationResultsResponse,
+    // Evaluation Metrics
+    evaluationMetricSchema,
+    type EvaluationMetric,
+    evaluationMetricsResponseSchema,
+    type EvaluationMetricsResponse,
 } from "./schema"
 
 export type {
     EvaluationRunDetailParams,
     EvaluationRunQueryParams,
     EvaluationResultsQueryParams,
+    EvaluationMetricsQueryParams,
 } from "./types"
diff --git a/web/packages/agenta-entities/src/evaluationRun/core/schema.ts b/web/packages/agenta-entities/src/evaluationRun/core/schema.ts
index 5fc00e787a..b5b6127587 100644
--- a/web/packages/agenta-entities/src/evaluationRun/core/schema.ts
+++ b/web/packages/agenta-entities/src/evaluationRun/core/schema.ts
@@ -9,7 +9,11 @@
 
 import {z} from "zod"
 
-import {auditFieldsSchema, timestampFieldsSchema} from "../../shared"
+// Import from the pure zodSchema source rather than the shared barrel. The
+// shared barrel transitively re-exports paginated/table helpers that depend on
+// agenta-ui (CSS modules), which breaks Node-side execution. Schemas must stay
+// Node-safe so they can be reused in scripts, tests, and ETL adapters.
+import {auditFieldsSchema, timestampFieldsSchema} from "../../shared/utils/zodSchema"
 
 // ============================================================================
 // ENUMS
@@ -168,3 +172,49 @@ export const evaluationResultsResponseSchema = z.object({
     results: z.array(evaluationResultSchema).default([]),
 })
 export type EvaluationResultsResponse = z.infer<typeof evaluationResultsResponseSchema>
+
+// ============================================================================
+// EVALUATION METRIC SCHEMAS
+// ============================================================================
+
+/**
+ * A single evaluation metric — carries the actual scores / stat blobs for a
+ * scenario (when `scenario_id` is set) or the whole run (when null, i.e. a run-level aggregate).
+ *
+ * `data` is a nested dict keyed by step_key, with values that are either raw
+ * scores or stat objects (e.g. `{type: "numeric/continuous", mean: 7.5, ...}` or
+ * `{type: "binary", freq: [...]}`). The shape of `data` is run-specific and
+ * driven by run.data.mappings — consumers should join through mappings to
+ * resolve column names.
+ *
+ * Fetched via `POST /evaluations/metrics/query`.
+ */
+export const evaluationMetricSchema = z
+    .object({
+        id: z.string(),
+        run_id: z.string(),
+        // null on run-level aggregates, populated on per-scenario metrics
+        scenario_id: z.string().nullable().optional(),
+        status: z.string().nullable().optional(),
+        // Used for temporal metrics; null on point-in-time metrics
+        interval: z.number().nullable().optional(),
+        timestamp: z.string().nullable().optional(),
+        // The actual values keyed by step_key → mapping path
+        data: z.record(z.string(), z.unknown()).nullable().optional(),
+        flags: z.record(z.string(), z.unknown()).nullable().optional(),
+        tags: z.array(z.string()).nullable().optional(),
+        meta: z.record(z.string(), z.unknown()).nullable().optional(),
+    })
+    .merge(timestampFieldsSchema)
+    .merge(auditFieldsSchema)
+
+export type EvaluationMetric = z.infer<typeof evaluationMetricSchema>
+
+/**
+ * Response envelope for evaluation metrics query.
+ */
+export const evaluationMetricsResponseSchema = z.object({
+    count: z.number().optional().default(0),
+    metrics: z.array(evaluationMetricSchema).default([]),
+})
+export type EvaluationMetricsResponse = z.infer<typeof evaluationMetricsResponseSchema>
diff --git a/web/packages/agenta-entities/src/evaluationRun/core/types.ts b/web/packages/agenta-entities/src/evaluationRun/core/types.ts
index 0c1be9334c..427a002737 100644
--- a/web/packages/agenta-entities/src/evaluationRun/core/types.ts
+++ b/web/packages/agenta-entities/src/evaluationRun/core/types.ts
@@ -33,3 +33,14 @@ export interface EvaluationResultsQueryParams {
     scenarioIds?: string[]
     stepKeys?: string[]
 }
+
+/**
+ * Params for querying evaluation metrics.
+ * Metrics are joined to runs (run-level aggregates) or scenarios (per-scenario).
+ */
+export interface EvaluationMetricsQueryParams {
+    projectId: string
+    runId: string
+    /** Restrict to per-scenario metrics for these scenarios. Omit for all run metrics; `[]` returns none. */
+    scenarioIds?: string[]
+}
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/filterSchema.test.ts b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/filterSchema.test.ts
new file mode 100644
index 0000000000..86bb7eed80
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/filterSchema.test.ts
@@ -0,0 +1,107 @@
+/**
+ * buildFilterSchema — derives the filterable fields the Phase 2 / T4
+ * filter UI offers (decision D8).
+ *
+ * Covers field derivation, the schema-only value-type heuristic, the
+ * type-matched operator sets, the `resolveValueType` refinement seam, and
+ * deduplication.
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import {buildFilterSchema, operatorsForType} from "../filterSchema"
+import type {RunSchema} from "../resolveMappings"
+
+const SCHEMA: RunSchema = {
+    steps: [
+        {key: "in", type: "input", references: {testset: {id: "t1", slug: "ts"}}},
+        {key: "ev", type: "annotation", references: {evaluator: {id: "e1", slug: "exact-match"}}},
+    ],
+    mappings: [
+        {column: {kind: "input", name: "question"}, step: {key: "in", path: "data.question"}},
+        {column: {kind: "annotation", name: "success"}, step: {key: "ev", path: "out"}},
+        // Metrics path overrides step-type grouping → "metrics" group.
+        {
+            column: {kind: "metric", name: "Cost"},
+            step: {key: "ev", path: "attributes.ag.metrics.cost"},
+        },
+    ],
+}
+
+describe("operatorsForType", () => {
+    it("number gets ordered comparisons", () => {
+        const ops = operatorsForType("number")
+        for (const op of ["lt", "lte", "gt", "gte"]) assert.ok(ops.includes(op as never))
+    })
+
+    it("unknown / boolean withhold ordered comparisons", () => {
+        for (const op of ["lt", "gt"]) {
+            assert.equal(operatorsForType("unknown").includes(op as never), false)
+            assert.equal(operatorsForType("boolean").includes(op as never), false)
+        }
+    })
+
+    it("returns a fresh array (callers may mutate)", () => {
+        const a = operatorsForType("number")
+        a.pop()
+        assert.notEqual(a.length, operatorsForType("number").length)
+    })
+})
+
+describe("buildFilterSchema", () => {
+    it("returns an empty schema for a null run schema", () => {
+        assert.deepEqual(buildFilterSchema(null), {fields: []})
+    })
+
+    it("emits one field per mapped column", () => {
+        const {fields} = buildFilterSchema(SCHEMA)
+        assert.deepEqual(fields.map((f) => f.columnName).sort(), ["Cost", "question", "success"])
+    })
+
+    it("types metrics columns as number, others as unknown", () => {
+        const {fields} = buildFilterSchema(SCHEMA)
+        const cost = fields.find((f) => f.columnName === "Cost")
+        const success = fields.find((f) => f.columnName === "success")
+        assert.equal(cost?.valueType, "number")
+        assert.equal(success?.valueType, "unknown")
+        assert.ok(cost?.operators.includes("gt"))
+        assert.equal(success?.operators.includes("gt"), false)
+    })
+
+    it("carries the targeting triple + labels", () => {
+        const {fields} = buildFilterSchema(SCHEMA)
+        const success = fields.find((f) => f.columnName === "success")
+        assert.equal(success?.groupKind, "evaluator")
+        assert.equal(success?.groupSlug, "exact-match")
+        assert.equal(success?.label, "success")
+        assert.ok(success?.groupLabel)
+    })
+
+    it("resolveValueType refines a field's type + operators", () => {
+        const {fields} = buildFilterSchema(SCHEMA, {
+            resolveValueType: (f) => (f.columnName === "success" ? "boolean" : undefined),
+        })
+        const success = fields.find((f) => f.columnName === "success")
+        assert.equal(success?.valueType, "boolean")
+        assert.deepEqual(success?.operators, ["eq", "ne"])
+        // Untouched fields keep the schema-only default.
+        assert.equal(fields.find((f) => f.columnName === "Cost")?.valueType, "number")
+    })
+
+    it("deduplicates identical (groupKind, groupSlug, columnName) triples", () => {
+        const dupSchema: RunSchema = {
+            steps: SCHEMA.steps,
+            mappings: [
+                ...SCHEMA.mappings,
+                // Same column name + same step as an existing mapping.
+                {
+                    column: {kind: "input", name: "question"},
+                    step: {key: "in", path: "data.question"},
+                },
+            ],
+        }
+        const {fields} = buildFilterSchema(dupSchema)
+        assert.equal(fields.filter((f) => f.columnName === "question").length, 1)
+    })
+})
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/groupRunColumns.test.ts b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/groupRunColumns.test.ts
new file mode 100644
index 0000000000..6026360173
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/groupRunColumns.test.ts
@@ -0,0 +1,197 @@
+/**
+ * groupRunColumns — column-parity regression guard for the ETL scenario
+ * table migration (docs/designs/eval-scenarios-table-integration.md, T2).
+ *
+ * The backend-metadata column path (`usePreviewColumns`) and the run-graph
+ * column path (`useEtlColumns` → `groupRunColumns`) must surface the SAME
+ * visible column set. The most load-bearing part of that: the PoC's
+ * `useEtlColumns` dropped `group.kind === "other"` columns ("skip in the
+ * test page"). Production must keep them — dropping them silently shrinks
+ * the user-visible column set. These tests pin that down.
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import {groupRunColumns, type RunMapping, type RunStep} from "../resolveMappings"
+
+// A representative testset+app+evaluator run schema. auto / human / online
+// runs all share this shape — the eval type only changes which metrics
+// show and the scenario fetch order, neither of which `groupRunColumns`
+// (a pure steps+mappings function) is aware of.
+const STEPS: RunStep[] = [
+    {key: "input", type: "input", references: {testset: {id: "ts1", slug: "my-testset"}}},
+    {
+        key: "invocation",
+        type: "invocation",
+        references: {application: {id: "app1", slug: "my-app"}},
+    },
+    {
+        key: "eval-exact",
+        type: "annotation",
+        references: {evaluator: {id: "ev1", slug: "exact-match"}},
+    },
+]
+
+const MAPPINGS: RunMapping[] = [
+    {column: {kind: "input", name: "question"}, step: {key: "input", path: "data.question"}},
+    {
+        column: {kind: "input", name: "ground_truth"},
+        step: {key: "input", path: "data.ground_truth"},
+    },
+    {
+        column: {kind: "invocation", name: "output"},
+        step: {key: "invocation", path: "attributes.ag.data.outputs"},
+    },
+    {
+        column: {kind: "annotation", name: "success"},
+        step: {key: "eval-exact", path: "attributes.ag.data.outputs.success"},
+    },
+    // Metrics path overrides the step-type grouping → "metrics" group.
+    {
+        column: {kind: "metric", name: "Cost"},
+        step: {key: "invocation", path: "attributes.ag.metrics.costs.cumulative.total"},
+    },
+]
+
+describe("groupRunColumns — testset/app/evaluator/metrics", () => {
+    it("groups columns by source in stable order", () => {
+        const grouped = groupRunColumns(STEPS, MAPPINGS)
+        assert.deepEqual(
+            grouped.map((g) => g.group.kind),
+            ["testset", "application", "evaluator", "metrics"],
+        )
+    })
+
+    it("keeps every mapped column — none dropped", () => {
+        const grouped = groupRunColumns(STEPS, MAPPINGS)
+        const total = grouped.reduce((n, g) => n + g.columns.length, 0)
+        assert.equal(total, MAPPINGS.length)
+    })
+
+    it("places multiple columns under their shared group", () => {
+        const grouped = groupRunColumns(STEPS, MAPPINGS)
+        const testset = grouped.find((g) => g.group.kind === "testset")
+        assert.ok(testset)
+        assert.deepEqual(
+            testset.columns.map((c) => c.name),
+            ["question", "ground_truth"],
+        )
+    })
+
+    it("carries group kind + slug onto each leaf", () => {
+        const grouped = groupRunColumns(STEPS, MAPPINGS)
+        const evaluator = grouped.find((g) => g.group.kind === "evaluator")
+        assert.ok(evaluator)
+        assert.equal(evaluator.columns[0].name, "success")
+        assert.equal(evaluator.columns[0].kind, "evaluator")
+        assert.equal(evaluator.columns[0].groupSlug, "exact-match")
+    })
+})
+
+describe("groupRunColumns — 'other' columns are INCLUDED (regression)", () => {
+    it("includes columns whose step has an unrecognised type", () => {
+        const steps: RunStep[] = [...STEPS, {key: "transform", type: "transform"}]
+        const mappings: RunMapping[] = [
+            ...MAPPINGS,
+            {
+                column: {kind: "transform", name: "normalized"},
+                step: {key: "transform", path: "data.normalized"},
+            },
+        ]
+        const grouped = groupRunColumns(steps, mappings)
+        const other = grouped.find((g) => g.group.kind === "other")
+        assert.ok(other, "the unrecognised-step column must produce an 'other' group")
+        assert.deepEqual(
+            other.columns.map((c) => c.name),
+            ["normalized"],
+        )
+        // "other" sorts last.
+        assert.equal(grouped[grouped.length - 1].group.kind, "other")
+    })
+
+    it("includes columns whose mapping references a missing step", () => {
+        const mappings: RunMapping[] = [
+            ...MAPPINGS,
+            {column: {kind: "meta", name: "orphan"}, step: {key: "does-not-exist", path: "x"}},
+        ]
+        const grouped = groupRunColumns(STEPS, mappings)
+        const other = grouped.find((g) => g.group.kind === "other")
+        assert.ok(other, "a mapping with no resolvable step must produce an 'other' group")
+        assert.deepEqual(
+            other.columns.map((c) => c.name),
+            ["orphan"],
+        )
+    })
+
+    it("the visible column count includes 'other' columns", () => {
+        const steps: RunStep[] = [...STEPS, {key: "transform", type: "transform"}]
+        const mappings: RunMapping[] = [
+            ...MAPPINGS,
+            {column: {name: "normalized"}, step: {key: "transform", path: "p"}},
+        ]
+        const grouped = groupRunColumns(steps, mappings)
+        const total = grouped.reduce((n, g) => n + g.columns.length, 0)
+        assert.equal(total, mappings.length)
+    })
+})
+
+describe("groupRunColumns — edge cases", () => {
+    it("skips mappings with no column name", () => {
+        const mappings: RunMapping[] = [
+            ...MAPPINGS,
+            {column: {kind: "input", name: ""}, step: {key: "input", path: "data.blank"}},
+            {column: {kind: "input"}, step: {key: "input", path: "data.noname"}},
+        ]
+        const grouped = groupRunColumns(STEPS, mappings)
+        const total = grouped.reduce((n, g) => n + g.columns.length, 0)
+        assert.equal(total, MAPPINGS.length)
+    })
+
+    it("returns an empty list for an empty schema", () => {
+        assert.deepEqual(groupRunColumns([], []), [])
+    })
+
+    it("drops internal _dedup_id columns (regression)", () => {
+        const mappings: RunMapping[] = [
+            ...MAPPINGS,
+            {
+                column: {kind: "input", name: "testcase_dedup_id"},
+                step: {key: "input", path: "data.testcase_dedup_id"},
+            },
+        ]
+        const grouped = groupRunColumns(STEPS, mappings)
+        const names = grouped.flatMap((g) => g.columns.map((c) => c.name))
+        assert.equal(names.includes("testcase_dedup_id"), false)
+        // The dedup column is excluded; every other mapped column is kept.
+        assert.equal(
+            grouped.reduce((n, g) => n + g.columns.length, 0),
+            MAPPINGS.length,
+        )
+    })
+
+    it("disambiguates two evaluators emitting the same column name", () => {
+        const steps: RunStep[] = [
+            ...STEPS,
+            {
+                key: "eval-judge",
+                type: "annotation",
+                references: {evaluator: {id: "ev2", slug: "llm-judge"}},
+            },
+        ]
+        const mappings: RunMapping[] = [
+            ...MAPPINGS,
+            {
+                column: {kind: "annotation", name: "success"},
+                step: {key: "eval-judge", path: "attributes.ag.data.outputs.success"},
+            },
+        ]
+        const grouped = groupRunColumns(steps, mappings)
+        const evaluators = grouped.filter((g) => g.group.kind === "evaluator")
+        assert.equal(evaluators.length, 2)
+        assert.deepEqual(
+            evaluators.map((g) => g.group.slug),
+            ["exact-match", "llm-judge"],
+        )
+    })
+})
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/hitRatioMeter.test.ts b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/hitRatioMeter.test.ts
new file mode 100644
index 0000000000..e718d42dfa
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/hitRatioMeter.test.ts
@@ -0,0 +1,158 @@
+/**
+ * hitRatioMeter — unit tests for the v1→v2 escalation signal.
+ *
+ * The meter has three observable states and a small policy surface
+ * (rolling window + threshold). These tests lock in the regime transitions
+ * and the edge cases that would otherwise drift silently.
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import {createHitRatioMeter} from "../hitRatioMeter"
+
+// =============================================================================
+// State machine — warming → client → escalate
+// =============================================================================
+
+describe("hitRatioMeter — state transitions", () => {
+    it("starts in `warming` with no observations", () => {
+        const meter = createHitRatioMeter()
+        const r = meter.regime()
+        assert.equal(r.state, "warming")
+        assert.equal(r.rollingRatio, null)
+        assert.equal(r.chunksObserved, 0)
+    })
+
+    it("stays `warming` until windowSize chunks observed (default 3)", () => {
+        const meter = createHitRatioMeter()
+        meter.record({chunk: 1, scanned: 50, matched: 0})
+        assert.equal(meter.regime().state, "warming")
+        meter.record({chunk: 2, scanned: 50, matched: 0})
+        assert.equal(meter.regime().state, "warming")
+        meter.record({chunk: 3, scanned: 50, matched: 0})
+        // Now has 3 chunks → transitions out of warming
+        assert.notEqual(meter.regime().state, "warming")
+    })
+
+    it("recommends `client` when rolling ratio >= threshold", () => {
+        const meter = createHitRatioMeter({windowSize: 3, threshold: 0.1})
+        meter.record({chunk: 1, scanned: 50, matched: 45}) // 90%
+        meter.record({chunk: 2, scanned: 50, matched: 40}) // 80%
+        meter.record({chunk: 3, scanned: 50, matched: 42}) // 84%
+        const r = meter.regime()
+        assert.equal(r.state, "client")
+        assert.ok(r.rollingRatio !== null && r.rollingRatio > 0.8)
+    })
+
+    it("recommends `escalate` when rolling ratio < threshold", () => {
+        const meter = createHitRatioMeter({windowSize: 3, threshold: 0.1})
+        meter.record({chunk: 1, scanned: 50, matched: 1}) // 2%
+        meter.record({chunk: 2, scanned: 50, matched: 2}) // 4%
+        meter.record({chunk: 3, scanned: 50, matched: 1}) // 2%
+        const r = meter.regime()
+        assert.equal(r.state, "escalate")
+        assert.ok(r.rollingRatio !== null && r.rollingRatio < 0.1)
+        assert.match(r.reason, /recommend v2 server-side filter/)
+    })
+
+    it("oscillates between client and escalate as the rolling window slides", () => {
+        const meter = createHitRatioMeter({windowSize: 3, threshold: 0.1})
+        meter.record({chunk: 1, scanned: 50, matched: 50}) // 100%
+        meter.record({chunk: 2, scanned: 50, matched: 50}) // 100%
+        meter.record({chunk: 3, scanned: 50, matched: 50}) // 100%
+        assert.equal(meter.regime().state, "client")
+        // Slide window: window=[c2,c3,c4] = 100,100,0 → still 67%, client
+        meter.record({chunk: 4, scanned: 50, matched: 0})
+        assert.equal(meter.regime().state, "client")
+        // Slide: [c3,c4,c5] = 100,0,0 → 33%, still client (above 10%)
+        meter.record({chunk: 5, scanned: 50, matched: 0})
+        assert.equal(meter.regime().state, "client")
+        // Slide: [c4,c5,c6] = 0,0,0 → 0%, escalate
+        meter.record({chunk: 6, scanned: 50, matched: 0})
+        assert.equal(meter.regime().state, "escalate")
+        // Slide back up: [c5,c6,c7] = 0,0,100 → 33%, back to client
+        meter.record({chunk: 7, scanned: 50, matched: 50})
+        assert.equal(meter.regime().state, "client")
+    })
+})
+
+// =============================================================================
+// Edge cases
+// =============================================================================
+
+describe("hitRatioMeter — edge cases", () => {
+    it("zero-scanned chunks count as observed but contribute 0 to ratio", () => {
+        const meter = createHitRatioMeter({windowSize: 3, threshold: 0.1})
+        meter.record({chunk: 1, scanned: 0, matched: 0})
+        meter.record({chunk: 2, scanned: 50, matched: 25})
+        meter.record({chunk: 3, scanned: 50, matched: 25})
+        const r = meter.regime()
+        // total scanned across window: 100, total matched: 50 → 50%
+        assert.equal(r.rollingRatio, 0.5)
+        assert.equal(r.state, "client")
+    })
+
+    it("dedups repeated chunk indices — caller can replay without distortion", () => {
+        const meter = createHitRatioMeter({windowSize: 3, threshold: 0.1})
+        meter.record({chunk: 1, scanned: 50, matched: 45})
+        meter.record({chunk: 1, scanned: 50, matched: 45}) // duplicate
+        meter.record({chunk: 1, scanned: 50, matched: 45}) // duplicate
+        meter.record({chunk: 2, scanned: 50, matched: 45})
+        meter.record({chunk: 3, scanned: 50, matched: 45})
+        const r = meter.regime()
+        assert.equal(r.chunksObserved, 3, "duplicates ignored")
+    })
+
+    it("reset() drops all observations and returns to warming", () => {
+        const meter = createHitRatioMeter()
+        for (let i = 1; i <= 5; i++) meter.record({chunk: i, scanned: 50, matched: 50})
+        assert.notEqual(meter.regime().state, "warming")
+        meter.reset()
+        assert.equal(meter.regime().state, "warming")
+        assert.equal(meter.regime().chunksObserved, 0)
+    })
+
+    it("windows() returns observations in chunk-arrival order", () => {
+        const meter = createHitRatioMeter()
+        meter.record({chunk: 1, scanned: 50, matched: 10})
+        meter.record({chunk: 2, scanned: 50, matched: 20})
+        meter.record({chunk: 3, scanned: 50, matched: 30})
+        const ws = meter.windows()
+        assert.equal(ws.length, 3)
+        assert.equal(ws[0].chunk, 1)
+        assert.equal(ws[0].ratio, 0.2)
+        assert.equal(ws[2].chunk, 3)
+        assert.equal(ws[2].ratio, 0.6)
+    })
+
+    it("rejects invalid windowSize", () => {
+        assert.throws(() => createHitRatioMeter({windowSize: 0}), /windowSize must be >= 1/)
+    })
+
+    it("rejects threshold outside [0, 1]", () => {
+        assert.throws(() => createHitRatioMeter({threshold: -0.1}), /threshold must be/)
+        assert.throws(() => createHitRatioMeter({threshold: 1.1}), /threshold must be/)
+    })
+
+    it("custom windowSize affects when regime is decidable", () => {
+        const meter = createHitRatioMeter({windowSize: 5, threshold: 0.1})
+        for (let i = 1; i <= 4; i++) meter.record({chunk: i, scanned: 50, matched: 0})
+        assert.equal(meter.regime().state, "warming")
+        meter.record({chunk: 5, scanned: 50, matched: 0})
+        assert.equal(meter.regime().state, "escalate")
+    })
+
+    it("custom threshold drives the same windows to different regimes", () => {
+        const high = createHitRatioMeter({windowSize: 3, threshold: 0.5})
+        const low = createHitRatioMeter({windowSize: 3, threshold: 0.1})
+        for (let i = 1; i <= 3; i++) {
+            high.record({chunk: i, scanned: 50, matched: 15}) // 30%
+            low.record({chunk: i, scanned: 50, matched: 15})
+        }
+        // High threshold (50%): 30% < 50% → escalate
+        assert.equal(high.regime().state, "escalate")
+        // Low threshold (10%): 30% >= 10% → client
+        assert.equal(low.regime().state, "client")
+    })
+})
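The regime rule these tests pin down can be sketched as a pure function. This is an illustration inferred from the test expectations, not the `hitRatioMeter` implementation: `classifyRegime` and `ChunkObservation` are hypothetical names, duplicate-chunk handling is left out, and treating an all-zero-scanned window as ratio 0 is an assumption the tests do not cover.

```typescript
interface ChunkObservation {
    chunk: number
    scanned: number
    matched: number
}

type RegimeState = "warming" | "client" | "escalate"

// Hypothetical pure restatement of the rolling-window rule.
function classifyRegime(
    observations: ChunkObservation[],
    windowSize: number,
    threshold: number,
): {state: RegimeState; rollingRatio: number | null} {
    // Not enough chunks yet: no recommendation.
    if (observations.length < windowSize) return {state: "warming", rollingRatio: null}
    const recent = observations.slice(-windowSize)
    const scanned = recent.reduce((n, o) => n + o.scanned, 0)
    const matched = recent.reduce((n, o) => n + o.matched, 0)
    // Ratio of totals over the window, not a mean of per-chunk ratios.
    // Assumption: a window with nothing scanned counts as ratio 0.
    const rollingRatio = scanned === 0 ? 0 : matched / scanned
    return {state: rollingRatio >= threshold ? "client" : "escalate", rollingRatio}
}

const warming = classifyRegime([{chunk: 1, scanned: 50, matched: 0}], 3, 0.1)
const mixed = classifyRegime(
    [
        {chunk: 1, scanned: 0, matched: 0},
        {chunk: 2, scanned: 50, matched: 25},
        {chunk: 3, scanned: 50, matched: 25},
    ],
    3,
    0.1,
)
```

Summing totals across the window is what makes the zero-scanned case come out at 0.5: the empty chunk occupies a window slot but adds nothing to either total.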
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/predicateGroup.test.ts b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/predicateGroup.test.ts
new file mode 100644
index 0000000000..3c11d614c9
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/predicateGroup.test.ts
@@ -0,0 +1,329 @@
+/**
+ * Multi-predicate AND/OR filtering (Phase 2 / T4 — decision D8).
+ *
+ * Covers the pure predicate-evaluation core: single predicate, flat AND/OR
+ * groups, the `RowFilter` dispatch, the row-level `matchesRowFilter`
+ * convenience, the `makePredicateGroupFilter` pipeline transform, and the
+ * `predicateToEntitySlices` union for group inputs.
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import type {Chunk} from "../../../etl/core/types"
+import type {HydratedScenarioRow} from "../hydrateScenariosTransform"
+import {predicateToEntitySlices} from "../predicateToEntitySlices"
+import type {ColumnGroup, ResolvedColumn, RunSchema} from "../resolveMappings"
+import {
+    evaluatePredicateGroup,
+    evaluateRowFilter,
+    evaluateRowPredicate,
+    isPredicateGroup,
+    makePredicateGroupFilter,
+    matchesRowFilter,
+    type PredicateGroup,
+    type RowPredicate,
+} from "../rowPredicateFilter"
+
+// A resolved column fixture — the shape `resolveMappings` emits.
+function col(opts: {
+    name: string
+    kind: ColumnGroup["kind"]
+    slug?: string | null
+    value: unknown
+}): ResolvedColumn {
+    const group: ColumnGroup = {
+        kind: opts.kind,
+        slug: opts.slug ?? null,
+        label: opts.kind,
+        key: `${opts.kind}:${opts.slug ?? "x"}`,
+        refs: null,
+    }
+    return {
+        name: opts.name,
+        kind: opts.kind,
+        stepKey: "step",
+        stepType: opts.kind,
+        path: "",
+        value: opts.value,
+        source: "metric",
+        group,
+    }
+}
+
+const COLS: ResolvedColumn[] = [
+    col({name: "success", kind: "evaluator", slug: "exact-match", value: true}),
+    col({name: "score", kind: "evaluator", slug: "llm-judge", value: 0.9}),
+    col({name: "country", kind: "testset", slug: "ts", value: "US"}),
+]
+
+// =============================================================================
+// evaluateRowPredicate — one clause
+// =============================================================================
+
+describe("evaluateRowPredicate", () => {
+    it("eq / ne", () => {
+        assert.equal(
+            evaluateRowPredicate(
+                {groupKind: "evaluator", columnName: "success", op: "eq", value: true},
+                COLS,
+            ),
+            true,
+        )
+        assert.equal(
+            evaluateRowPredicate(
+                {groupKind: "evaluator", columnName: "success", op: "ne", value: true},
+                COLS,
+            ),
+            false,
+        )
+    })
+
+    it("numeric comparisons", () => {
+        const p = (op: RowPredicate["op"], value: number): RowPredicate => ({
+            groupKind: "evaluator",
+            columnName: "score",
+            op,
+            value,
+        })
+        assert.equal(evaluateRowPredicate(p("gt", 0.8), COLS), true)
+        assert.equal(evaluateRowPredicate(p("gte", 0.9), COLS), true)
+        assert.equal(evaluateRowPredicate(p("lt", 0.8), COLS), false)
+        assert.equal(evaluateRowPredicate(p("lte", 0.9), COLS), true)
+    })
+
+    it("in / nin", () => {
+        assert.equal(
+            evaluateRowPredicate(
+                {groupKind: "testset", columnName: "country", op: "in", value: ["US", "CA"]},
+                COLS,
+            ),
+            true,
+        )
+        assert.equal(
+            evaluateRowPredicate(
+                {groupKind: "testset", columnName: "country", op: "nin", value: ["US", "CA"]},
+                COLS,
+            ),
+            false,
+        )
+    })
+
+    it("narrows by groupSlug when set", () => {
+        // Same column name across two evaluators — slug disambiguates.
+        const cols = [
+            col({name: "success", kind: "evaluator", slug: "a", value: true}),
+            col({name: "success", kind: "evaluator", slug: "b", value: false}),
+        ]
+        assert.equal(
+            evaluateRowPredicate(
+                {
+                    groupKind: "evaluator",
+                    groupSlug: "b",
+                    columnName: "success",
+                    op: "eq",
+                    value: false,
+                },
+                cols,
+            ),
+            true,
+        )
+    })
+
+    it("a missing column fails eq but passes ne (compares against undefined)", () => {
+        const p = (op: RowPredicate["op"]): RowPredicate => ({
+            groupKind: "evaluator",
+            columnName: "does-not-exist",
+            op,
+            value: true,
+        })
+        assert.equal(evaluateRowPredicate(p("eq"), COLS), false)
+        assert.equal(evaluateRowPredicate(p("ne"), COLS), true)
+    })
+
+    it("unwraps a stats-blob value before comparing", () => {
+        const cols = [
+            col({
+                name: "success",
+                kind: "evaluator",
+                value: {type: "binary", freq: [{value: true, density: 1}]},
+            }),
+        ]
+        assert.equal(
+            evaluateRowPredicate(
+                {groupKind: "evaluator", columnName: "success", op: "eq", value: true},
+                cols,
+            ),
+            true,
+        )
+    })
+})
+
+// =============================================================================
+// evaluatePredicateGroup — flat AND / OR
+// =============================================================================
+
+describe("evaluatePredicateGroup", () => {
+    const pass: RowPredicate = {
+        groupKind: "evaluator",
+        columnName: "success",
+        op: "eq",
+        value: true,
+    }
+    const fail: RowPredicate = {groupKind: "evaluator", columnName: "score", op: "gt", value: 999}
+
+    it("AND — every condition must match", () => {
+        assert.equal(evaluatePredicateGroup({op: "and", conditions: [pass, pass]}, COLS), true)
+        assert.equal(evaluatePredicateGroup({op: "and", conditions: [pass, fail]}, COLS), false)
+    })
+
+    it("OR — at least one condition must match", () => {
+        assert.equal(evaluatePredicateGroup({op: "or", conditions: [fail, pass]}, COLS), true)
+        assert.equal(evaluatePredicateGroup({op: "or", conditions: [fail, fail]}, COLS), false)
+    })
+
+    it("an empty group is no constraint — passes for both ops", () => {
+        assert.equal(evaluatePredicateGroup({op: "and", conditions: []}, COLS), true)
+        assert.equal(evaluatePredicateGroup({op: "or", conditions: []}, COLS), true)
+    })
+})
+
+// =============================================================================
+// evaluateRowFilter — dispatch + isPredicateGroup
+// =============================================================================
+
+describe("evaluateRowFilter / isPredicateGroup", () => {
+    it("isPredicateGroup distinguishes a group from a single predicate", () => {
+        const single: RowPredicate = {
+            groupKind: "evaluator",
+            columnName: "success",
+            op: "eq",
+            value: true,
+        }
+        const group: PredicateGroup = {op: "and", conditions: [single]}
+        assert.equal(isPredicateGroup(single), false)
+        assert.equal(isPredicateGroup(group), true)
+    })
+
+    it("evaluates a single predicate or a group transparently", () => {
+        const single: RowPredicate = {
+            groupKind: "evaluator",
+            columnName: "success",
+            op: "eq",
+            value: true,
+        }
+        assert.equal(evaluateRowFilter(single, COLS), true)
+        assert.equal(evaluateRowFilter({op: "or", conditions: [single]}, COLS), true)
+    })
+})
+
+// =============================================================================
+// matchesRowFilter — resolve schema, then evaluate
+// =============================================================================
+
+const ANNOTATION_SCHEMA: RunSchema = {
+    steps: [
+        {key: "eval", type: "annotation", references: {evaluator: {id: "e1", slug: "exact-match"}}},
+    ],
+    mappings: [{column: {kind: "annotation", name: "success"}, step: {key: "eval", path: "out"}}],
+}
+
+function annotationRow(success: boolean): HydratedScenarioRow {
+    return {
+        scenario: {id: "s1", status: "success"},
+        results: [],
+        // resolveFromMetric only reads `m.data` — a minimal metric is enough.
+        metrics: [{data: {eval: {out: success}}}] as unknown as HydratedScenarioRow["metrics"],
+        testcase: null,
+        traces: {},
+    }
+}
+
+describe("matchesRowFilter", () => {
+    it("resolves the run schema then evaluates the filter", () => {
+        const filter: PredicateGroup = {
+            op: "and",
+            conditions: [{groupKind: "evaluator", columnName: "success", op: "eq", value: true}],
+        }
+        assert.equal(matchesRowFilter(filter, ANNOTATION_SCHEMA, annotationRow(true)), true)
+        assert.equal(matchesRowFilter(filter, ANNOTATION_SCHEMA, annotationRow(false)), false)
+    })
+})
+
+// =============================================================================
+// makePredicateGroupFilter — pipeline transform
+// =============================================================================
+
+describe("makePredicateGroupFilter", () => {
+    it("keeps only rows the filter matches", async () => {
+        const transform = makePredicateGroupFilter({
+            filter: {
+                op: "or",
+                conditions: [
+                    {groupKind: "evaluator", columnName: "success", op: "eq", value: true},
+                ],
+            },
+            schema: ANNOTATION_SCHEMA,
+        })
+        const chunk: Chunk<HydratedScenarioRow> = {
+            items: [annotationRow(true), annotationRow(false), annotationRow(true)],
+            cursor: null,
+        }
+        const out = await transform(chunk)
+        assert.equal(out.items.length, 2)
+    })
+})
+
+// =============================================================================
+// predicateToEntitySlices — union across a group's conditions
+// =============================================================================
+
+const MIXED_SCHEMA: RunSchema = {
+    steps: [
+        {key: "in", type: "input", references: {testset: {id: "t1", slug: "ts"}}},
+        {key: "ev", type: "annotation", references: {evaluator: {id: "e1", slug: "em"}}},
+    ],
+    mappings: [
+        {column: {kind: "input", name: "question"}, step: {key: "in", path: "data.question"}},
+        {column: {kind: "annotation", name: "success"}, step: {key: "ev", path: "out"}},
+    ],
+}
+
+describe("predicateToEntitySlices — group input", () => {
+    const testsetCond: RowPredicate = {
+        groupKind: "testset",
+        columnName: "question",
+        op: "eq",
+        value: "x",
+    }
+    const evaluatorCond: RowPredicate = {
+        groupKind: "evaluator",
+        columnName: "success",
+        op: "eq",
+        value: true,
+    }
+
+    it("takes the union of every condition's slices", () => {
+        const group: PredicateGroup = {op: "and", conditions: [testsetCond, evaluatorCond]}
+        const {slices} = predicateToEntitySlices(MIXED_SCHEMA, group)
+        // testset → results + testcases; evaluator → results + metrics.
+        assert.deepEqual([...slices].sort(), ["metrics", "results", "testcases"])
+    })
+
+    it("the boolean operator does not change the slice set", () => {
+        const and = predicateToEntitySlices(MIXED_SCHEMA, {
+            op: "and",
+            conditions: [testsetCond, evaluatorCond],
+        })
+        const or = predicateToEntitySlices(MIXED_SCHEMA, {
+            op: "or",
+            conditions: [testsetCond, evaluatorCond],
+        })
+        assert.deepEqual([...and.slices].sort(), [...or.slices].sort())
+    })
+
+    it("an empty group needs no slices", () => {
+        const {slices} = predicateToEntitySlices(MIXED_SCHEMA, {op: "and", conditions: []})
+        assert.equal(slices.size, 0)
+    })
+})
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/resolveMappings.test.ts b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/resolveMappings.test.ts
new file mode 100644
index 0000000000..e8cae5e683
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/__tests__/resolveMappings.test.ts
@@ -0,0 +1,744 @@
+/**
+ * resolveMappings — unit tests covering known shapes + extensibility.
+ *
+ * Run this suite before shipping any change to resolver shapes or step
+ * types. When a new shape turns up in the wild, add a case here so the
+ * resolver is fixed once, with a regression test, instead of being
+ * patched ad hoc each time the shape reappears.
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import type {HydratedScenarioRow, HydratableScenario} from "../hydrateScenariosTransform"
+import {
+    DEFAULT_STEP_RESOLVERS,
+    composeResolvers,
+    computeColumnGroup,
+    findInTrace,
+    getAtPath,
+    groupResolvedColumns,
+    resolveMappings,
+    type RunSchema,
+    type StepResolver,
+} from "../resolveMappings"
+
+interface TestScenario extends HydratableScenario {
+    id: string
+    status: string
+    testcase_id?: string | null
+}
+
+function makeRow(overrides: Partial<HydratedScenarioRow<TestScenario>> = {}) {
+    const base: HydratedScenarioRow<TestScenario> = {
+        scenario: {id: "scen1", status: "success", testcase_id: null},
+        results: [],
+        metrics: [],
+        testcase: null,
+        traces: {},
+    }
+    return {...base, ...overrides}
+}
+
+// =============================================================================
+// getAtPath — dot-path traversal
+// =============================================================================
+
+describe("getAtPath", () => {
+    it("returns nested values", () => {
+        assert.equal(getAtPath({a: {b: {c: 7}}}, "a.b.c"), 7)
+    })
+
+    it("returns undefined for missing intermediate", () => {
+        assert.equal(getAtPath({a: {}}, "a.b.c"), undefined)
+    })
+
+    it("returns undefined for non-object intermediate", () => {
+        assert.equal(getAtPath({a: 1}, "a.b"), undefined)
+    })
+
+    it("returns undefined for empty path", () => {
+        assert.equal(getAtPath({a: 1}, ""), undefined)
+    })
+
+    it("returns undefined for null/undefined input", () => {
+        assert.equal(getAtPath(null, "a"), undefined)
+        assert.equal(getAtPath(undefined, "a"), undefined)
+    })
+})
+
+// =============================================================================
+// findInTrace — multi-shape trace navigation
+// =============================================================================
+
+describe("findInTrace", () => {
+    const path = "attributes.ag.data.outputs"
+    const leaf = "some output"
+
+    it("Shape A: {spans: {<name>: span}} (bulk fetch)", () => {
+        const trace = {
+            spans: {
+                completion_v0: {
+                    attributes: {ag: {data: {outputs: leaf}}},
+                },
+            },
+        }
+        assert.equal(findInTrace(trace, path), leaf)
+    })
+
+    it("Shape A nested: data lives on a child span, not the root", () => {
+        const trace = {
+            spans: {
+                completion_v0: {
+                    span_name: "root",
+                    spans: {
+                        litellm_client: {
+                            attributes: {ag: {data: {outputs: leaf}}},
+                        },
+                    },
+                },
+            },
+        }
+        assert.equal(findInTrace(trace, path), leaf)
+    })
+
+    it("Shape B: span array under .spans", () => {
+        const trace = {
+            spans: [{attributes: {ag: {data: {outputs: leaf}}}}],
+        }
+        assert.equal(findInTrace(trace, path), leaf)
+    })
+
+    it("Shape C: {response: {tree: [span]}} (agenta-format wrapped)", () => {
+        const trace = {
+            response: {
+                tree: [{attributes: {ag: {data: {outputs: leaf}}}}],
+            },
+        }
+        assert.equal(findInTrace(trace, path), leaf)
+    })
+
+    it("Shape D: envelope IS a span", () => {
+        const trace = {attributes: {ag: {data: {outputs: leaf}}}}
+        assert.equal(findInTrace(trace, path), leaf)
+    })
+
+    it("descends through .children arrays", () => {
+        const trace = {
+            span_name: "root",
+            children: [
+                {
+                    span_name: "child1",
+                    children: [{attributes: {ag: {data: {outputs: leaf}}}}],
+                },
+            ],
+        }
+        assert.equal(findInTrace(trace, path), leaf)
+    })
+
+    it("returns undefined when nothing matches", () => {
+        assert.equal(findInTrace({spans: {root: {other: "thing"}}}, path), undefined)
+    })
+
+    it("returns undefined for non-object input", () => {
+        assert.equal(findInTrace(null, path), undefined)
+        assert.equal(findInTrace("string", path), undefined)
+    })
+})
+
+// =============================================================================
+// Built-in resolvers (via DEFAULT_STEP_RESOLVERS)
+// =============================================================================
+
+describe("input step → resolveFromTestcase", () => {
+    const schema: RunSchema = {
+        steps: [{key: "testset-1", type: "input"}],
+        mappings: [
+            {
+                column: {kind: "testset", name: "country"},
+                step: {key: "testset-1", path: "data.country"},
+            },
+        ],
+    }
+
+    it("resolves when testcase is present", () => {
+        const row = makeRow({
+            testcase: {
+                id: "tc1",
+                data: {country: "USA"},
+            } as unknown as HydratedScenarioRow<TestScenario>["testcase"],
+        })
+        const [col] = resolveMappings(row, schema)
+        assert.equal(col.value, "USA")
+        assert.equal(col.source, "testcase")
+    })
+
+    it("returns missing when testcase is null", () => {
+        const row = makeRow({testcase: null})
+        const [col] = resolveMappings(row, schema)
+        assert.equal(col.value, undefined)
+        assert.equal(col.source, "missing")
+    })
+
+    it("returns missing when path not present", () => {
+        const row = makeRow({
+            testcase: {
+                id: "tc1",
+                data: {},
+            } as unknown as HydratedScenarioRow<TestScenario>["testcase"],
+        })
+        const [col] = resolveMappings(row, schema)
+        assert.equal(col.value, undefined)
+    })
+})
+
+describe("invocation step → resolveFromTrace", () => {
+    const schema: RunSchema = {
+        steps: [{key: "app-1", type: "invocation"}],
+        mappings: [
+            {
+                column: {kind: "invocation", name: "outputs"},
+                step: {key: "app-1", path: "attributes.ag.data.outputs"},
+            },
+        ],
+    }
+
+    it("resolves via the trace pointed to by result.trace_id", () => {
+        const row = makeRow({
+            results: [
+                {
+                    run_id: "r1",
+                    scenario_id: "scen1",
+                    step_key: "app-1",
+                    trace_id: "trace-abc",
+                    status: "success",
+                },
+            ],
+            traces: {
+                "trace-abc": {
+                    spans: {
+                        completion_v0: {
+                            attributes: {ag: {data: {outputs: "the answer"}}},
+                        },
+                    },
+                },
+            },
+        })
+        const [col] = resolveMappings(row, schema)
+        assert.equal(col.value, "the answer")
+        assert.equal(col.source, "trace")
+        assert.equal(col.stepType, "invocation")
+    })
+
+    it("returns missing when no result for the step", () => {
+        const row = makeRow({results: []})
+        const [col] = resolveMappings(row, schema)
+        assert.equal(col.source, "missing")
+    })
+
+    it("returns missing when trace not in row.traces", () => {
+        const row = makeRow({
+            results: [
+                {
+                    run_id: "r1",
+                    scenario_id: "scen1",
+                    step_key: "app-1",
+                    trace_id: "trace-abc",
+                    status: "success",
+                },
+            ],
+            // traces map empty → no entry for "trace-abc"
+        })
+        const [col] = resolveMappings(row, schema)
+        assert.equal(col.source, "missing")
+    })
+})
+
+describe("annotation step → composeResolvers(metric, trace)", () => {
+    const schema: RunSchema = {
+        steps: [{key: "eval-1", type: "annotation"}],
+        mappings: [
+            {
+                column: {kind: "annotation", name: "success"},
+                step: {key: "eval-1", path: "attributes.ag.data.outputs.success"},
+            },
+        ],
+    }
+
+    it("prefers metric.data when present (flat key, not dot-walk)", () => {
+        const row = makeRow({
+            results: [
+                {
+                    run_id: "r1",
+                    scenario_id: "scen1",
+                    step_key: "eval-1",
+                    trace_id: "trace-eval",
+                    status: "success",
+                },
+            ],
+            metrics: [
+                {
+                    id: "m1",
+                    run_id: "r1",
+                    data: {
+                        "eval-1": {
+                            "attributes.ag.data.outputs.success": {
+                                type: "binary",
+                                freq: [{value: false, density: 1}],
+                            },
+                        },
+                    },
+                } as unknown as HydratedScenarioRow<TestScenario>["metrics"][number],
+            ],
+            traces: {
+                "trace-eval": {
+                    spans: {
+                        eval_v0: {
+                            attributes: {ag: {data: {outputs: {success: false}}}},
+                        },
+                    },
+                },
+            },
+        })
+        const [col] = resolveMappings(row, schema)
+        assert.equal(col.source, "metric")
+        // The stats blob is returned as-is (not unwrapped) — that's the wire shape
+        const v = col.value as {type: string; freq: unknown[]}
+        assert.equal(v.type, "binary")
+    })
+
+    it("falls back to trace when metric is missing or has no bucket for the step", () => {
+        const row = makeRow({
+            results: [
+                {
+                    run_id: "r1",
+                    scenario_id: "scen1",
+                    step_key: "eval-1",
+                    trace_id: "trace-eval",
+                    status: "success",
+                },
+            ],
+            metrics: [], // no metric — fall through to trace
+            traces: {
+                "trace-eval": {
+                    spans: {
+                        eval_v0: {attributes: {ag: {data: {outputs: {success: true}}}}},
+                    },
+                },
+            },
+        })
+        const [col] = resolveMappings(row, schema)
+        assert.equal(col.source, "trace")
+        assert.equal(col.value, true)
+    })
+
+    it("returns missing when neither metric nor trace has the path", () => {
+        const row = makeRow({
+            results: [
+                {
+                    run_id: "r1",
+                    scenario_id: "scen1",
+                    step_key: "eval-1",
+                    trace_id: "trace-eval",
+                    status: "success",
+                },
+            ],
+            traces: {"trace-eval": {spans: {eval_v0: {other: "thing"}}}},
+        })
+        const [col] = resolveMappings(row, schema)
+        assert.equal(col.source, "missing")
+    })
+})
+
+// =============================================================================
+// Extensibility — custom step types
+// =============================================================================
+
+describe("customResolvers extensibility", () => {
+    it("a new step.type can be added without editing the registry", () => {
+        const schema: RunSchema = {
+            steps: [{key: "custom-1", type: "my_custom"}],
+            mappings: [{column: {kind: "custom", name: "x"}, step: {key: "custom-1", path: "x"}}],
+        }
+        const row = makeRow()
+
+        // Without a custom resolver: missing with descriptive source
+        const [col1] = resolveMappings(row, schema)
+        assert.equal(col1.value, undefined)
+        assert.match(col1.source, /no resolver for step\.type="my_custom"/)
+
+        // With a custom resolver
+        const myResolver: StepResolver = () => ({value: 42, source: "custom-magic"})
+        const [col2] = resolveMappings(row, schema, {customResolvers: {my_custom: myResolver}})
+        assert.equal(col2.value, 42)
+        assert.equal(col2.source, "custom-magic")
+    })
+
+    it("customResolvers can override a built-in", () => {
+        const schema: RunSchema = {
+            steps: [{key: "testset-1", type: "input"}],
+            mappings: [
+                {
+                    column: {kind: "testset", name: "country"},
+                    step: {key: "testset-1", path: "data.country"},
+                },
+            ],
+        }
+        const row = makeRow({
+            testcase: {
+                id: "tc1",
+                data: {country: "USA"},
+            } as unknown as HydratedScenarioRow<TestScenario>["testcase"],
+        })
+
+        const override: StepResolver = () => ({value: "OVERRIDE", source: "override"})
+        const [col] = resolveMappings(row, schema, {customResolvers: {input: override}})
+        assert.equal(col.value, "OVERRIDE")
+        assert.equal(col.source, "override")
+    })
+
+    it("fallbackResolver is invoked for unknown step types when set", () => {
+        const schema: RunSchema = {
+            steps: [{key: "anything-1", type: "weird"}],
+            mappings: [{column: {kind: "?", name: "x"}, step: {key: "anything-1", path: "p"}}],
+        }
+        const row = makeRow()
+        const fallback: StepResolver = () => ({value: "fallback-val", source: "fallback"})
+        const [col] = resolveMappings(row, schema, {fallbackResolver: fallback})
+        assert.equal(col.value, "fallback-val")
+        assert.equal(col.source, "fallback")
+    })
+})
+
+describe("composeResolvers", () => {
+    it("returns the first non-null", () => {
+        const a: StepResolver = () => null
+        const b: StepResolver = () => ({value: "b", source: "B"})
+        const c: StepResolver = () => ({value: "c", source: "C"})
+        const composed = composeResolvers(a, b, c)
+        const out = composed({
+            step: {key: "k", type: "t"},
+            result: undefined,
+            row: makeRow(),
+            path: "",
+        })
+        assert.deepEqual(out, {value: "b", source: "B"})
+    })
+
+    it("returns null when all return null", () => {
+        const composed = composeResolvers(
+            () => null,
+            () => null,
+        )
+        const out = composed({
+            step: {key: "k", type: "t"},
+            result: undefined,
+            row: makeRow(),
+            path: "",
+        })
+        assert.equal(out, null)
+    })
+})
+
+// =============================================================================
+// Edge cases
+// =============================================================================
+
+// =============================================================================
+// Column grouping — namespacing for multiple evaluators + metrics override
+// =============================================================================
+
+describe("computeColumnGroup", () => {
+    it("input step → testset group keyed by testset.slug", () => {
+        const g = computeColumnGroup(
+            {
+                key: "testset-x",
+                type: "input",
+                references: {testset: {id: "t1", slug: "my-testset"}},
+            },
+            "data.country",
+        )
+        assert.equal(g.kind, "testset")
+        assert.equal(g.slug, "my-testset")
+        assert.equal(g.label, "Testset my-testset")
+        assert.equal(g.key, "testset:my-testset")
+    })
+
+    it("invocation step → application group keyed by application.slug", () => {
+        const g = computeColumnGroup(
+            {
+                key: "app-x",
+                type: "invocation",
+                references: {application: {id: "a1", slug: "comp-1"}},
+            },
+            "attributes.ag.data.outputs",
+        )
+        assert.equal(g.kind, "application")
+        assert.equal(g.slug, "comp-1")
+        assert.equal(g.label, "Application comp-1")
+    })
+
+    it("annotation step → evaluator group titlecased from slug", () => {
+        const g = computeColumnGroup(
+            {
+                key: "eval-x",
+                type: "annotation",
+                references: {evaluator: {id: "e1", slug: "exact-match"}},
+            },
+            "attributes.ag.data.outputs.success",
+        )
+        assert.equal(g.kind, "evaluator")
+        assert.equal(g.slug, "exact-match")
+        assert.equal(g.label, "Exact Match", "slug 'exact-match' → 'Exact Match'")
+    })
+
+    it("two annotation steps with same column name get distinct groups", () => {
+        const g1 = computeColumnGroup(
+            {
+                key: "eval-1",
+                type: "annotation",
+                references: {evaluator: {id: "e-exact", slug: "exact-match"}},
+            },
+            "attributes.ag.data.outputs.success",
+        )
+        const g2 = computeColumnGroup(
+            {
+                key: "eval-2",
+                type: "annotation",
+                references: {evaluator: {id: "e-fuzzy", slug: "fuzzy-match"}},
+            },
+            "attributes.ag.data.outputs.success",
+        )
+        // Same column NAME, different group KEY — no collision.
+        assert.notEqual(g1.key, g2.key)
+    })
+
+    it("metrics path overrides step type — goes to Metrics group", () => {
+        // An invocation-step mapping pointing at attributes.ag.metrics.* still
+        // belongs to the cross-cutting Metrics group (per UI layout).
+        const g = computeColumnGroup(
+            {
+                key: "app-x",
+                type: "invocation",
+                references: {application: {id: "a-comp-1", slug: "comp-1"}},
+            },
+            "attributes.ag.metrics.tokens.cumulative.total",
+        )
+        assert.equal(g.kind, "metrics")
+        assert.equal(g.label, "Metrics")
+        assert.equal(g.key, "metrics")
+    })
+
+    it("missing step → 'other' group", () => {
+        const g = computeColumnGroup(null, "anything")
+        assert.equal(g.kind, "other")
+    })
+
+    it("references fallback: testset_revision.slug if testset.slug absent", () => {
+        const g = computeColumnGroup(
+            {
+                key: "k",
+                type: "input",
+                references: {testset_revision: {id: "tr-rev-abc", slug: "rev-abc"}},
+            },
+            "data.x",
+        )
+        assert.equal(g.kind, "testset")
+        assert.equal(g.slug, "rev-abc")
+    })
+})
+
+describe("groupResolvedColumns", () => {
+    it("groups columns by group.key preserving mapping order within a group", () => {
+        const schema: RunSchema = {
+            steps: [
+                {
+                    key: "testset-1",
+                    type: "input",
+                    references: {testset: {id: "t1", slug: "my-testset"}},
+                },
+                {
+                    key: "eval-1",
+                    type: "annotation",
+                    references: {evaluator: {id: "e-exact", slug: "exact-match"}},
+                },
+                {
+                    key: "eval-2",
+                    type: "annotation",
+                    references: {evaluator: {id: "e-fuzzy", slug: "fuzzy-match"}},
+                },
+            ],
+            mappings: [
+                {
+                    column: {kind: "testset", name: "country"},
+                    step: {key: "testset-1", path: "data.country"},
+                },
+                {
+                    column: {kind: "annotation", name: "success"},
+                    step: {key: "eval-1", path: "attributes.ag.data.outputs.success"},
+                },
+                {
+                    column: {kind: "annotation", name: "success"},
+                    step: {key: "eval-2", path: "attributes.ag.data.outputs.success"},
+                },
+                {
+                    column: {kind: "testset", name: "expected"},
+                    step: {key: "testset-1", path: "data.expected"},
+                },
+            ],
+        }
+        const row = {
+            scenario: {id: "s1", status: "success"},
+            results: [],
+            metrics: [],
+            testcase: null,
+            traces: {},
+        } as unknown as Parameters<typeof resolveMappings>[0]
+        const cols = resolveMappings(row, schema)
+        const groups = groupResolvedColumns(cols)
+
+        assert.equal(groups.length, 3, "3 groups: 1 testset, 2 evaluators")
+        // Order: testset first, then evaluators in first-appearance order
+        assert.equal(groups[0].group.kind, "testset")
+        assert.equal(groups[0].group.label, "Testset my-testset")
+        assert.equal(groups[0].columns.length, 2)
+        assert.equal(groups[0].columns[0].name, "country")
+        assert.equal(groups[0].columns[1].name, "expected")
+        assert.equal(groups[1].group.label, "Exact Match")
+        assert.equal(groups[2].group.label, "Fuzzy Match")
+        // Both evaluators have a "success" column but they're in separate groups
+        assert.equal(groups[1].columns[0].name, "success")
+        assert.equal(groups[2].columns[0].name, "success")
+    })
+
+    it("metrics paths land in one 'Metrics' group, separate from the step's own group", () => {
+        const schema: RunSchema = {
+            steps: [
+                {
+                    key: "app-1",
+                    type: "invocation",
+                    references: {application: {id: "a-comp-1", slug: "comp-1"}},
+                },
+            ],
+            mappings: [
+                {
+                    column: {kind: "invocation", name: "outputs"},
+                    step: {key: "app-1", path: "attributes.ag.data.outputs"},
+                },
+                {
+                    column: {kind: "invocation", name: "tokens"},
+                    step: {key: "app-1", path: "attributes.ag.metrics.tokens.cumulative.total"},
+                },
+                {
+                    column: {kind: "invocation", name: "cost"},
+                    step: {key: "app-1", path: "attributes.ag.metrics.costs.cumulative.total"},
+                },
+            ],
+        }
+        const row = {
+            scenario: {id: "s1", status: "success"},
+            results: [],
+            metrics: [],
+            testcase: null,
+            traces: {},
+        } as unknown as Parameters<typeof resolveMappings>[0]
+        const cols = resolveMappings(row, schema)
+        const groups = groupResolvedColumns(cols)
+
+        // application group + metrics group
+        assert.equal(groups.length, 2)
+        assert.equal(groups[0].group.kind, "application")
+        assert.equal(groups[1].group.kind, "metrics")
+        assert.equal(groups[1].columns.length, 2)
+        assert.deepEqual(groups[1].columns.map((c) => c.name).sort(), ["cost", "tokens"])
+    })
+
+    it("group ordering: testset → application → evaluator → metrics → other", () => {
+        const schema: RunSchema = {
+            steps: [
+                {key: "ts", type: "input", references: {testset: {id: "ts1", slug: "ts1"}}},
+                {key: "app", type: "invocation", references: {application: {id: "a1", slug: "a1"}}},
+                {key: "ev", type: "annotation", references: {evaluator: {id: "e1", slug: "e1"}}},
+            ],
+            // Intentionally out-of-order in mappings to verify the sort
+            mappings: [
+                {
+                    column: {kind: "annotation", name: "success"},
+                    step: {key: "ev", path: "attributes.ag.data.outputs.success"},
+                },
+                {
+                    column: {kind: "invocation", name: "tokens"},
+                    step: {key: "app", path: "attributes.ag.metrics.tokens.cumulative.total"},
+                },
+                {
+                    column: {kind: "testset", name: "country"},
+                    step: {key: "ts", path: "data.country"},
+                },
+                {
+                    column: {kind: "invocation", name: "outputs"},
+                    step: {key: "app", path: "attributes.ag.data.outputs"},
+                },
+            ],
+        }
+        const row = {
+            scenario: {id: "s1", status: "success"},
+            results: [],
+            metrics: [],
+            testcase: null,
+            traces: {},
+        } as unknown as Parameters<typeof resolveMappings>[0]
+        const cols = resolveMappings(row, schema)
+        const groups = groupResolvedColumns(cols)
+        assert.deepEqual(
+            groups.map((g) => g.group.kind),
+            ["testset", "application", "evaluator", "metrics"],
+        )
+    })
+})
+
+describe("edge cases", () => {
+    it("schema with no mappings → empty result", () => {
+        const cols = resolveMappings(makeRow(), {steps: [], mappings: []})
+        assert.deepEqual(cols, [])
+    })
+
+    it("mapping referring to unknown step key → missing", () => {
+        const schema: RunSchema = {
+            steps: [],
+            mappings: [{column: {kind: "x", name: "y"}, step: {key: "nope", path: "p"}}],
+        }
+        const [col] = resolveMappings(makeRow(), schema)
+        assert.equal(col.source, "missing")
+        assert.equal(col.stepType, "?")
+    })
+
+    it("preserves mapping order", () => {
+        const schema: RunSchema = {
+            steps: [
+                {key: "a", type: "input"},
+                {key: "b", type: "input"},
+            ],
+            mappings: [
+                {column: {kind: "testset", name: "second"}, step: {key: "b", path: "data.b"}},
+                {column: {kind: "testset", name: "first"}, step: {key: "a", path: "data.a"}},
+            ],
+        }
+        const row = makeRow({
+            testcase: {
+                id: "tc",
+                data: {a: 1, b: 2},
+            } as unknown as HydratedScenarioRow<TestScenario>["testcase"],
+        })
+        const cols = resolveMappings(row, schema)
+        assert.equal(cols[0].name, "second")
+        assert.equal(cols[0].value, 2)
+        assert.equal(cols[1].name, "first")
+        assert.equal(cols[1].value, 1)
+    })
+
+    it("DEFAULT_STEP_RESOLVERS contains the three built-in types", () => {
+        assert.ok(typeof DEFAULT_STEP_RESOLVERS.input === "function")
+        assert.ok(typeof DEFAULT_STEP_RESOLVERS.invocation === "function")
+        assert.ok(typeof DEFAULT_STEP_RESOLVERS.annotation === "function")
+    })
+})
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/cacheAwareFetchers.ts b/web/packages/agenta-entities/src/evaluationRun/etl/cacheAwareFetchers.ts
new file mode 100644
index 0000000000..6d372d4164
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/cacheAwareFetchers.ts
@@ -0,0 +1,144 @@
+/**
+ * Molecule-backed `HydrateFetchers` — the proper entity-layer integration.
+ *
+ * Each of the four entity types the hydrate transform needs now has a
+ * cache-aware prefetch action on (or alongside) its molecule:
+ *
+ *   - results    →  evaluationResultMolecule.actions.prefetchByScenarioIds
+ *   - metrics    →  evaluationMetricMolecule.actions.prefetchByScenarioIds
+ *   - testcases  →  prefetchTestcasesByIds (testcase/state/prefetch)
+ *   - traces     →  prefetchTracesByIds   (trace/state/prefetch)
+ *
+ * Every action:
+ *   1. Reads from the shared TanStack Query cache for each requested id
+ *   2. Bulk-fetches only the misses
+ *   3. Writes new rows back to cache (including empties, so we don't
+ *      re-fetch scenarios that genuinely have no data)
+ *   4. Returns a `{cacheHits, cacheMisses, fetchMs}` stat block
+ *
+ * The hydrate transform doesn't need to know any of this — it just calls
+ * `fetchers.fetch*` and receives `HydrateFetchers`-shaped output. The
+ * adapter here glues the molecule outcomes (rich) to the fetcher
+ * contract (flat) and emits cache stats via `onCacheStats` if provided.
+ *
+ * @packageDocumentation
+ */
+
+import {prefetchTestcasesByIds} from "../../testcase/state/prefetch"
+import {prefetchTracesByIds} from "../../trace/state/prefetch"
+import {evaluationMetricMolecule} from "../state/metricMolecule"
+import {evaluationResultMolecule} from "../state/resultMolecule"
+
+import type {HydrateFetchers} from "./hydrateScenariosTransform"
+
+/**
+ * Stats one entity type emitted during a single chunk hydration.
+ */
+export interface EntityCacheStats {
+    cacheHits: number
+    cacheMisses: number
+    fetchMs: number
+}
+
+/**
+ * Per-chunk cache stats across all four entity types.
+ */
+export interface ChunkCacheStats {
+    results: EntityCacheStats
+    metrics: EntityCacheStats
+    testcases: EntityCacheStats
+    traces: EntityCacheStats
+}
+
+export interface BuildMoleculeFetchersOptions {
+    /**
+     * Optional sink for per-chunk cache stats. Called exactly once per
+     * `fetch*` invocation. Use to surface cache hit ratios in observability.
+     */
+    onCacheStats?: (entity: keyof ChunkCacheStats, stats: EntityCacheStats) => void
+}
+
+/**
+ * Build a HydrateFetchers that routes every fetch through the molecule
+ * layer. Each call emits cache stats via the optional callback.
+ */
+export function buildMoleculeBackedFetchers(
+    options: BuildMoleculeFetchersOptions = {},
+): HydrateFetchers {
+    const emit = options.onCacheStats
+
+    return {
+        fetchResults: async ({projectId, runId, scenarioIds}) => {
+            const out = await evaluationResultMolecule.actions.prefetchByScenarioIds({
+                projectId,
+                runId,
+                scenarioIds,
+            })
+            emit?.("results", {
+                cacheHits: out.cacheHits,
+                cacheMisses: out.cacheMisses,
+                fetchMs: out.fetchMs,
+            })
+            return out.results
+        },
+
+        fetchMetrics: async ({projectId, runId, scenarioIds}) => {
+            const out = await evaluationMetricMolecule.actions.prefetchByScenarioIds({
+                projectId,
+                runId,
+                scenarioIds,
+            })
+            emit?.("metrics", {
+                cacheHits: out.cacheHits,
+                cacheMisses: out.cacheMisses,
+                fetchMs: out.fetchMs,
+            })
+            return out.metrics
+        },
+
+        fetchTestcases: async ({projectId, testcaseIds}) => {
+            const out = await prefetchTestcasesByIds({projectId, testcaseIds})
+            emit?.("testcases", {
+                cacheHits: out.cacheHits,
+                cacheMisses: out.cacheMisses,
+                fetchMs: out.fetchMs,
+            })
+            return out.testcases
+        },
+
+        fetchTraces: async ({projectId, traceIds}) => {
+            const out = await prefetchTracesByIds({projectId, traceIds})
+            emit?.("traces", {
+                cacheHits: out.cacheHits,
+                cacheMisses: out.cacheMisses,
+                fetchMs: out.fetchMs,
+            })
+            // Pass the TracesApiResponse envelope through unchanged. The
+            // envelope shape `{count, traces: {[traceIdNoDashes]: traceData}}`
+            // is the documented contract for the shared
+            // `["trace-entity", projectId, traceId]` cache key and is what
+            // every other consumer (traceEntityAtomFamily, EvalRunDetails)
+            // expects. `findInTrace` knows how to drill through it
+            // (resolveMappings.ts case 3), so the hydrate pipeline doesn't
+            // need to pre-unwrap.
+            const flat = new Map<string, unknown>()
+            out.traces.forEach((envelope, traceId) => flat.set(traceId, envelope))
+            return flat
+        },
+    }
+}
+
+/**
+ * Default cache-aware fetchers (no stats emission). For the common case
+ * where you just want cache integration without observability.
+ */
+export const MOLECULE_BACKED_HYDRATE_FETCHERS: HydrateFetchers = buildMoleculeBackedFetchers()
+
+/**
+ * @deprecated Use `MOLECULE_BACKED_HYDRATE_FETCHERS` instead. Kept for one
+ * release as an alias so PoC scripts don't break.
+ */
+export const CACHE_AWARE_HYDRATE_FETCHERS = MOLECULE_BACKED_HYDRATE_FETCHERS
+
+// Backward-compat re-export — the old single-fn API still exists.
+export {prefetchTestcasesByIds as cacheAwareFetchTestcases}
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/cacheDiagnostics.ts b/web/packages/agenta-entities/src/evaluationRun/etl/cacheDiagnostics.ts
new file mode 100644
index 0000000000..067d4ed978
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/cacheDiagnostics.ts
@@ -0,0 +1,174 @@
+/**
+ * Diagnostic helpers for inspecting the shared TanStack Query cache.
+ *
+ * The `inspect*` helpers are side-effect-free: they walk the cache and report
+ * what's there. Only `clearCacheByPrefix` mutates it. Use from PoC scripts,
+ * observability surfaces, or long-run tests to bound memory empirically.
+ *
+ * Caveats:
+ *   - "Bytes" is `JSON.stringify(data).length` — a rough proxy for in-memory
+ *     size, not a true heap measurement. Good for relative comparisons and
+ *     blow-up detection, not for accounting.
+ *   - The cache is process-wide. If multiple runs/scopes are active, you'll
+ *     see entries from all of them. Narrow the walk with the `prefixes`
+ *     option; it matches on the first key component only.
+ *
+ * @packageDocumentation
+ */
+
+import {getDefaultStore} from "jotai/vanilla"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import {
+    inspectAtomFamilies,
+    type AtomFamilyStats,
+} from "../../shared/molecule/instrumentedAtomFamily"
+
+export interface CacheSliceStats {
+    /** First component of the cache key — e.g. "evaluation-results", "trace-entity", "testcase". */
+    prefix: string
+    /** Number of cache entries in this slice. */
+    entries: number
+    /** Approximate JSON-byte cost of all entries in this slice. */
+    approxBytes: number
+    /** Largest single entry (bytes). Useful for spotting outliers. */
+    largestEntryBytes: number
+}
+
+export interface CacheDiagnostics {
+    totalEntries: number
+    totalApproxBytes: number
+    /** Per-prefix breakdown, sorted by approxBytes descending. */
+    slices: CacheSliceStats[]
+}
+
+function getQc() {
+    return getDefaultStore().get(queryClientAtom)
+}
+
+/**
+ * Default prefixes inspected by the diagnostic surface. Covers every TanStack
+ * cache key the entity layer writes to, including the span-level cache that
+ * `traceBatchFetcher` populates as a side-effect (separate from the trace-level
+ * cache entry the prefetch action writes).
+ *
+ * Updating this list is the right move when adding a new entity-cache prefix.
+ */
+export const DEFAULT_DIAGNOSTIC_PREFIXES = [
+    "evaluation-results",
+    "evaluation-metrics",
+    "testcase",
+    "trace-entity",
+    // Span-level cache — written by traceBatchFetcher when it materializes a
+    // trace response. Each span in a trace gets its own cache entry under
+    // `["span", projectId, spanId]`. Without this in the diagnostic list the
+    // per-trace cost is under-counted.
+    "span",
+] as const
+
+/**
+ * Walk the TanStack cache, return per-prefix entry counts and approximate
+ * byte sizes. Pass `prefixes` to restrict — defaults to `DEFAULT_DIAGNOSTIC_PREFIXES`.
+ */
+export function inspectCache(opts: {prefixes?: readonly string[]} = {}): CacheDiagnostics {
+    let qc: ReturnType<typeof getQc> | null = null
+    try {
+        qc = getQc()
+    } catch {
+        return {totalEntries: 0, totalApproxBytes: 0, slices: []}
+    }
+
+    const queries = qc.getQueryCache().getAll()
+    const bySlice = new Map<string, {entries: number; bytes: number; max: number}>()
+    const prefixes = opts.prefixes ?? DEFAULT_DIAGNOSTIC_PREFIXES
+
+    for (const q of queries) {
+        const key = q.queryKey
+        const prefix =
+            Array.isArray(key) && typeof key[0] === "string" ? (key[0] as string) : "(unknown)"
+        if (!prefixes.includes(prefix)) continue
+
+        const data = q.state.data
+        let bytes = 0
+        try {
+            bytes = data === undefined ? 0 : JSON.stringify(data).length
+        } catch {
+            bytes = 0
+        }
+
+        const slot = bySlice.get(prefix) ?? {entries: 0, bytes: 0, max: 0}
+        slot.entries++
+        slot.bytes += bytes
+        if (bytes > slot.max) slot.max = bytes
+        bySlice.set(prefix, slot)
+    }
+
+    const slices: CacheSliceStats[] = Array.from(bySlice.entries()).map(([prefix, s]) => ({
+        prefix,
+        entries: s.entries,
+        approxBytes: s.bytes,
+        largestEntryBytes: s.max,
+    }))
+    slices.sort((a, b) => b.approxBytes - a.approxBytes)
+
+    return {
+        totalEntries: slices.reduce((a, s) => a + s.entries, 0),
+        totalApproxBytes: slices.reduce((a, s) => a + s.approxBytes, 0),
+        slices,
+    }
+}
+
+/**
+ * Combined memory snapshot — TanStack cache + atom family sizes + heap.
+ *
+ * Useful as a one-liner in observability surfaces; produces a complete
+ * "how much is the entity layer holding right now" answer.
+ */
+export interface MemorySnapshot {
+    /** TanStack cache, per-prefix. */
+    cache: CacheDiagnostics
+    /** Active params per instrumented atom family. */
+    atomFamilies: AtomFamilyStats[]
+    /** Total params across every instrumented family — quick proxy for "atoms alive". */
+    totalAtomFamilyEntries: number
+    /** process.memoryUsage().heapUsed at snapshot time; 0 where it is unavailable (e.g. browsers). */
+    heapUsedBytes: number
+}
+
+export function inspectMemory(opts: {prefixes?: readonly string[]} = {}): MemorySnapshot {
+    const cache = inspectCache(opts)
+    const atomFamilies = inspectAtomFamilies()
+    const totalAtomFamilyEntries = atomFamilies.reduce((a, f) => a + f.size, 0)
+    return {
+        cache,
+        atomFamilies,
+        totalAtomFamilyEntries,
+        heapUsedBytes: typeof process !== "undefined" && typeof process.memoryUsage === "function" ? process.memoryUsage().heapUsed : 0,
+    }
+}
+
+/**
+ * Walk the cache and remove all entries matching any of the given prefixes.
+ * Returns the number of entries removed. Use this for explicit teardown in
+ * scripts or after a run finishes.
+ */
+export function clearCacheByPrefix(prefixes: string[]): number {
+    let qc: ReturnType<typeof getQc> | null = null
+    try {
+        qc = getQc()
+    } catch {
+        return 0
+    }
+    const cache = qc.getQueryCache()
+    const queries = cache.getAll()
+    let removed = 0
+    for (const q of queries) {
+        const key = q.queryKey
+        const prefix = Array.isArray(key) && typeof key[0] === "string" ? key[0] : null
+        if (prefix && prefixes.includes(prefix)) {
+            cache.remove(q)
+            removed++
+        }
+    }
+    return removed
+}
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/filterSchema.ts b/web/packages/agenta-entities/src/evaluationRun/etl/filterSchema.ts
new file mode 100644
index 0000000000..273063033e
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/filterSchema.ts
@@ -0,0 +1,133 @@
+/**
+ * filterSchema — derive the set of *filterable fields* for an evaluation
+ * run from its schema (steps + mappings).
+ *
+ * The filter UI (Phase 2 / T4) needs to know, before any scenario data is
+ * loaded: which columns can be filtered, what each one's value type is,
+ * and which operators that type allows. This module produces exactly
+ * that, keyed the same way `RowPredicate` targets a column
+ * (groupKind + groupSlug + columnName), so a UI selection maps straight
+ * onto a predicate.
+ *
+ * # Value typing
+ *
+ * The run schema does not carry per-column value types — those live in
+ * evaluator output schemas / sampled values, which this module does not
+ * fetch. So typing is best-effort:
+ *
+ *   - metrics columns        → "number" (cost / duration / tokens / scores)
+ *   - everything else        → "unknown"
+ *
+ * "unknown" still gets a safe equality-oriented operator set. Callers that
+ * *can* determine a precise type (e.g. T4 wiring with access to evaluator
+ * output schemas, or by sampling resolved values) pass `resolveValueType`
+ * to refine it — that is the intended extension seam, not an edit here.
+ *
+ * @packageDocumentation
+ */
+
+import {groupRunColumns, type ColumnGroup, type RunSchema} from "./resolveMappings"
+import type {RowPredicate} from "./rowPredicateFilter"
+
+/** Value type of a filterable field — drives the operator set. */
+export type FilterValueType = "string" | "number" | "boolean" | "unknown"
+
+/** All comparison operators a `RowPredicate` supports. */
+export type FilterOperator = RowPredicate["op"]
+
+/** A single field the user can filter on. */
+export interface FilterableField {
+    /** Targeting triple — maps directly onto `RowPredicate`. */
+    groupKind: ColumnGroup["kind"]
+    groupSlug: string | null
+    columnName: string
+    /** Display label for the field (the column name). */
+    label: string
+    /** Display label for the owning group (nested-header style). */
+    groupLabel: string
+    /** Best-effort value type — "unknown" when undeterminable from the schema. */
+    valueType: FilterValueType
+    /** Operators valid for this field's type. */
+    operators: FilterOperator[]
+}
+
+export interface FilterSchema {
+    fields: FilterableField[]
+}
+
+const OPERATORS_BY_TYPE: Record<FilterValueType, FilterOperator[]> = {
+    number: ["eq", "ne", "lt", "lte", "gt", "gte", "in", "nin"],
+    string: ["eq", "ne", "in", "nin"],
+    boolean: ["eq", "ne"],
+    // Undeterminable type — equality + membership are always safe; ordered
+    // comparisons are not, so they are withheld until the type is known.
+    unknown: ["eq", "ne", "in", "nin"],
+}
+
+/** The operator set valid for a given value type. */
+export function operatorsForType(type: FilterValueType): FilterOperator[] {
+    return [...OPERATORS_BY_TYPE[type]]
+}
+
+/** Schema-only default value type — metrics are numeric, the rest unknown. */
+function defaultValueType(kind: ColumnGroup["kind"]): FilterValueType {
+    return kind === "metrics" ? "number" : "unknown"
+}
+
+export interface BuildFilterSchemaOptions {
+    /**
+     * Refine a field's value type. Return `undefined` to keep the
+     * schema-only default. This is the seam for type information that does
+     * not live in the run schema — evaluator output schemas, sampled
+     * resolved values, etc.
+     */
+    resolveValueType?: (field: {
+        groupKind: ColumnGroup["kind"]
+        groupSlug: string | null
+        columnName: string
+    }) => FilterValueType | undefined
+}
+
+/**
+ * Build the filterable-field schema for a run. Fields appear in the same
+ * group order the table renders columns (testset → application →
+ * evaluator → metrics → other). Duplicate (groupKind, groupSlug,
+ * columnName) triples are collapsed to one field.
+ */
+export function buildFilterSchema(
+    schema: RunSchema | null,
+    options: BuildFilterSchemaOptions = {},
+): FilterSchema {
+    if (!schema) return {fields: []}
+
+    const groups = groupRunColumns(schema.steps, schema.mappings)
+    const fields: FilterableField[] = []
+    const seen = new Set<string>()
+
+    for (const g of groups) {
+        for (const leaf of g.columns) {
+            const dedupKey = `${leaf.kind}::${leaf.groupSlug ?? ""}::${leaf.name}`
+            if (seen.has(dedupKey)) continue
+            seen.add(dedupKey)
+
+            const hinted = options.resolveValueType?.({
+                groupKind: leaf.kind,
+                groupSlug: leaf.groupSlug,
+                columnName: leaf.name,
+            })
+            const valueType = hinted ?? defaultValueType(leaf.kind)
+
+            fields.push({
+                groupKind: leaf.kind,
+                groupSlug: leaf.groupSlug,
+                columnName: leaf.name,
+                label: leaf.name,
+                groupLabel: g.group.label,
+                valueType,
+                operators: operatorsForType(valueType),
+            })
+        }
+    }
+
+    return {fields}
+}
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/hitRatioMeter.ts b/web/packages/agenta-entities/src/evaluationRun/etl/hitRatioMeter.ts
new file mode 100644
index 0000000000..0443c1297c
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/hitRatioMeter.ts
@@ -0,0 +1,190 @@
+/**
+ * Hit-ratio meter — the v1→v2 escalation signal.
+ *
+ * # Why this exists
+ *
+ * The eval-filtering RFC (docs/designs/eval-filtering.md §D2 + §C3) defines
+ * a two-engine strategy:
+ *
+ *   - **v1** evaluates filter predicates **client-side**, over already-loaded
+ *     metric data. Cheap to ship, no backend work. Correct for high-hit-ratio
+ *     predicates where most rows pass and full materialization is cheap.
+ *
+ *   - **v2** evaluates filter predicates **server-side** via the
+ *     `scenarios/query` `filtering` parameter. Same wire format, transform
+ *     becomes a no-op. Required when the predicate is low-hit-ratio — the
+ *     "catastrophic case" of infinite-scroll fetching the whole run just to
+ *     fill a viewport.
+ *
+ * The decision between v1 and v2 is data-dependent: which engine should run
+ * THIS predicate against THIS dataset? The answer is encoded in the
+ * hit-ratio meter:
+ *
+ *   - Observe `(matched / scanned)` per chunk
+ *   - Roll the ratio over a window of N consecutive chunks
+ *   - When the rolling ratio falls below the threshold → recommend escalation
+ *
+ * The RFC's default policy: window = 3 chunks, threshold = 0.10. A pooled
+ * pass ratio below 10% over the last 3 chunks → escalate.
+ *
+ * # What this module does (today)
+ *
+ * It **reports the regime**. It does not swap engines. The PoC and any
+ * caller consume `regime()` and decide what to do with the recommendation
+ * — log it, surface a banner, swap the source, etc.
+ *
+ * v2 backend support is the next milestone. When it lands, the consumer
+ * pattern is: "regime === 'escalate' → next chunk's source request carries
+ * `filtering` payload, this transform becomes a no-op."
+ *
+ * # State machine
+ *
+ *   warming   → fewer than `windowSize` chunks observed
+ *               (rolling ratio is `null` → recommend keep-client by default)
+ *   client    → rolling ratio ≥ threshold
+ *               (v1 is comfortable — keep the client transform)
+ *   escalate  → rolling ratio < threshold
+ *               (v1 is wasteful — switch to v2 backend predicate)
+ *
+ * Transitions happen on each `record()` call. The meter is monotonic in
+ * "chunks observed" but the regime itself can oscillate (rare in practice —
+ * the rolling average smooths noise).
+ *
+ * @packageDocumentation
+ */
+
+export interface HitRatioWindow {
+    /** 1-based chunk index. */
+    chunk: number
+    /** Rows the predicate filter saw at this chunk. */
+    scanned: number
+    /** Rows that passed the predicate at this chunk. */
+    matched: number
+    /** Per-chunk pass ratio (matched / scanned, 0..1). */
+    ratio: number
+}
+
+export type HitRatioState = "warming" | "client" | "escalate"
+
+export interface HitRatioRegime {
+    /** Current recommendation. */
+    state: HitRatioState
+    /** Rolling-window ratio (matched/scanned summed over the window). Null while warming. */
+    rollingRatio: number | null
+    /** How many chunks have been recorded so far. */
+    chunksObserved: number
+    /** Window size (number of most recent chunks pooled into the rolling ratio). */
+    windowSize: number
+    /** Threshold the rolling ratio is compared against. */
+    threshold: number
+    /** Human-readable single-line explanation suitable for logs / banners. */
+    reason: string
+}
+
+export interface HitRatioMeterOptions {
+    /**
+     * Number of recent chunks to average over. Default 3, matching the RFC's
+     * "below threshold over 3 windows" trigger.
+     */
+    windowSize?: number
+    /**
+     * Rolling-ratio threshold for escalation. Default 0.10 (10%) — the RFC's
+     * recommended starting point. Below this → recommend v2.
+     */
+    threshold?: number
+}
+
+export interface HitRatioMeter {
+    /** Record a chunk's stats. Idempotent on repeated calls for the same chunk index. */
+    record: (args: {chunk: number; scanned: number; matched: number}) => void
+    /** Compute the current regime. Pure read — does not mutate state. */
+    regime: () => HitRatioRegime
+    /** All recorded windows, in chunk order. */
+    windows: () => HitRatioWindow[]
+    /** Drop all observations — useful when starting a new predicate. */
+    reset: () => void
+    /** Configured window size + threshold (for diagnostics). */
+    readonly config: {windowSize: number; threshold: number}
+}
+
+const DEFAULT_WINDOW = 3
+const DEFAULT_THRESHOLD = 0.1
+
+export function createHitRatioMeter(options: HitRatioMeterOptions = {}): HitRatioMeter {
+    const windowSize = options.windowSize ?? DEFAULT_WINDOW
+    const threshold = options.threshold ?? DEFAULT_THRESHOLD
+
+    if (windowSize < 1) throw new Error(`windowSize must be >= 1, got ${windowSize}`)
+    if (threshold < 0 || threshold > 1) {
+        throw new Error(`threshold must be between 0 and 1, got ${threshold}`)
+    }
+
+    let observed: HitRatioWindow[] = []
+    const seenChunks = new Set<number>()
+
+    function record(args: {chunk: number; scanned: number; matched: number}) {
+        // Dedup by chunk index — the predicate filter emits one event per
+        // predicate per chunk, but for meter purposes we only need one entry
+        // per chunk. Caller is responsible for passing aggregate stats.
+        if (seenChunks.has(args.chunk)) return
+        seenChunks.add(args.chunk)
+        observed.push({
+            chunk: args.chunk,
+            scanned: args.scanned,
+            matched: args.matched,
+            ratio: args.scanned > 0 ? args.matched / args.scanned : 0,
+        })
+    }
+
+    function regime(): HitRatioRegime {
+        const chunksObserved = observed.length
+        if (chunksObserved < windowSize) {
+            return {
+                state: "warming",
+                rollingRatio: null,
+                chunksObserved,
+                windowSize,
+                threshold,
+                reason: `warming (${chunksObserved}/${windowSize} chunks observed — need ${windowSize} before recommending)`,
+            }
+        }
+
+        const tail = observed.slice(-windowSize)
+        const totalScanned = tail.reduce((a, w) => a + w.scanned, 0)
+        const totalMatched = tail.reduce((a, w) => a + w.matched, 0)
+        const rollingRatio = totalScanned > 0 ? totalMatched / totalScanned : 0
+
+        if (rollingRatio < threshold) {
+            return {
+                state: "escalate",
+                rollingRatio,
+                chunksObserved,
+                windowSize,
+                threshold,
+                reason: `rolling ratio ${(rollingRatio * 100).toFixed(1)}% < ${(threshold * 100).toFixed(0)}% threshold over last ${windowSize} chunks — recommend v2 server-side filter`,
+            }
+        }
+
+        return {
+            state: "client",
+            rollingRatio,
+            chunksObserved,
+            windowSize,
+            threshold,
+            reason: `rolling ratio ${(rollingRatio * 100).toFixed(1)}% ≥ ${(threshold * 100).toFixed(0)}% threshold over last ${windowSize} chunks — v1 client filter is appropriate`,
+        }
+    }
+
+    function reset() {
+        observed = []
+        seenChunks.clear()
+    }
+
+    return {
+        record,
+        regime,
+        windows: () => observed.slice(),
+        reset,
+        config: {windowSize, threshold},
+    }
+}
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/hydrateScenariosTransform.ts b/web/packages/agenta-entities/src/evaluationRun/etl/hydrateScenariosTransform.ts
new file mode 100644
index 0000000000..8c2cd4a38f
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/hydrateScenariosTransform.ts
@@ -0,0 +1,439 @@
+/**
+ * hydrateScenariosTransform — joins scenarios with their correlated entities.
+ *
+ * Scenarios as returned by `/evaluations/scenarios/query` are *references*:
+ * they carry an id, a status, a run_id, and a testcase_id. To render anything
+ * meaningful in the UI (input data, app outputs, evaluator scores, traces) we
+ * have to join 4 additional entities, each fetched in bulk by the IDs present
+ * in the chunk:
+ *
+ *   - results   (one per `step_key`):   POST /evaluations/results/query
+ *   - metrics   (per-scenario scores):  POST /evaluations/metrics/query
+ *   - testcases (input data):           POST /testcases/query
+ *   - traces    (app outputs/spans):    POST /tracing/spans/query
+ *                                       (filter: trace_id IN [...])
+ *
+ * This factory returns a `Transform<EvaluationScenario, HydratedScenarioRow>`
+ * that runs the four fetches per chunk in two parallel stages (results + metrics,
+ * then testcases + traces). The architecture RFC calls this `correlatedDataPrefetch`
+ * (Convention 7); here it is an explicit pipeline stage rather than a side-effect
+ * on chunk arrival, so the downstream sink receives fully materialized rows.
+ *
+ * Per-chunk request budget: 4 bulk calls (results, metrics, testcases, traces).
+ * Independent of chunk size or column count.
+ *
+ * Each call uses the **entities-package API surface** (queryEvaluationResults,
+ * queryEvaluationMetrics, fetchTestcasesBatch, fetchAllPreviewTraces). That's
+ * the load-bearing claim: hydration goes through the same code path as cell
+ * rendering, so anything we build here drops straight into a real store.
+ *
+ * @packageDocumentation
+ */
+
+import type {Transform, Chunk} from "../../etl/core/types"
+import {fetchTestcasesBatch} from "../../testcase/api"
+import type {Testcase} from "../../testcase/core"
+import {fetchAllPreviewTraces} from "../../trace/api"
+import {queryEvaluationResults, queryEvaluationMetrics} from "../api"
+import type {EvaluationResult, EvaluationMetric} from "../core"
+
+/**
+ * Minimal scenario shape this transform consumes. The full schema lives in
+ * `realScenarioSource.ts` as `RealEvaluationScenario`, but consumers may pass
+ * any object that carries an `id` and (optionally) a `testcase_id`.
+ */
+export interface HydratableScenario {
+    id: string
+    testcase_id?: string | null
+    [k: string]: unknown
+}
+
+/**
+ * The output of the hydrate transform — a row fully joined to its correlated
+ * entities. Sinks that consume this know enough to render any column in the
+ * UI without further fetches.
+ */
+export interface HydratedScenarioRow<TScenario extends HydratableScenario = HydratableScenario> {
+    scenario: TScenario
+    /** All results (one per step_key) for this scenario. May be empty if the run is still in progress. */
+    results: EvaluationResult[]
+    /** Per-scenario metrics. Often one row keyed by step_key in `metric.data`, but the API doesn't constrain count. */
+    metrics: EvaluationMetric[]
+    /** Testcase referenced by scenario.testcase_id. Null if no reference or fetch failed. */
+    testcase: Testcase | null
+    /** Trace data keyed by trace_id (dashes preserved). May be empty if no result.trace_id existed yet. */
+    traces: Record<string, unknown>
+}
+
+/**
+ * Pluggable fetcher contracts.
+ *
+ * The hydrate transform doesn't know how to fetch anything — it just describes
+ * what it needs (ids in, joined data out). Each fetcher is injected, so the
+ * same transform runs against:
+ *
+ *   - raw HTTP fetchers (Node scripts, ETL)            ← current default
+ *   - TanStack-cached fetchers (browser, dedupes)      ← drop-in upgrade
+ *   - molecule.actions.prefetchMany(ids) (full entity layer) ← future
+ *
+ * As entity-layer abstractions land (molecules with prefetch actions,
+ * traceBatchFetcher export, etc.), callers swap them in here. The transform
+ * doesn't change.
+ */
+export interface HydrateFetchers {
+    /** Bulk fetch results by scenario IDs. */
+    fetchResults: (args: {
+        projectId: string
+        runId: string
+        scenarioIds: string[]
+    }) => Promise<EvaluationResult[]>
+    /** Bulk fetch metrics by scenario IDs. */
+    fetchMetrics: (args: {
+        projectId: string
+        runId: string
+        scenarioIds: string[]
+    }) => Promise<EvaluationMetric[]>
+    /** Bulk fetch testcases by IDs. Returns Map<id, Testcase>. */
+    fetchTestcases: (args: {
+        projectId: string
+        testcaseIds: string[]
+    }) => Promise<Map<string, Testcase>>
+    /**
+     * Bulk fetch traces by IDs. Returns Map<traceId (dashed), traceEnvelope>.
+     * Implementations are responsible for any ID canonicalisation.
+     */
+    fetchTraces: (args: {projectId: string; traceIds: string[]}) => Promise<Map<string, unknown>>
+}
+
+export interface HydrateScenariosTransformParams {
+    /** Project scope for all sub-fetches. */
+    projectId: string
+    /** Run scope for results + metrics queries. */
+    runId: string
+    /**
+     * Override individual fetchers. Anything you don't pass falls back to
+     * the API-direct defaults (raw HTTP, no entity-cache integration). Use
+     * this slot to plug in molecule-backed or batch-fetcher-backed versions
+     * once they exist.
+     */
+    fetchers?: Partial<HydrateFetchers>
+    /**
+     * Skip the trace fetch. Useful when the pipeline only needs scores +
+     * input data (e.g. for table summary rendering) and traces are drilled
+     * into on demand. Defaults to false (traces are fetched).
+     */
+    skipTraces?: boolean
+    /**
+     * Skip the testcase fetch. Useful for pipelines that only need scores.
+     * Defaults to false.
+     */
+    skipTestcases?: boolean
+    /**
+     * Optional per-chunk callback with counts and timings, for measuring
+     * hydrate cost without coupling the transform to logging. Timings are per
+     * stage, not per fetch: results/metrics share stage 1, testcases/traces stage 2.
+     */
+    onChunkHydrated?: (info: {
+        chunkScenarios: number
+        resultsFetched: number
+        metricsFetched: number
+        testcasesFetched: number
+        tracesFetched: number
+        resultsMs: number
+        metricsMs: number
+        testcasesMs: number
+        tracesMs: number
+        totalMs: number
+    }) => void
+}
+
+/**
+ * Default fetchers — raw HTTP via the entities-package api layer.
+ *
+ * These do NOT consult the entity cache. They will refetch data even when
+ * the same testcase / trace / metric is already in the TanStack cache from
+ * another view. Acceptable for headless scripts and one-shot ETL runs;
+ * upgrade to cache-aware fetchers in long-lived browser sessions.
+ */
+export const DEFAULT_HYDRATE_FETCHERS: HydrateFetchers = {
+    fetchResults: queryEvaluationResults,
+    fetchMetrics: queryEvaluationMetrics,
+    fetchTestcases: ({projectId, testcaseIds}) => fetchTestcasesBatch({projectId, testcaseIds}),
+    fetchTraces: async ({projectId, traceIds}) => {
+        // Mirror what trace/state/store.ts:traceBatchFetcher does at the API
+        // level: canonicalise IDs (strip dashes), bulk-fetch via IN filter,
+        // rekey by the dashed form so the caller can look up by the value
+        // they see in result.trace_id.
+        const out = new Map<string, unknown>()
+        if (traceIds.length === 0) return out
+        const canonicalIds = traceIds.map((id) => id.replace(/-/g, ""))
+        const data = await fetchAllPreviewTraces(
+            {
+                focus: "trace",
+                format: "agenta",
+                filter: JSON.stringify({
+                    conditions: [{field: "trace_id", operator: "in", value: canonicalIds}],
+                }),
+            },
+            "",
+            projectId,
+        )
+        const tracesObj = (data as {traces?: Record<string, unknown>} | null)?.traces ?? {}
+        traceIds.forEach((traceId, idx) => {
+            const canon = canonicalIds[idx]
+            if (tracesObj[canon] !== undefined) out.set(traceId, tracesObj[canon])
+        })
+        return out
+    },
+}
+
+/**
+ * Build a `Transform<TScenario, HydratedScenarioRow<TScenario>>` that joins
+ * each chunk of scenarios with its correlated entities.
+ *
+ * Usage:
+ * ```ts
+ * const hydrate = makeHydrateScenariosTransform({projectId, runId})
+ *
+ * for await (const progress of runLoop(scenarioSource, [hydrate], hydratedSink, undefined)) {
+ *   // ...
+ * }
+ * ```
+ *
+ * Per-chunk behaviour:
+ *
+ * 1. Collect scenario_ids, then fetch results and metrics in parallel (stage 1).
+ * 2. Collect testcase_ids (scenarios ∪ results) and trace_ids (results).
+ * 3. Fetch testcases and traces in parallel, one bulk call each (stage 2).
+ * 4. Group results / metrics by scenario_id, look up testcase + traces, emit
+ *    a hydrated row per scenario.
+ */
+export function makeHydrateScenariosTransform<TScenario extends HydratableScenario>(
+    params: HydrateScenariosTransformParams,
+): Transform<TScenario, HydratedScenarioRow<TScenario>> {
+    const {
+        projectId,
+        runId,
+        skipTraces = false,
+        skipTestcases = false,
+        onChunkHydrated,
+        fetchers: fetcherOverrides,
+    } = params
+    const fetchers: HydrateFetchers = {
+        ...DEFAULT_HYDRATE_FETCHERS,
+        ...(fetcherOverrides ?? {}),
+    }
+
+    return async (chunk: Chunk<TScenario>): Promise<Chunk<HydratedScenarioRow<TScenario>>> => {
+        const totalStart = performance.now()
+
+        const scenarios = chunk.items
+        const scenarioIds = scenarios.map((s) => s.id).filter(Boolean)
+
+        // Empty chunk fast-path — nothing to hydrate, propagate cursor unchanged.
+        if (scenarios.length === 0) {
+            onChunkHydrated?.({
+                chunkScenarios: 0,
+                resultsFetched: 0,
+                metricsFetched: 0,
+                testcasesFetched: 0,
+                tracesFetched: 0,
+                resultsMs: 0,
+                metricsMs: 0,
+                testcasesMs: 0,
+                tracesMs: 0,
+                totalMs: 0,
+            })
+            return {
+                items: [],
+                cursor: chunk.cursor,
+                meta: {...(chunk.meta as Record<string, unknown> | undefined), hydrated: true},
+            }
+        }
+
+        // -----------------------------------------------------------------
+        // Stage 1 — fan out results + metrics in parallel.
+        //
+        // We cannot fetch testcases yet because the run schema may carry
+        // testcase_id on the input-step's *result*, not on the scenario.
+        // We collect testcase_ids from both scenarios AND results in stage 2.
+        // -----------------------------------------------------------------
+
+        const resultsStart = performance.now()
+        const metricsStart = performance.now()
+
+        const [results, metrics] = await Promise.all([
+            fetchers.fetchResults({projectId, runId, scenarioIds}).catch((e) => {
+                console.warn(
+                    `[hydrateScenarios] results fetch failed: ${e instanceof Error ? e.message : e}`,
+                )
+                return [] as EvaluationResult[]
+            }),
+            fetchers.fetchMetrics({projectId, runId, scenarioIds}).catch((e) => {
+                console.warn(
+                    `[hydrateScenarios] metrics fetch failed: ${e instanceof Error ? e.message : e}`,
+                )
+                return [] as EvaluationMetric[]
+            }),
+        ])
+
+        const resultsMs = performance.now() - resultsStart
+        const metricsMs = performance.now() - metricsStart
+
+        // -----------------------------------------------------------------
+        // Stage 2 — testcases + traces (both depend on results), in parallel.
+        //   - testcase_ids come from scenario.testcase_id ∪ result.testcase_id
+        //   - trace_ids   come from result.trace_id
+        // -----------------------------------------------------------------
+
+        const testcaseIds = Array.from(
+            new Set(
+                [
+                    ...scenarios.map((s) => s.testcase_id),
+                    ...results.map((r) => r.testcase_id),
+                ].filter((v): v is string => typeof v === "string" && v.length > 0),
+            ),
+        )
+
+        const testcasesStart = performance.now()
+        const tracesStart = performance.now()
+        let traceMap: Record<string, unknown> = {}
+        let tracesFetched = 0
+        let testcaseMap = new Map<string, Testcase>()
+
+        const stage2Tasks: Promise<unknown>[] = []
+
+        if (!skipTestcases && testcaseIds.length > 0) {
+            stage2Tasks.push(
+                fetchers
+                    .fetchTestcases({projectId, testcaseIds})
+                    .then((m) => {
+                        testcaseMap = m
+                    })
+                    .catch((e) => {
+                        console.warn(
+                            `[hydrateScenarios] testcases fetch failed: ${e instanceof Error ? e.message : e}`,
+                        )
+                    }),
+            )
+        }
+
+        if (!skipTraces) {
+            const traceIds = Array.from(
+                new Set(
+                    results
+                        .map((r) => r.trace_id)
+                        .filter((v): v is string => typeof v === "string" && v.length > 0),
+                ),
+            )
+
+            if (traceIds.length > 0) {
+                stage2Tasks.push(
+                    fetchers
+                        .fetchTraces({projectId, traceIds})
+                        .then((m) => {
+                            m.forEach((trace, traceId) => {
+                                traceMap[traceId] = trace
+                                tracesFetched++
+                            })
+                        })
+                        .catch((e) => {
+                            console.warn(
+                                `[hydrateScenarios] traces fetch failed: ${e instanceof Error ? e.message : e}`,
+                            )
+                        }),
+                )
+            }
+        }
+
+        await Promise.all(stage2Tasks)
+
+        const testcasesMs = performance.now() - testcasesStart
+        const tracesMs = performance.now() - tracesStart
+
+        // -----------------------------------------------------------------
+        // Stage 3 — group results/metrics by scenario, emit hydrated rows.
+        // -----------------------------------------------------------------
+
+        const resultsByScenario = new Map<string, EvaluationResult[]>()
+        for (const r of results) {
+            const arr = resultsByScenario.get(r.scenario_id) ?? []
+            arr.push(r)
+            resultsByScenario.set(r.scenario_id, arr)
+        }
+
+        const metricsByScenario = new Map<string, EvaluationMetric[]>()
+        for (const m of metrics) {
+            const sid = m.scenario_id ?? null
+            if (!sid) continue // run-level aggregate; not joined to a row
+            const arr = metricsByScenario.get(sid) ?? []
+            arr.push(m)
+            metricsByScenario.set(sid, arr)
+        }
+
+        const hydrated: HydratedScenarioRow<TScenario>[] = scenarios.map((scenario) => {
+            const rowResults = resultsByScenario.get(scenario.id) ?? []
+            const rowMetrics = metricsByScenario.get(scenario.id) ?? []
+
+            // Testcase resolution — try scenario.testcase_id first, then fall
+            // back to any result.testcase_id (input step results carry it when
+            // the scenario itself doesn't). This handles both legacy and
+            // current run-graph schemas.
+            const scenarioTcId =
+                typeof scenario.testcase_id === "string" ? scenario.testcase_id : null
+            const resultTcId = rowResults
+                .map((r) => r.testcase_id)
+                .find((v): v is string => typeof v === "string" && v.length > 0)
+            const effectiveTcId = scenarioTcId ?? resultTcId ?? null
+            const testcase = effectiveTcId ? (testcaseMap.get(effectiveTcId) ?? null) : null
+
+            // Only include traces this row actually references — keeps row payload
+            // bounded; callers can still cross-reference by trace_id if needed.
+            const rowTraces: Record<string, unknown> = {}
+            for (const r of rowResults) {
+                if (r.trace_id && traceMap[r.trace_id] !== undefined) {
+                    rowTraces[r.trace_id] = traceMap[r.trace_id]
+                }
+            }
+
+            return {
+                scenario,
+                results: rowResults,
+                metrics: rowMetrics,
+                testcase,
+                traces: rowTraces,
+            }
+        })
+
+        const totalMs = performance.now() - totalStart
+
+        onChunkHydrated?.({
+            chunkScenarios: scenarios.length,
+            resultsFetched: results.length,
+            metricsFetched: metrics.length,
+            testcasesFetched: testcaseMap.size,
+            tracesFetched,
+            resultsMs,
+            metricsMs,
+            testcasesMs,
+            tracesMs,
+            totalMs,
+        })
+
+        return {
+            items: hydrated,
+            cursor: chunk.cursor,
+            meta: {
+                ...(chunk.meta as Record<string, unknown> | undefined),
+                hydrated: true,
+                hydrateCounts: {
+                    scenarios: scenarios.length,
+                    results: results.length,
+                    metrics: metrics.length,
+                    testcases: testcaseMap.size,
+                    traces: tracesFetched,
+                },
+            },
+        }
+    }
+}
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/index.ts b/web/packages/agenta-entities/src/evaluationRun/etl/index.ts
new file mode 100644
index 0000000000..4bb71e0faf
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/index.ts
@@ -0,0 +1,151 @@
+/**
+ * @agenta/entities/evaluationRun/etl
+ *
+ * Eval-specific ETL adapters. See docs/designs/eval-etl-engine.md for
+ * the design.
+ *
+ * Exposed, besides the export groups documented inline below:
+ *   - makeRealScenarioSource: minimal real Source that hits
+ *     /evaluations/scenarios/query directly. Used by the PoC; will
+ *     eventually be replaced by makeSource(scenariosPaginatedStore)
+ *     once Phase 1-2 of the architecture RFC lands.
+ *
+ * @packageDocumentation
+ */
+
+export type {RealEvaluationScenario, RealScenarioSourceParams} from "./realScenarioSource"
+export {makeRealScenarioSource} from "./realScenarioSource"
+
+// Hydrate transform — joins each scenario chunk to its correlated entities
+// (results, metrics, testcases, traces) via injected HydrateFetchers.
+export type {
+    HydratableScenario,
+    HydratedScenarioRow,
+    HydrateScenariosTransformParams,
+    HydrateFetchers,
+} from "./hydrateScenariosTransform"
+export {makeHydrateScenariosTransform, DEFAULT_HYDRATE_FETCHERS} from "./hydrateScenariosTransform"
+
+// Column resolver — declarative, driven by run.data.steps[].type and the
+// run's column mappings. Groups columns by source (testset / application /
+// evaluator / metrics) so the UI can mirror the screenshot's grouped header
+// layout with no name-collision risk across multiple evaluators.
+export type {
+    RunStep,
+    RunMapping,
+    RunSchema,
+    ResolveSource,
+    ResolvedColumn,
+    ResolveContext,
+    StepResolver,
+    ResolveMappingsOptions,
+    ColumnGroup,
+    ResolvedColumnGroup,
+    RunColumnLeaf,
+    RunColumnGroup,
+} from "./resolveMappings"
+export {
+    DEFAULT_STEP_RESOLVERS,
+    resolveFromTestcase,
+    resolveFromTrace,
+    resolveFromMetric,
+    composeResolvers,
+    findInTrace,
+    getAtPath,
+    resolveMappings,
+    computeColumnGroup,
+    groupResolvedColumns,
+    groupRunColumns,
+} from "./resolveMappings"
+
+// Molecule-backed cache-aware fetchers — all 4 entity types go through
+// the entity layer (TanStack cache read, bulk-fetch misses, write-back).
+export {
+    buildMoleculeBackedFetchers,
+    MOLECULE_BACKED_HYDRATE_FETCHERS,
+    CACHE_AWARE_HYDRATE_FETCHERS, // @deprecated alias
+    cacheAwareFetchTestcases,
+    type EntityCacheStats,
+    type ChunkCacheStats,
+    type BuildMoleculeFetchersOptions,
+} from "./cacheAwareFetchers"
+
+// Cache diagnostics — inspect the TanStack cache + atom family sizes
+export {
+    DEFAULT_DIAGNOSTIC_PREFIXES,
+    inspectCache,
+    inspectMemory,
+    clearCacheByPrefix,
+    type CacheDiagnostics,
+    type CacheSliceStats,
+    type MemorySnapshot,
+} from "./cacheDiagnostics"
+// Atom family registry — direct access for tests / advanced consumers
+export {
+    inspectAtomFamilies,
+    clearAllAtomFamilies,
+    instrumentedAtomFamily,
+    type AtomFamilyStats,
+    type InstrumentedAtomFamily,
+    type InstrumentedAtomFamilyOptions,
+} from "../../shared/molecule/instrumentedAtomFamily"
+
+// Post-hydrate predicate filter — value-equality against resolved UI columns.
+// Per eval-filtering.md §D2: this is the v1 frontend transform over already-
+// loaded metric data. v2 server-side filter swaps the source's `filtering`
+// param and this transform becomes a no-op.
+//
+// Multi-predicate AND/OR composition (decision D8) — `PredicateGroup` plus
+// the `evaluate*` / `matchesRowFilter` row-level entry points and the
+// `makePredicateGroupFilter` pipeline transform.
+export {
+    makeRowPredicateFilter,
+    makePredicateGroupFilter,
+    unwrapStatsForCompare,
+    isPredicateGroup,
+    evaluateRowPredicate,
+    evaluatePredicateGroup,
+    evaluateRowFilter,
+    matchesRowFilter,
+    type RowPredicate,
+    type PredicateGroup,
+    type RowFilter,
+    type PredicateFilterOptions,
+    type PredicateGroupFilterOptions,
+} from "./rowPredicateFilter"
+
+// filterSchema — derives the filterable fields (typed + type-matched
+// operators) the Phase 2 filter UI offers. Decision D8 / eval-filtering D4.
+export {
+    buildFilterSchema,
+    operatorsForType,
+    type FilterSchema,
+    type FilterableField,
+    type FilterValueType,
+    type FilterOperator,
+    type BuildFilterSchemaOptions,
+} from "./filterSchema"
+
+// Hit-ratio meter — v1→v2 escalation signal (reports the regime; doesn't
+// swap engines today). Per eval-filtering.md §D2 + §C3: tracks rolling
+// (matched/scanned) and recommends escalating to v2 when the ratio falls
+// below threshold.
+export {
+    createHitRatioMeter,
+    type HitRatioMeter,
+    type HitRatioMeterOptions,
+    type HitRatioRegime,
+    type HitRatioState,
+    type HitRatioWindow,
+} from "./hitRatioMeter"
+
+// Predicate → entity slice resolver — drives filter-aware hydrate so we
+// don't fetch slices the active predicate(s) never touch (e.g. skip
+// trace fetches when the filter only references evaluator metrics).
+// Follows the same step.type → entity convention as resolveMappings, which
+// goes column → value; this goes column → entity-slice.
+export {
+    predicateToEntitySlices,
+    type EntitySlice,
+    type PredicateSliceResult,
+} from "./predicateToEntitySlices"
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/predicateToEntitySlices.ts b/web/packages/agenta-entities/src/evaluationRun/etl/predicateToEntitySlices.ts
new file mode 100644
index 0000000000..4a8bca4c6a
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/predicateToEntitySlices.ts
@@ -0,0 +1,184 @@
+/**
+ * predicateToEntitySlices
+ *
+ * Given a run schema + active predicate(s), return the minimum set of
+ * entity slices the hydrate stage needs to fetch in order to evaluate
+ * the predicates. The downstream effect: predicate-driven hydrate skips
+ * slices the predicate doesn't touch, cutting network for the common
+ * filter case by ~50-75%.
+ *
+ * Mapping is derived from the same step.type → entity convention
+ * `resolveMappings` uses on the read side:
+ *
+ *   testset      step.type = "input"       → reads from testcase
+ *   application  step.type = "invocation"  → reads from trace (via result.trace_id)
+ *   evaluator    step.type = "annotation"  → reads from result + metric
+ *                                            (composeResolvers(metric, trace))
+ *   metrics      (path is attributes.ag.metrics.*) → reads from metric
+ *
+ * Results are also fetched implicitly when testcase or trace slices are
+ * needed (testcase_id and trace_id live on result rows, not on the scenario
+ * itself). A `metrics`-group predicate needs only the metrics slice.
+ *
+ * @see resolveMappings.ts (the reverse-direction resolver — given a column
+ *      shape, return the value from a hydrated row)
+ */
+
+import type {ColumnGroup, RunMapping, RunSchema, RunStep} from "./resolveMappings"
+import {computeColumnGroup} from "./resolveMappings"
+import type {PredicateGroup, RowPredicate} from "./rowPredicateFilter"
+import {isPredicateGroup} from "./rowPredicateFilter"
+
+export type EntitySlice = "results" | "metrics" | "testcases" | "traces"
+
+const ALL_SLICES: readonly EntitySlice[] = ["results", "metrics", "testcases", "traces"] as const
+
+export interface PredicateSliceResult {
+    /** Which entity slices the predicate(s) actually need. */
+    slices: Set<EntitySlice>
+    /**
+     * Which columns the predicate(s) match — for diagnostics + future
+     * "narrowly fetch only this column" optimizations.
+     */
+    matchedColumns: {
+        groupKind: ColumnGroup["kind"]
+        groupSlug: string | null
+        columnName: string
+        sliceContributions: EntitySlice[]
+    }[]
+    /**
+     * True if the resolver couldn't map a predicate's column back to a
+     * step (e.g. column name doesn't appear in any mapping). When true,
+     * caller should fall back to fetching all slices to stay correct —
+     * over-fetching is safer than dropping a predicate silently.
+     */
+    fallbackToAll: boolean
+}
+
+/**
+ * Compute the slice set for a single predicate against a run schema.
+ * Returns `null` if the predicate references a column the schema doesn't
+ * surface (signals "fall back to all slices" at the caller).
+ */
+function sliceForPredicate(schema: RunSchema, predicate: RowPredicate): EntitySlice[] | null {
+    // 1. Find the mapping that matches this predicate's column.
+    const stepByKey = new Map<string, RunStep>()
+    for (const s of schema.steps) stepByKey.set(s.key, s)
+
+    let matchedMapping: RunMapping | null = null
+    let matchedGroup: ColumnGroup | null = null
+
+    for (const m of schema.mappings) {
+        const columnName = m.column?.name
+        if (typeof columnName !== "string" || columnName !== predicate.columnName) continue
+        const step = m.step?.key ? (stepByKey.get(m.step.key) ?? null) : null
+        const group = computeColumnGroup(step, m.step?.path ?? "")
+        if (group.kind !== predicate.groupKind) continue
+        if (predicate.groupSlug != null && group.slug !== predicate.groupSlug) continue
+        matchedMapping = m
+        matchedGroup = group
+        break
+    }
+
+    if (!matchedMapping || !matchedGroup) return null
+
+    // 2. Map group → entity slices.
+    //
+    // results is the join graph's root for testcase + trace fetches:
+    //   testcase_id lives on result rows, not scenarios
+    //   trace_id lives on result rows
+    // So any predicate that needs testcase or trace transitively needs results.
+
+    const slices: EntitySlice[] = []
+    switch (matchedGroup.kind) {
+        case "testset":
+            slices.push("results", "testcases")
+            break
+        case "application":
+            // invocation step's value is span-resident — need trace.
+            slices.push("results", "traces")
+            break
+        case "evaluator":
+            // Annotation outputs live in metric.data — the metric writer
+            // unfolds the evaluator's emitted attributes (incl.
+            // `attributes.ag.data.outputs.*` AND `attributes.ag.metrics.*`)
+            // as flat keys under `data[stepKey][path]`. composeResolvers
+            // does (metric → trace), so trace is only used as a fallback
+            // when an evaluator wrote span-only outputs that didn't make
+            // it into metrics — a rare edge case.
+            //
+            // For predicate hydrate we trust metric is canonical. Skipping
+            // traces here drops the heaviest endpoint (~70% of bytes,
+            // ~60% of loop time on the 1000-scenario reference run) for
+            // the common evaluator-filter case.
+            //
+            // If the predicate column ever turns out to be span-only
+            // (evaluator didn't write to metric.data), the cell-side
+            // materializer requests traces on first cell render, and the
+            // predicate filter's "keep visible until known" fallback
+            // keeps rows displayed during that lag. Correctness
+            // preserved, performance recovered.
+            slices.push("results", "metrics")
+            break
+        case "metrics":
+            slices.push("metrics")
+            break
+        case "other":
+            // Unknown shape — be conservative and fetch everything for this row.
+            return null
+    }
+    return Array.from(new Set(slices))
+}
+
+/**
+ * Resolve the full set of slices needed across all active predicates.
+ *
+ * Accepts a single predicate, a predicate array, or a `PredicateGroup`
+ * (flat AND/OR — decision D8). For a group the slice set is the **union**
+ * of every condition's slices: evaluating either an AND or an OR needs the
+ * data behind every condition, so the boolean operator does not change the
+ * fetch set.
+ *
+ * Empty predicate set = no filter active = no predicate-driven fetch
+ * required. Caller decides what to do (fetch all for display, or wait
+ * for cells to materialize themselves).
+ */
+export function predicateToEntitySlices(
+    schema: RunSchema | null,
+    predicates: RowPredicate | RowPredicate[] | PredicateGroup | null | undefined,
+): PredicateSliceResult {
+    if (!schema || !predicates) {
+        return {slices: new Set(), matchedColumns: [], fallbackToAll: false}
+    }
+    const list: RowPredicate[] = Array.isArray(predicates)
+        ? predicates
+        : isPredicateGroup(predicates)
+          ? predicates.conditions
+          : [predicates]
+    if (list.length === 0) {
+        return {slices: new Set(), matchedColumns: [], fallbackToAll: false}
+    }
+
+    const acc = new Set<EntitySlice>()
+    const matched: PredicateSliceResult["matchedColumns"] = []
+    let fallback = false
+
+    for (const p of list) {
+        const slicesForP = sliceForPredicate(schema, p)
+        if (slicesForP === null) {
+            // Unresolvable column → over-fetch everything to stay correct.
+            fallback = true
+            for (const s of ALL_SLICES) acc.add(s)
+            continue
+        }
+        for (const s of slicesForP) acc.add(s)
+        matched.push({
+            groupKind: p.groupKind,
+            groupSlug: p.groupSlug ?? null,
+            columnName: p.columnName,
+            sliceContributions: slicesForP,
+        })
+    }
+
+    return {slices: acc, matchedColumns: matched, fallbackToAll: fallback}
+}
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/realScenarioSource.ts b/web/packages/agenta-entities/src/evaluationRun/etl/realScenarioSource.ts
new file mode 100644
index 0000000000..656fea24a3
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/realScenarioSource.ts
@@ -0,0 +1,167 @@
+/**
+ * Real evaluation scenario source — hits the actual `/evaluations/scenarios/query`
+ * endpoint and yields chunks of EvaluationScenario.
+ *
+ * This is the minimum-viable real source for the PoC. It deliberately does NOT:
+ *   - Wrap createPaginatedEntityStore (that's Phase 2 of the integration)
+ *   - Implement the correlatedDataPrefetch hook (that's Phase 1c of the architecture RFC)
+ *   - Validate predicates against a FilterSchema (that's D4 of the filter RFC)
+ *   - Plug into Jotai (that's not needed for headless validation)
+ *
+ * It DOES:
+ *   - Hit the real Agenta API with proper auth
+ *   - Honor the cursor pagination contract (windowing.next opaque string)
+ *   - Yield chunks shaped like a real Source<EvaluationScenario>
+ *   - Honor AbortSignal
+ *
+ * Use in headless scripts:
+ *
+ * ```ts
+ * import {makeRealScenarioSource} from "@agenta/entities/evaluationRun/etl"
+ *
+ * const source = makeRealScenarioSource({
+ *   baseUrl: process.env.AGENTA_API_URL!,
+ *   apiKey: process.env.AGENTA_API_KEY!,
+ *   projectId: process.env.AGENTA_PROJECT_ID!,
+ *   runId: process.env.AGENTA_RUN_ID!,
+ *   chunkSize: 200,
+ * })
+ *
+ * for await (const chunk of source.extract(undefined, abort.signal)) {
+ *   console.log(`${chunk.items.length} scenarios, next=${chunk.cursor}`)
+ * }
+ * ```
+ *
+ * @packageDocumentation
+ */
+
+import type {Source} from "../../etl/core/types"
+
+/**
+ * Minimal EvaluationScenario shape — what the API actually returns.
+ * In Phase 2 of the architecture RFC, this gets a proper Zod schema and
+ * lives in evaluationRun/core/schema.ts. For the PoC, this is enough.
+ */
+export interface RealEvaluationScenario {
+    id: string
+    status: string
+    created_at?: string
+    updated_at?: string
+    testcase_id?: string | null
+    timestamp?: string | null
+    [k: string]: unknown
+}
+
+export interface RealScenarioSourceParams {
+    /** Base URL of the Agenta API (e.g. http://localhost:8000) */
+    baseUrl: string
+    /** API key for Bearer auth */
+    apiKey: string
+    /** Project ID — sent as a query param */
+    projectId: string
+    /** Run ID — sent in the request body */
+    runId: string
+    /** Chunk size — sent as windowing.limit. Defaults to 200. */
+    chunkSize?: number
+    /** Ordering — "ascending" (default) or "descending" */
+    order?: "ascending" | "descending"
+}
+
+interface ScenariosResponse {
+    scenarios?: RealEvaluationScenario[]
+    windowing?: {
+        next?: string | null
+        oldest?: string | null
+        newest?: string | null
+        limit?: number
+        order?: string
+    }
+    [k: string]: unknown
+}
+
+/**
+ * Factory for the real evaluation-scenarios Source. The source yields chunks
+ * by repeatedly calling POST /evaluations/scenarios/query with the previous
+ * response's windowing.next cursor.
+ */
+export function makeRealScenarioSource(
+    params: RealScenarioSourceParams,
+): Source<RealEvaluationScenario, undefined> {
+    const {baseUrl, apiKey, projectId, runId, chunkSize = 200, order = "ascending"} = params
+    const endpoint = `${baseUrl.replace(/\/$/, "")}/evaluations/scenarios/query`
+
+    return {
+        async *extract(_params, signal) {
+            let cursor: string | null = null
+            let chunkIdx = 0
+
+            while (!signal.aborted) {
+                const body = {
+                    scenario: {run_id: runId},
+                    windowing: {
+                        next: cursor,
+                        limit: chunkSize,
+                        order,
+                    },
+                }
+
+                const url = `${endpoint}?project_id=${encodeURIComponent(projectId)}`
+
+                const res = await fetch(url, {
+                    method: "POST",
+                    headers: {
+                        "Content-Type": "application/json",
+                        // Agenta accepts both "ApiKey <key>" and bare "<key>"; using the
+                        // explicit prefix for clarity.
+                        Authorization: `ApiKey ${apiKey}`,
+                    },
+                    body: JSON.stringify(body),
+                    signal,
+                })
+
+                if (!res.ok) {
+                    const text = await res.text()
+                    throw new Error(
+                        `scenarios/query failed: ${res.status} ${res.statusText} — ${text.slice(0, 200)}`,
+                    )
+                }
+
+                const data: ScenariosResponse = await res.json()
+                const items = Array.isArray(data?.scenarios) ? data.scenarios : []
+
+                // Cursor resolution — three cases:
+                //   1. Server returned a `windowing` object with `next: <string>`:
+                //      authoritative — use it.
+                //   2. Server returned `windowing: {next: null}` (or omitted next
+                //      within a present windowing object): authoritative end-of-stream.
+                //      Skip the heuristic fallback; no extra RTT.
+                //   3. Server omitted `windowing` entirely (current local Agenta
+                //      behavior for /evaluations/scenarios/query): we don't know.
+                //      Use last-row-id heuristic when items.length === limit,
+                //      matching the OSS fallback in fetchEvaluationScenarioWindow.
+                //      Costs one extra RTT at end-of-stream (the "phantom chunk").
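+                // An explicit `windowing: null` is treated like an omitted object (case 3).
+                const windowingPresent = data?.windowing !== undefined && data.windowing !== null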
+                const apiNext = data?.windowing?.next ?? null
+                const fallbackCursor =
+                    items.length === chunkSize ? (items[items.length - 1]?.id ?? null) : null
+                const next: string | null = windowingPresent
+                    ? apiNext // Trust the server's explicit signal
+                    : (apiNext ?? fallbackCursor) // Server doesn't provide windowing — heuristic
+
+                // Also short-circuit if we got fewer rows than requested — definitive end
+                const definitivelyExhausted = items.length < chunkSize
+                const finalCursor: string | null = definitivelyExhausted ? null : next
+
+                yield {
+                    items,
+                    cursor: finalCursor,
+                    meta: {page: chunkIdx, hint: "real-scenarios"},
+                }
+
+                if (!finalCursor) return
+                cursor = finalCursor
+                chunkIdx++
+            }
+        },
+    }
+}
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/resolveMappings.ts b/web/packages/agenta-entities/src/evaluationRun/etl/resolveMappings.ts
new file mode 100644
index 0000000000..f2d9be9f86
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/resolveMappings.ts
@@ -0,0 +1,720 @@
+/**
+ * resolveMappings — turns a hydrated row + the run's schema (steps + mappings)
+ * into the named-column values the UI would render.
+ *
+ * # Why this exists
+ *
+ * Evaluation runs are self-describing. `run.data.steps` declares the eval
+ * graph and `run.data.mappings` declares the UI columns. Each mapping says
+ * "column X is at `step.path` on the step named `step.key`". The mapping is
+ * declarative — the renderer doesn't need to know *which* run type it is.
+ *
+ * Different runs (testset+app+evaluator, chat eval, multi-step, custom origin)
+ * use the same vocabulary but with different step compositions. To support
+ * any combination without growing a giant `if (kind === ...)` ladder, this
+ * module dispatches on **step.type** (`input`, `invocation`, `annotation`,
+ * or whatever custom type a workflow declares) and each step type has its
+ * own resolver strategy.
+ *
+ * # Resolution rules
+ *
+ * - **input** — the step's result carries `testcase_id`; the path is applied
+ *   to the joined testcase (e.g. `data.country` → `testcase.data.country`).
+ * - **invocation** — the step's result carries `trace_id`; the path is applied
+ *   to the trace's span tree (e.g. `attributes.ag.data.outputs`).
+ * - **annotation** — same as invocation, with `metric.data[step.key][path]` as
+ *   a faster pre-aggregated alternative (path is a flat key inside the metric
+ *   bucket, not a dot-walk — this matches the wire format).
+ *
+ * # Generalization, not special-casing
+ *
+ * The dispatch is on `step.type`, which the *run document* sets. Adding a new
+ * step type is done by registering a new strategy — never by editing this
+ * file's existing branches. The trace walker tolerates multiple envelope
+ * shapes (`{spans: {name: span}}`, `{response: {tree: [...]}}`, span arrays,
+ * deep child trees) so trace navigation doesn't break across run types or
+ * fetch endpoints.
+ *
+ * @packageDocumentation
+ */
+
+import type {EvaluationResult} from "../core"
+
+import type {HydratedScenarioRow, HydratableScenario} from "./hydrateScenariosTransform"
+
+// ============================================================================
+// Schema types (mirroring run.data.steps / run.data.mappings)
+// ============================================================================
+
+export interface RunStep {
+    key: string
+    /**
+     * Drives resolver selection. Built-in resolvers exist for "input",
+     * "invocation", "annotation". Custom workflows can register more.
+     */
+    type: string
+    origin?: string | null
+    references?: Record<string, {id: string; slug?: string; version?: string} | null> | null
+    inputs?: {key: string}[] | null
+}
+
+export interface RunMapping {
+    column?: {kind?: string | null; name?: string | null} | null
+    step?: {key: string; path?: string | null} | null
+}
+
+export interface RunSchema {
+    steps: RunStep[]
+    mappings: RunMapping[]
+}
+
+// ============================================================================
+// Output types
+// ============================================================================
+
+/**
+ * Where a resolved value came from. Useful for diagnostics + telemetry.
+ * Open-ended on purpose — custom resolvers can return any string.
+ */
+export type ResolveSource = string
+
+/**
+ * The column's source-namespace.
+ *
+ * A run can include multiple evaluators. Each evaluator emits its own
+ * `success`/`error`/etc columns. To avoid name collisions in the UI the
+ * columns are namespaced by their source entity (the testset, the
+ * application, the specific evaluator). Group info is computed from
+ * `step.type` + `step.references` + path heuristics.
+ *
+ * The screenshot's column headers map to ColumnGroup like:
+ *   "Testset testset-large"  → kind="testset", slug=<testset slug>
+ *   "Application comp-1"     → kind="application", slug=<app slug>
+ *   "Exact Match"            → kind="evaluator", slug="exact-match"
+ *   "Metrics"                → kind="metrics" (overrides step-type when path is under attributes.ag.metrics.*)
+ */
+export interface ColumnGroup {
+    /** Source category — drives header rendering and ordering. */
+    kind: "testset" | "application" | "evaluator" | "metrics" | "other"
+    /**
+     * Stable identity for the group within its kind. For testsets it's the
+     * testset slug; for evaluators it's the evaluator slug; for metrics
+     * (the cross-cutting "Metrics" group) it's null because metrics columns
+     * from multiple sources can coexist in the same group.
+     */
+    slug: string | null
+    /** Human-readable group label (e.g. "Application comp-1", "Exact Match"). */
+    label: string
+    /** Stable cache key for this group — useful for grouping in renderers. */
+    key: string
+    /** The step.references that drove the grouping (preserved for downstream code). */
+    refs: Record<string, {id?: string; slug?: string} | null | undefined> | null
+}
+
+export interface ResolvedColumn {
+    /** UI column name (display). */
+    name: string
+    /** UI column kind (display category, e.g. "testset"). */
+    kind: string
+    /** The step this column reads from. */
+    stepKey: string
+    /** The step's declared type — drives strategy choice. */
+    stepType: string
+    /** The path the strategy applied. */
+    path: string
+    /** The resolved value (undefined if no strategy produced one). */
+    value: unknown
+    /** Which strategy returned the value, or "missing". */
+    source: ResolveSource
+    /** Source-namespace for this column (testset/app/evaluator/metrics). */
+    group: ColumnGroup
+}
+
+// ============================================================================
+// Strategy contract
+// ============================================================================
+
+export interface ResolveContext<TScenario extends HydratableScenario> {
+    /** The step the current mapping references. */
+    step: RunStep
+    /**
+     * The result for this step within the current scenario, or undefined if
+     * the scenario has no result for this step (in-progress / failed run).
+     */
+    result: EvaluationResult | undefined
+    /** The hydrated row with all joined entities. */
+    row: HydratedScenarioRow<TScenario>
+    /** The mapping's `step.path` value. */
+    path: string
+}
+
+/**
+ * A resolver returns either `null` (this strategy can't resolve — try next)
+ * or a `{value, source}` object. Returning `{value: undefined, source: ...}`
+ * is allowed and lets a strategy explicitly say "I looked but the value
+ * isn't there"; the caller distinguishes that from "didn't try" (null).
+ */
+export type StepResolver<TScenario extends HydratableScenario = HydratableScenario> = (
+    ctx: ResolveContext<TScenario>,
+) => {value: unknown; source: ResolveSource} | null
+
+// ============================================================================
+// Built-in strategies
+// ============================================================================
+
+/**
+ * Read a value at a dot-path on an object, descending one key at a time.
+ * Returns undefined if any step is missing.
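+ *
+ * Illustrative sketch (the sample object is hypothetical):
+ *
+ * ```ts
+ * const testcase = {data: {country: "FR"}}
+ * getAtPath(testcase, "data.country") // "FR"
+ * getAtPath(testcase, "data.city.name") // undefined (missing segment)
+ * ```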
+ */
+export function getAtPath(obj: unknown, path: string): unknown {
+    if (obj === null || obj === undefined || !path) return undefined
+    const parts = path.split(".")
+    let cur: unknown = obj
+    for (const p of parts) {
+        if (cur === null || cur === undefined) return undefined
+        if (typeof cur !== "object") return undefined
+        cur = (cur as Record<string, unknown>)[p]
+    }
+    return cur
+}
+
+/**
+ * Try to find `path` somewhere inside a trace envelope. Handles every shape
+ * we've seen in the wild:
+ *   - `{spans: {<rootSpanName>: span}}`   — bulk /tracing/spans/query
+ *   - `{spans: [span, ...]}`              — array form (some endpoints)
+ *   - `{response: {tree: [...]}}`         — agenta-format wrapped response
+ *   - the envelope IS the span             — endpoint-stripped form
+ *   - `{count, traces: {<traceId>: trace}}` (TracesApiResponse envelope
+ *     written by `prefetchTracesByIds`; each inner trace is walked in turn)
+ *
+ * For each candidate span found, the path is walked first directly, then
+ * recursively through `spans` (record OR array), `children` (array), and
+ * `nodes` (array). Returns the FIRST non-undefined match (depth-first).
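+ *
+ * Illustrative sketch (the envelope below is hypothetical sample data):
+ *
+ * ```ts
+ * const trace = {spans: {root: {attributes: {ag: {data: {outputs: "hi"}}}}}}
+ * findInTrace(trace, "attributes.ag.data.outputs") // "hi"
+ * findInTrace(trace, "attributes.ag.data.inputs") // undefined (no span has it)
+ * ```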
+ */
+export function findInTrace(trace: unknown, path: string): unknown {
+    if (!trace || typeof trace !== "object") return undefined
+
+    // 1. Path might resolve directly on the envelope (rare but cheap to try).
+    const direct = getAtPath(trace, path)
+    if (direct !== undefined) return direct
+
+    const t = trace as Record<string, unknown>
+
+    // 2. {spans: {name: span, ...}} or {spans: [span, ...]}
+    const spans = t.spans
+    if (spans !== undefined) {
+        const v = walkSpanCollection(spans, path)
+        if (v !== undefined) return v
+    }
+
+    // 3. {count, traces: {[traceIdNoDashes]: traceData}} — the
+    //    TracesApiResponse envelope written by `prefetchTracesByIds`.
+    //    Drill into each inner trace and walk it as its own envelope.
+    //    Kept distinct from step 2 because the value shape under `traces`
+    //    is a full trace object (with its own `spans`/`response`/etc.),
+    //    not a span collection — so we recurse via findInTrace, not via
+    //    walkSpanCollection.
+    if (typeof t.count === "number" && t.traces && typeof t.traces === "object") {
+        for (const inner of Object.values(t.traces as Record<string, unknown>)) {
+            const v = findInTrace(inner, path)
+            if (v !== undefined) return v
+        }
+    }
+
+    // 4. {response: {tree: [...]}} (agenta format from single-trace endpoint)
+    const response = t.response
+    if (response && typeof response === "object") {
+        const tree = (response as Record<string, unknown>).tree
+        if (Array.isArray(tree)) {
+            for (const node of tree) {
+                const v = walkSpan(node, path)
+                if (v !== undefined) return v
+            }
+        }
+    }
+
+    // 5. Envelope itself might BE a span (already stripped).
+    const v = walkSpan(t, path)
+    if (v !== undefined) return v
+
+    return undefined
+}
+
+function walkSpanCollection(collection: unknown, path: string): unknown {
+    if (collection === null || collection === undefined) return undefined
+    if (Array.isArray(collection)) {
+        for (const c of collection) {
+            const v = walkSpan(c, path)
+            if (v !== undefined) return v
+        }
+        return undefined
+    }
+    if (typeof collection === "object") {
+        for (const k of Object.keys(collection as Record<string, unknown>)) {
+            const v = walkSpan((collection as Record<string, unknown>)[k], path)
+            if (v !== undefined) return v
+        }
+    }
+    return undefined
+}
+
+function walkSpan(span: unknown, path: string): unknown {
+    if (!span || typeof span !== "object") return undefined
+    const direct = getAtPath(span, path)
+    if (direct !== undefined) return direct
+    const obj = span as Record<string, unknown>
+    // Recurse into nested span containers
+    if (obj.spans !== undefined) {
+        const v = walkSpanCollection(obj.spans, path)
+        if (v !== undefined) return v
+    }
+    if (Array.isArray(obj.children)) {
+        for (const c of obj.children) {
+            const v = walkSpan(c, path)
+            if (v !== undefined) return v
+        }
+    }
+    if (Array.isArray(obj.nodes)) {
+        for (const c of obj.nodes) {
+            const v = walkSpan(c, path)
+            if (v !== undefined) return v
+        }
+    }
+    return undefined
+}
+
+/**
+ * Resolver for `input`-type steps. Reads from the joined testcase.
+ *
+ * Path is a dot-path on the testcase object (e.g. `data.country` →
+ * `testcase.data.country`). The hydrate transform already joined the testcase
+ * by `scenario.testcase_id ∪ result.testcase_id`, so we don't need to refetch.
+ */
+export const resolveFromTestcase: StepResolver = ({row, path}) => {
+    if (!row.testcase) return null
+    const value = getAtPath(row.testcase, path)
+    if (value === undefined) return null
+    return {value, source: "testcase"}
+}
+
+/**
+ * Resolver that walks a trace by trace_id from the step's result. Works for
+ * any step type whose data is span-resident (e.g. `invocation`, sometimes
+ * `annotation`).
+ */
+export const resolveFromTrace: StepResolver = ({result, row, path}) => {
+    if (!result?.trace_id) return null
+    const trace = row.traces[result.trace_id]
+    if (trace === undefined) return null
+    const value = findInTrace(trace, path)
+    if (value === undefined) return null
+    return {value, source: "trace"}
+}
+
+/**
+ * Resolver that reads from `metric.data[step.key][path]`.
+ *
+ * Metric.data is `{stepKey: {flatAttributePath: valueOrStatsObject}}`. The
+ * `flatAttributePath` IS the mapping's `step.path` as a SINGLE STRING KEY
+ * (not a dot-walk). That matches what the server emits — paths like
+ * `"attributes.ag.data.outputs.success"` are baked-in flat keys, not nested
+ * objects. Trying to dot-walk would fail.
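+ *
+ * Illustrative sketch (step key and value are hypothetical):
+ *
+ * ```ts
+ * const data = {"evaluator-1": {"attributes.ag.data.outputs.success": true}}
+ * const bucket = data["evaluator-1"]
+ * bucket["attributes.ag.data.outputs.success"] // true (flat key lookup)
+ * getAtPath(bucket, "attributes.ag.data.outputs.success") // undefined (dot-walk)
+ * ```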
+ */
+export const resolveFromMetric: StepResolver = ({step, row, path}) => {
+    for (const m of row.metrics) {
+        const data = m.data as Record<string, unknown> | undefined
+        if (!data) continue
+        const bucket = data[step.key] as Record<string, unknown> | undefined
+        if (bucket && bucket[path] !== undefined) {
+            return {value: bucket[path], source: "metric"}
+        }
+    }
+    return null
+}
+
+/**
+ * Compose strategies — try each in order, return the first non-null.
+ */
+export function composeResolvers(...resolvers: StepResolver[]): StepResolver {
+    return (ctx) => {
+        for (const r of resolvers) {
+            const out = r(ctx)
+            if (out !== null) return out
+        }
+        return null
+    }
+}
+
+// ============================================================================
+// Grouping — infer the namespace each column should display under
+// ============================================================================
+
+/**
+ * Title-case a slug for display. "exact-match" → "Exact Match".
+ * Best-effort — callers that have the actual entity name should use that.
+ */
+function slugToTitle(slug: string | null | undefined): string {
+    if (!slug) return ""
+    return slug
+        .split(/[-_]/)
+        .map((p) => (p.length === 0 ? p : p[0].toUpperCase() + p.slice(1)))
+        .join(" ")
+}
+
+/**
+ * Detect "Metrics" columns — paths under `attributes.ag.metrics.*`. These
+ * cross-cut the step-type grouping: an `attributes.ag.metrics.tokens...`
+ * column on an invocation step still belongs to the "Metrics" group, not
+ * "Application", because the UI surfaces them together.
+ */
+function isMetricsPath(path: string): boolean {
+    return /(^|\.)attributes\.ag\.metrics(\.|$)/.test(path)
+}
+
+/**
+ * Compute a column's ColumnGroup from its step + mapping path. Exported so
+ * consumers (the PoC, custom renderers) can run grouping standalone if they
+ * don't need the resolved values.
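+ *
+ * Illustrative sketch (the step and its references are hypothetical):
+ *
+ * ```ts
+ * const step: RunStep = {
+ *     key: "evaluator-1",
+ *     type: "annotation",
+ *     references: {evaluator: {id: "ev-1", slug: "exact-match"}},
+ * }
+ * computeColumnGroup(step, "attributes.ag.data.outputs.success")
+ * // kind "evaluator", label "Exact Match", key "evaluator:exact-match"
+ * computeColumnGroup(step, "attributes.ag.metrics.tokens")
+ * // kind "metrics", key "metrics" (a metrics path overrides the step type)
+ * ```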
+ */
+export function computeColumnGroup(step: RunStep | null, path: string): ColumnGroup {
+    const refs = step?.references ?? null
+
+    // Metrics paths override step-type — they go under "Metrics".
+    if (path && isMetricsPath(path)) {
+        return {
+            kind: "metrics",
+            slug: null,
+            label: "Metrics",
+            key: "metrics",
+            refs,
+        }
+    }
+
+    if (!step) {
+        return {kind: "other", slug: null, label: "(no step)", key: "other:none", refs: null}
+    }
+
+    switch (step.type) {
+        case "input": {
+            // Prefer the testset's slug (stable across revisions); fall back
+            // to testset_revision.slug. Only the group `key` falls back
+            // further, to step.key.
+            const slug = refs?.testset?.slug ?? refs?.testset_revision?.slug ?? null
+            return {
+                kind: "testset",
+                slug,
+                // The UI shows the testset's display name (e.g. "testset-large"). Without
+                // fetching the testset entity we don't have the name — fall back to slug.
+                // Renderers with access to the testset entity should override the label.
+                label: slug ? `Testset ${slug}` : "Testset",
+                key: `testset:${slug ?? step.key}`,
+                refs,
+            }
+        }
+
+        case "invocation": {
+            const appSlug = refs?.application?.slug ?? refs?.application_revision?.slug ?? null
+            return {
+                kind: "application",
+                slug: appSlug,
+                label: appSlug ? `Application ${appSlug}` : "Application",
+                key: `application:${appSlug ?? step.key}`,
+                refs,
+            }
+        }
+
+        case "annotation": {
+            // Each evaluator step gets its own group — that's the whole point.
+            // Two evaluators emitting the same column name (e.g. "success")
+            // remain disambiguated as long as their evaluator slugs differ.
+            const evaluatorSlug = refs?.evaluator?.slug ?? refs?.evaluator_revision?.slug ?? null
+            return {
+                kind: "evaluator",
+                slug: evaluatorSlug,
+                label: evaluatorSlug ? slugToTitle(evaluatorSlug) : "Evaluator",
+                key: `evaluator:${evaluatorSlug ?? step.key}`,
+                refs,
+            }
+        }
+
+        default:
+            return {
+                kind: "other",
+                slug: null,
+                label: `(${step.type})`,
+                key: `other:${step.type}`,
+                refs,
+            }
+    }
+}
+
+// ============================================================================
+// Default registry — keyed by step.type
+//
+// Add a new step type by passing `customResolvers` to `resolveMappings`.
+// Do NOT edit the entries below to handle new shapes; extend instead.
+// ============================================================================
+
+export const DEFAULT_STEP_RESOLVERS: Record<string, StepResolver> = {
+    /**
+     * Input steps carry testcase references on their result. Mappings point
+     * at testcase fields via `data.<column>`.
+     */
+    input: resolveFromTestcase,
+
+    /**
+     * Invocation steps (app calls) have a trace per result. Mappings point at
+     * `attributes.ag.data.*` paths on the trace's spans.
+     */
+    invocation: resolveFromTrace,
+
+    /**
+     * Annotation steps (evaluator results) are dual-source: metrics carry the
+     * pre-aggregated value keyed by step.key + flat path, AND there's an
+     * annotation trace at result.trace_id. Try metric first (cheaper) then
+     * fall back to trace.
+     */
+    annotation: composeResolvers(resolveFromMetric, resolveFromTrace),
+}
+
+// ============================================================================
+// Public entry point
+// ============================================================================
+
+export interface ResolveMappingsOptions {
+    /**
+     * Override or extend the per-step-type resolver registry. Pass a partial
+     * record — keys not provided fall through to `DEFAULT_STEP_RESOLVERS`.
+     *
+     * Use this to support custom step types or override behaviour for known
+     * ones (e.g. force `annotation` to skip the metric lookup).
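+     *
+     * Illustrative sketch, assuming `row` and `schema` are already in scope:
+     *
+     * ```ts
+     * // Annotation columns read only from the trace, never from metrics.
+     * resolveMappings(row, schema, {customResolvers: {annotation: resolveFromTrace}})
+     * ```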
+     */
+    customResolvers?: Record<string, StepResolver>
+    /**
+     * Fallback resolver invoked when no per-type strategy is registered.
+     * When omitted, such columns are reported as missing (`value: undefined`).
+     * Override to e.g. inspect `result.data` directly.
+     */
+    fallbackResolver?: StepResolver
+}
+
+/**
+ * Resolve all UI columns for a single hydrated row, per the run's mappings.
+ *
+ * Inputs:
+ *   - `row`        the joined entities (scenario + results + metrics + testcase + traces)
+ *   - `schema`     run.data.steps + run.data.mappings (the materialization spec)
+ *   - `options`    optional custom resolvers
+ *
+ * Output: one `ResolvedColumn` per mapping, in the original order. Columns
+ * that couldn't be resolved have `value: undefined` and a `source` that
+ * starts with `"missing"`.
+ */
+export function resolveMappings<TScenario extends HydratableScenario>(
+    row: HydratedScenarioRow<TScenario>,
+    schema: RunSchema,
+    options: ResolveMappingsOptions = {},
+): ResolvedColumn[] {
+    const resolvers: Record<string, StepResolver> = {
+        ...DEFAULT_STEP_RESOLVERS,
+        ...(options.customResolvers ?? {}),
+    }
+
+    const stepByKey = new Map<string, RunStep>()
+    for (const s of schema.steps) stepByKey.set(s.key, s)
+
+    return schema.mappings.map((m) => {
+        const kind = m.column?.kind ?? "?"
+        const name = m.column?.name ?? "?"
+        const stepKey = m.step?.key ?? ""
+        const path = m.step?.path ?? ""
+        const step = stepByKey.get(stepKey) ?? null
+        const group = computeColumnGroup(step, path)
+
+        if (!step) {
+            return {
+                name,
+                kind,
+                stepKey,
+                stepType: "?",
+                path,
+                value: undefined,
+                source: "missing",
+                group,
+            }
+        }
+
+        const result = (row.results as EvaluationResult[]).find((r) => r.step_key === stepKey)
+        const resolver = resolvers[step.type] ?? options.fallbackResolver ?? null
+
+        if (!resolver) {
+            return {
+                name,
+                kind,
+                stepKey,
+                stepType: step.type,
+                path,
+                value: undefined,
+                source: `missing (no resolver for step.type="${step.type}")`,
+                group,
+            }
+        }
+
+        const out = resolver({
+            step,
+            result,
+            row: row as HydratedScenarioRow<HydratableScenario>,
+            path,
+        })
+        if (out === null) {
+            return {
+                name,
+                kind,
+                stepKey,
+                stepType: step.type,
+                path,
+                value: undefined,
+                source: "missing",
+                group,
+            }
+        }
+        return {
+            name,
+            kind,
+            stepKey,
+            stepType: step.type,
+            path,
+            value: out.value,
+            source: out.source,
+            group,
+        }
+    })
+}
+
+/** A `ColumnGroup` together with the resolved columns that belong to it. */
+export interface ResolvedColumnGroup {
+    group: ColumnGroup
+    columns: ResolvedColumn[]
+}
+
+/**
+ * Group resolved columns by their `group.key`, preserving the ORIGINAL
+ * mapping order within each group. Use this when rendering UI that mirrors
+ * the screenshot's grouped-header layout.
+ *
+ * Group ordering: testset groups first, then application groups, then
+ * evaluator groups, then metrics, then other. Within a kind, groups appear
+ * in the order their columns first appear in the mapping list.
+ */
+export function groupResolvedColumns(columns: ResolvedColumn[]): ResolvedColumnGroup[] {
+    const groupsByKey = new Map<string, ResolvedColumnGroup>()
+    const firstAppearance = new Map<string, number>()
+
+    columns.forEach((col, idx) => {
+        const existing = groupsByKey.get(col.group.key)
+        if (existing) {
+            existing.columns.push(col)
+        } else {
+            groupsByKey.set(col.group.key, {group: col.group, columns: [col]})
+            firstAppearance.set(col.group.key, idx)
+        }
+    })
+
+    // Kind ordering matches the UI's left-to-right layout in the screenshot.
+    const kindOrder: ColumnGroup["kind"][] = [
+        "testset",
+        "application",
+        "evaluator",
+        "metrics",
+        "other",
+    ]
+    const kindRank = (k: ColumnGroup["kind"]) => {
+        const idx = kindOrder.indexOf(k)
+        return idx === -1 ? kindOrder.length : idx
+    }
+
+    return Array.from(groupsByKey.values()).sort((a, b) => {
+        const kindCmp = kindRank(a.group.kind) - kindRank(b.group.kind)
+        if (kindCmp !== 0) return kindCmp
+        // Within a kind, preserve first-appearance order
+        return (firstAppearance.get(a.group.key) ?? 0) - (firstAppearance.get(b.group.key) ?? 0)
+    })
+}
+
+// ============================================================================
+// Pre-resolution column grouping — group raw mappings by source.
+//
+// `groupResolvedColumns` above groups columns AFTER a row's values are
+// resolved. `groupRunColumns` works directly off the run schema
+// (steps + mappings), so the UI can build column headers before any
+// scenario data is hydrated.
+// ============================================================================
+
+/** A single UI column leaf, before value resolution. */
+export interface RunColumnLeaf {
+    /** Column display name (from `mapping.column.name`). */
+    name: string
+    /** Source category — testset / application / evaluator / metrics / other. */
+    kind: ColumnGroup["kind"]
+    /**
+     * The owning group's slug. Null for metrics and "other" groups, and for
+     * any group whose step references carry no slug.
+     */
+    groupSlug: string | null
+}
+
+/** A group of UI columns sharing a `ColumnGroup` — one nested header. */
+export interface RunColumnGroup {
+    group: ColumnGroup
+    columns: RunColumnLeaf[]
+}
+
+/**
+ * Group a run's raw column mappings by source — testset / application /
+ * evaluator(s) / metrics / other.
+ *
+ * "other"-kind columns (steps with an unrecognised type, or mappings with
+ * no resolvable step) are **included**. They are real columns the
+ * backend-metadata column path also surfaces — dropping them would
+ * silently shrink the visible column set.
+ *
+ * Internal dedup keys (column names containing `_dedup_id`, e.g.
+ * `testcase_dedup_id`) are **excluded** — they are not user-facing
+ * columns. The backend-metadata column path drops them too.
+ *
+ * Group order: testset → application → evaluator(s) → metrics → other.
+ * Within a kind, groups appear in the order their columns first appear in
+ * the mapping list (matching `groupResolvedColumns`).
+ */
+export function groupRunColumns(steps: RunStep[], mappings: RunMapping[]): RunColumnGroup[] {
+    const stepByKey = new Map<string, RunStep>()
+    for (const s of steps) stepByKey.set(s.key, s)
+
+    const byKey = new Map<string, RunColumnGroup>()
+    const firstAppearance = new Map<string, number>()
+
+    mappings.forEach((mapping, idx) => {
+        const columnName = mapping.column?.name
+        if (typeof columnName !== "string" || !columnName) return
+        // Internal dedup keys are not user-facing columns.
+        if (columnName.includes("_dedup_id")) return
+        const step = mapping.step?.key ? (stepByKey.get(mapping.step.key) ?? null) : null
+        const path = mapping.step?.path ?? ""
+        const group = computeColumnGroup(step, path)
+
+        let slot = byKey.get(group.key)
+        if (!slot) {
+            slot = {group, columns: []}
+            byKey.set(group.key, slot)
+            firstAppearance.set(group.key, idx)
+        }
+        slot.columns.push({name: columnName, kind: group.kind, groupSlug: group.slug})
+    })
+
+    const kindOrder: Record<ColumnGroup["kind"], number> = {
+        testset: 0,
+        application: 1,
+        evaluator: 2,
+        metrics: 3,
+        other: 4,
+    }
+    return Array.from(byKey.values()).sort((a, b) => {
+        const k = kindOrder[a.group.kind] - kindOrder[b.group.kind]
+        if (k !== 0) return k
+        return (firstAppearance.get(a.group.key) ?? 0) - (firstAppearance.get(b.group.key) ?? 0)
+    })
+}
diff --git a/web/packages/agenta-entities/src/evaluationRun/etl/rowPredicateFilter.ts b/web/packages/agenta-entities/src/evaluationRun/etl/rowPredicateFilter.ts
new file mode 100644
index 0000000000..e36ddebed1
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/etl/rowPredicateFilter.ts
@@ -0,0 +1,324 @@
+/**
+ * Post-hydrate predicate filter — drops materialized rows that don't match
+ * a value-equality predicate against a resolved UI column.
+ *
+ * # Where this fits
+ *
+ * The hydrate transform joins each scenario to its testcase, results,
+ * metrics, and traces. THEN this filter runs. It works on already-joined
+ * data, so it can predicate on values that don't exist on the scenario
+ * itself (e.g. an evaluator's `success` output, a testset column,
+ * `attributes.ag.metrics.tokens.cumulative.total`).
+ *
+ * The pipeline shape:
+ *
+ *   source → [cheap scenario-level filter] → hydrate → predicateFilter → sink
+ *
+ * The cheap filter goes first to avoid wasted hydration; this one comes
+ * after because it needs the joined data.
+ *
+ * # Why this isn't server-side
+ *
+ * `/evaluations/scenarios/query` and `/evaluations/metrics/query` don't
+ * currently accept arbitrary filtering — only ID lookups and run scope.
+ * Filtering on annotation values (e.g. "evaluator output success == false")
+ * therefore requires the hydrate join to materialize the value first.
+ *
+ * For a long-tail scan this is wasteful — we hydrate every scenario then
+ * drop most of them. The eval-filtering RFC's "F1 skip-ahead" optimization
+ * would let the server emit a sparse cursor stream that's already filtered,
+ * but that's a server-side change not in today's API.
+ *
+ * # Stats-blob unwrap
+ *
+ * Some columns resolve to a stats blob (e.g. metric.data carries
+ * `{type: "binary", freq: [{value: false, density: 1}]}` instead of a
+ * literal `false`). This filter unwraps known stat shapes before comparing
+ * so the caller writes `value: false` and gets the natural result.
+ *
+ * @packageDocumentation
+ */
+
+import type {Chunk, Transform} from "../../etl/core/types"
+
+import type {HydratedScenarioRow, HydratableScenario} from "./hydrateScenariosTransform"
+import {
+    resolveMappings,
+    type ColumnGroup,
+    type ResolvedColumn,
+    type RunSchema,
+} from "./resolveMappings"
+
+/**
+ * One value-comparison clause against a single resolved column.
+ *
+ * Targeting rules:
+ *   - `groupKind` always required — "testset", "application", "evaluator", "metrics", "other"
+ *   - `groupSlug` optional — when set, narrows to a specific group instance
+ *     (e.g. evaluator slug "exact-match"). If null/undefined, matches the
+ *     first column whose name/kind match regardless of group instance.
+ *   - `columnName` required — the column's display name (e.g. "success").
+ *
+ * Comparison rules:
+ *   - "eq"/"ne" — strict-equality on the (unwrapped) value
+ *   - "in"/"nin" — membership against an array
+ *   - "lt"/"lte"/"gt"/"gte" — numeric comparison after unwrap
+ */
+export interface RowPredicate {
+    groupKind: ColumnGroup["kind"]
+    groupSlug?: string | null
+    columnName: string
+    op: "eq" | "ne" | "in" | "nin" | "lt" | "lte" | "gt" | "gte"
+    value: unknown
+}
+
+/**
+ * Unwrap known stats-blob shapes to their dominant value, so callers can
+ * write `value: false` against an annotation column that resolves through
+ * the metric layer as `{type: "binary", freq: [{value: false, density: 1}]}`.
+ *
+ * Cases handled:
+ *   - `{type: "binary", freq: [...]}` → `value` of the entry with the highest
+ *     `density` (falling back to `count`); `undefined` if `freq` is empty or missing
+ *   - `{type: "numeric/continuous"}` or `{type: "numeric"}` → `mean`, else `sum`, else `count`
+ *   - everything else passes through unchanged
+ */
+export function unwrapStatsForCompare(v: unknown): unknown {
+    if (v === null || typeof v !== "object") return v
+    const t = (v as {type?: string}).type
+    if (t === "binary") {
+        const freq = (v as {freq?: {value: unknown; density?: number; count?: number}[]}).freq
+        if (Array.isArray(freq) && freq.length > 0) {
+            // Take the entry with highest density (or count if density absent)
+            const sorted = [...freq].sort((a, b) => {
+                const ad = a.density ?? a.count ?? 0
+                const bd = b.density ?? b.count ?? 0
+                return bd - ad
+            })
+            return sorted[0]?.value
+        }
+        return undefined
+    }
+    if (t === "numeric/continuous" || t === "numeric") {
+        const obj = v as {mean?: number; sum?: number; count?: number}
+        return obj.mean ?? obj.sum ?? obj.count
+    }
+    return v
+}
+
+function compare(actual: unknown, op: RowPredicate["op"], expected: unknown): boolean {
+    switch (op) {
+        case "eq":
+            return actual === expected
+        case "ne":
+            return actual !== expected
+        case "in":
+            return Array.isArray(expected) && expected.includes(actual)
+        case "nin":
+            return Array.isArray(expected) && !expected.includes(actual)
+        case "lt":
+            return typeof actual === "number" && typeof expected === "number" && actual < expected
+        case "lte":
+            return typeof actual === "number" && typeof expected === "number" && actual <= expected
+        case "gt":
+            return typeof actual === "number" && typeof expected === "number" && actual > expected
+        case "gte":
+            return typeof actual === "number" && typeof expected === "number" && actual >= expected
+    }
+}
+
+export interface PredicateFilterOptions {
+    /**
+     * One or more predicates, AND-joined. Pass a single object for the
+     * common case. All must match for the row to pass.
+     */
+    predicates: RowPredicate | RowPredicate[]
+    /** Run schema (steps + mappings), used to resolve columns per row. */
+    schema: RunSchema
+    /**
+     * Optional callback for filter telemetry. Called once per predicate
+     * per chunk; `scanned` and `matched` are the chunk-wide in/out counts
+     * (after ALL predicates), not counts for `droppedPredicate` alone.
+     */
+    onChunkFiltered?: (info: {
+        chunk: number
+        scanned: number
+        matched: number
+        droppedPredicate: RowPredicate
+    }) => void
+}
+
+/**
+ * Build a `Transform<HydratedScenarioRow, HydratedScenarioRow>` that keeps
+ * only rows satisfying every supplied predicate (logical AND).
+ *
+ * Holds no per-row state; the telemetry chunk counter keeps counting if the transform is reused.
+ */
+export function makeRowPredicateFilter<TScenario extends HydratableScenario>(
+    options: PredicateFilterOptions,
+): Transform<HydratedScenarioRow<TScenario>, HydratedScenarioRow<TScenario>> {
+    const predicates = Array.isArray(options.predicates) ? options.predicates : [options.predicates]
+    const schema = options.schema
+    let chunkIdx = 0
+
+    return async (chunk: Chunk<HydratedScenarioRow<TScenario>>) => {
+        chunkIdx++
+        const passing = chunk.items.filter((row) => {
+            const cols = resolveMappings(row, schema)
+            for (const p of predicates) {
+                const target = cols.find((c) => {
+                    if (c.group.kind !== p.groupKind) return false
+                    if (p.groupSlug !== undefined && p.groupSlug !== null) {
+                        if (c.group.slug !== p.groupSlug) return false
+                    }
+                    return c.name === p.columnName
+                })
+                if (!target) return false // missing column → fail predicate
+                const unwrapped = unwrapStatsForCompare(target.value)
+                if (!compare(unwrapped, p.op, p.value)) return false
+            }
+            return true
+        })
+
+        if (options.onChunkFiltered) {
+            for (const p of predicates) {
+                options.onChunkFiltered({
+                    chunk: chunkIdx,
+                    scanned: chunk.items.length,
+                    matched: passing.length,
+                    droppedPredicate: p,
+                })
+            }
+        }
+
+        return {...chunk, items: passing}
+    }
+}
+
+// ============================================================================
+// Multi-predicate AND/OR composition (Phase 2 / T4 — decision D8)
+//
+// `RowPredicate` above is a single clause. A `PredicateGroup` joins several
+// clauses with ONE boolean operator. v1 is intentionally FLAT — `conditions`
+// are leaf predicates, not nested groups. That covers the real cases
+// ("score > 0.8 AND exact_match == true", "country in ['US','CA']") without
+// an arbitrary-tree filter UI. Nested groups can come later if needed.
+// ============================================================================
+
+/**
+ * A flat group of predicates joined by a single boolean operator.
+ *
+ * One nesting level only: `conditions` are leaf `RowPredicate`s.
+ *
+ * An **empty** group (`conditions: []`) is treated as *no constraint* —
+ * every row passes, regardless of `op`. An empty filter shows all rows.
+ */
+export interface PredicateGroup {
+    op: "and" | "or"
+    conditions: RowPredicate[]
+}
+
+/** A filter is either a single predicate or a flat AND/OR group. */
+export type RowFilter = RowPredicate | PredicateGroup
+
+/** Narrow a `RowFilter` to a `PredicateGroup`. */
+export function isPredicateGroup(filter: RowFilter): filter is PredicateGroup {
+    return Array.isArray((filter as PredicateGroup).conditions)
+}
+
+/** Find the resolved column a predicate targets (by kind + optional slug + name). */
+function findTargetColumn(
+    cols: ResolvedColumn[],
+    predicate: RowPredicate,
+): ResolvedColumn | undefined {
+    return cols.find((c) => {
+        if (c.group.kind !== predicate.groupKind) return false
+        if (predicate.groupSlug != null && c.group.slug !== predicate.groupSlug) return false
+        return c.name === predicate.columnName
+    })
+}
+
+/**
+ * Evaluate one predicate against a row's already-resolved columns.
+ *
+ * A column the predicate references but the schema doesn't surface, or one
+ * that resolved to no value, compares against `undefined`, so `eq`/`lt`/… fail and
+ * `ne`/`nin` pass. (`makeRowPredicateFilter` instead fails any missing column.)
+ *
+ * This is *pure*: it does not know about hydration state. "Keep a row
+ * visible until its slices are hydrated" is a wiring-layer concern — the
+ * caller decides that before calling this.
+ */
+export function evaluateRowPredicate(predicate: RowPredicate, cols: ResolvedColumn[]): boolean {
+    const target = findTargetColumn(cols, predicate)
+    const actual = unwrapStatsForCompare(target?.value)
+    return compare(actual, predicate.op, predicate.value)
+}
+
+/**
+ * Evaluate a flat AND/OR group against a row's resolved columns.
+ *
+ * - `op: "and"` — every condition must match.
+ * - `op: "or"`  — at least one condition must match.
+ * - empty `conditions` — no constraint, the row passes.
+ */
+export function evaluatePredicateGroup(group: PredicateGroup, cols: ResolvedColumn[]): boolean {
+    if (group.conditions.length === 0) return true
+    return group.op === "and"
+        ? group.conditions.every((p) => evaluateRowPredicate(p, cols))
+        : group.conditions.some((p) => evaluateRowPredicate(p, cols))
+}
+
+/** Evaluate any `RowFilter` (single predicate or AND/OR group) against resolved columns. */
+export function evaluateRowFilter(filter: RowFilter, cols: ResolvedColumn[]): boolean {
+    return isPredicateGroup(filter)
+        ? evaluatePredicateGroup(filter, cols)
+        : evaluateRowPredicate(filter, cols)
+}
+
+/**
+ * Convenience: resolve a hydrated row's columns from the run schema, then
+ * evaluate the filter. This is the row-level entry point the scenario
+ * table's `filteredRows` derivation uses.
+ */
+export function matchesRowFilter<TScenario extends HydratableScenario>(
+    filter: RowFilter,
+    schema: RunSchema,
+    row: HydratedScenarioRow<TScenario>,
+): boolean {
+    return evaluateRowFilter(filter, resolveMappings(row, schema))
+}
+
+export interface PredicateGroupFilterOptions {
+    /** The active filter — a single predicate or a flat AND/OR group. */
+    filter: RowFilter
+    /** Run schema (steps + mappings), used to resolve columns per row. */
+    schema: RunSchema
+    /** Optional per-chunk telemetry callback. */
+    onChunkFiltered?: (info: {chunk: number; scanned: number; matched: number}) => void
+}
+
+/**
+ * Build a `Transform` that keeps only rows satisfying a `RowFilter`
+ * (single predicate or AND/OR group). The ETL-pipeline counterpart of the
+ * row-level `matchesRowFilter` — use this for headless / chunked runs.
+ */
+export function makePredicateGroupFilter<TScenario extends HydratableScenario>(
+    options: PredicateGroupFilterOptions,
+): Transform<HydratedScenarioRow<TScenario>, HydratedScenarioRow<TScenario>> {
+    const {filter, schema} = options
+    let chunkIdx = 0
+
+    return async (chunk: Chunk<HydratedScenarioRow<TScenario>>) => {
+        chunkIdx++
+        const passing = chunk.items.filter((row) =>
+            evaluateRowFilter(filter, resolveMappings(row, schema)),
+        )
+        options.onChunkFiltered?.({
+            chunk: chunkIdx,
+            scanned: chunk.items.length,
+            matched: passing.length,
+        })
+        return {...chunk, items: passing}
+    }
+}
diff --git a/web/packages/agenta-entities/src/evaluationRun/index.ts b/web/packages/agenta-entities/src/evaluationRun/index.ts
index 3f9bf9844c..44a38d964e 100644
--- a/web/packages/agenta-entities/src/evaluationRun/index.ts
+++ b/web/packages/agenta-entities/src/evaluationRun/index.ts
@@ -32,6 +32,21 @@ export {
     type AnnotationColumnDef as EvaluationRunAnnotationColumnDef,
 } from "./state/molecule"
 
+// Per-scenario read-only molecules (cache-aware bulk prefetch).
+// Used by ETL hydrate + downstream cell renderers.
+export {
+    evaluationResultMolecule,
+    type EvaluationResultMolecule,
+    type PrefetchResultsArgs,
+    type PrefetchResultsOutcome,
+} from "./state/resultMolecule"
+export {
+    evaluationMetricMolecule,
+    type EvaluationMetricMolecule,
+    type PrefetchMetricsArgs,
+    type PrefetchMetricsOutcome,
+} from "./state/metricMolecule"
+
 // ============================================================================
 // SCHEMAS & TYPES
 // ============================================================================
@@ -66,6 +81,8 @@ export {
     type EvaluationResult,
     evaluationResultsResponseSchema,
     type EvaluationResultsResponse,
+    // Evaluation Metrics
+    type EvaluationMetric,
     // Param types
     type EvaluationRunDetailParams,
     type EvaluationRunQueryParams,
diff --git a/web/packages/agenta-entities/src/evaluationRun/state/__tests__/molecules.leak.test.ts b/web/packages/agenta-entities/src/evaluationRun/state/__tests__/molecules.leak.test.ts
new file mode 100644
index 0000000000..37f7f2ce3c
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/state/__tests__/molecules.leak.test.ts
@@ -0,0 +1,312 @@
+/**
+ * Leak detection for molecule prefetch actions.
+ *
+ * The pure-engine leak test (etl/__tests__/runLoop.leak.test.ts) covers the
+ * runtime — Source/Transform/Sink with synthetic data. It does NOT cover the
+ * entity-cache layer we wired in (result/metric/testcase/trace prefetch
+ * actions backed by TanStack Query).
+ *
+ * Three risks to test here, numbered to match the sections below:
+ *
+ *   1. **Cache write-back doesn't compound.** When the same scenarios are
+ *      re-prefetched (100% hit), the cache size MUST stay the same.
+ *
+ *   2. **Unbounded cache growth across runs.** Each call to
+ *      `prefetchByScenarioIds` adds a TanStack entry per scenario. Without
+ *      explicit eviction, entries persist for the process lifetime. We
+ *      verify that `evictByRunId` returns the cache to baseline size.
+ *
+ *   3. **Heap growth over many runs.** Heap must stabilize when we cycle
+ *      through fresh runs with eviction between each.
+ *
+ * Run via: pnpm test:etl:longrun  (slow; needs --expose-gc to be reliable).
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+// QueryClient is re-exported from @tanstack/react-query (a workspace peer
+// dep). The bare @tanstack/query-core also exposes it but doesn't resolve
+// under `node --import tsx --test` (the script used by test:etl:longrun).
+import {QueryClient} from "@tanstack/react-query"
+import {getDefaultStore} from "jotai/vanilla"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import type {EvaluationMetric, EvaluationResult} from "../../core"
+import {inspectCache} from "../../etl/cacheDiagnostics"
+import {evaluationMetricMolecule} from "../metricMolecule"
+import {evaluationResultMolecule} from "../resultMolecule"
+
+const hasGc = typeof (globalThis as {gc?: () => void}).gc === "function"
+const forceGc = () => (globalThis as {gc?: () => void}).gc?.()
+
+const store = getDefaultStore()
+
+function installQc(): QueryClient {
+    const qc = new QueryClient({
+        defaultOptions: {queries: {retry: false, gcTime: Infinity, staleTime: Infinity}},
+    })
+    store.set(queryClientAtom, qc)
+    return qc
+}
+
+function regressionSlope(samples: number[]): number {
+    if (samples.length < 2) return 0
+    const n = samples.length
+    const xs = samples.map((_, i) => i)
+    const meanX = xs.reduce((a, b) => a + b, 0) / n
+    const meanY = samples.reduce((a, b) => a + b, 0) / n
+    const num = xs.reduce((acc, x, i) => acc + (x - meanX) * (samples[i] - meanY), 0)
+    const den = xs.reduce((acc, x) => acc + (x - meanX) ** 2, 0)
+    return den === 0 ? 0 : num / den
+}
+
+// =============================================================================
+// Risk 1: rerunning the same prefetch does NOT grow cache
+// =============================================================================
+
+describe("Leak: repeated prefetch of same scenarios doesn't grow cache", () => {
+    it("100 reruns with the SAME scenario IDs → cache size constant", async () => {
+        const qc = installQc()
+        // Pre-populate the cache so the prefetches go full-hit. No api stubs
+        // needed — we're testing the cache-read path, not the network path.
+        for (let i = 0; i < 100; i++) {
+            qc.setQueryData(["evaluation-results", "p1", "run1", `s${i}`], [
+                {run_id: "run1", scenario_id: `s${i}`, step_key: "step-a", status: "ok"},
+            ] as EvaluationResult[])
+            qc.setQueryData(["evaluation-metrics", "p1", "run1", `s${i}`], [
+                {id: `m${i}`, run_id: "run1", scenario_id: `s${i}`, status: "ok"},
+            ] as unknown as EvaluationMetric[])
+        }
+        const scenarioIds = Array.from({length: 100}, (_, i) => `s${i}`)
+
+        const baseline = inspectCache({prefixes: ["evaluation-results", "evaluation-metrics"]})
+
+        for (let i = 0; i < 100; i++) {
+            await evaluationResultMolecule.actions.prefetchByScenarioIds({
+                projectId: "p1",
+                runId: "run1",
+                scenarioIds,
+            })
+            await evaluationMetricMolecule.actions.prefetchByScenarioIds({
+                projectId: "p1",
+                runId: "run1",
+                scenarioIds,
+            })
+        }
+
+        const after = inspectCache({prefixes: ["evaluation-results", "evaluation-metrics"]})
+
+        assert.equal(after.totalEntries, baseline.totalEntries, "rerun must not add entries")
+        assert.equal(
+            after.totalApproxBytes,
+            baseline.totalApproxBytes,
+            "rerun must not change byte size",
+        )
+    })
+})
+
+// =============================================================================
+// Risk 2: evictByRunId returns the cache to baseline
+// =============================================================================
+
+describe("Leak: evictByRunId fully releases run-scoped cache entries", () => {
+    it("populate → evict → cache is baseline-empty", () => {
+        const qc = installQc()
+        for (let i = 0; i < 200; i++) {
+            qc.setQueryData(["evaluation-results", "p1", "run1", `s${i}`], [
+                {run_id: "run1", scenario_id: `s${i}`, step_key: "x", status: "ok"},
+            ] as EvaluationResult[])
+            qc.setQueryData(["evaluation-metrics", "p1", "run1", `s${i}`], [
+                {id: `m${i}`, run_id: "run1", scenario_id: `s${i}`, status: "ok"},
+            ] as unknown as EvaluationMetric[])
+        }
+
+        const before = inspectCache({prefixes: ["evaluation-results", "evaluation-metrics"]})
+        assert.equal(before.totalEntries, 400)
+
+        const removedResults = evaluationResultMolecule.actions.evictByRunId({
+            projectId: "p1",
+            runId: "run1",
+        })
+        const removedMetrics = evaluationMetricMolecule.actions.evictByRunId({
+            projectId: "p1",
+            runId: "run1",
+        })
+
+        assert.equal(removedResults, 200)
+        assert.equal(removedMetrics, 200)
+
+        const after = inspectCache({prefixes: ["evaluation-results", "evaluation-metrics"]})
+        assert.equal(after.totalEntries, 0, "evict must clear everything for the run")
+    })
+
+    it("evictByRunId is run-scoped — other runs untouched", () => {
+        const qc = installQc()
+        // Two runs in the same project
+        qc.setQueryData(
+            ["evaluation-results", "p1", "runA", "s1"],
+            [{run_id: "runA"} as EvaluationResult],
+        )
+        qc.setQueryData(
+            ["evaluation-results", "p1", "runB", "s1"],
+            [{run_id: "runB"} as EvaluationResult],
+        )
+
+        const removed = evaluationResultMolecule.actions.evictByRunId({
+            projectId: "p1",
+            runId: "runA",
+        })
+        assert.equal(removed, 1)
+
+        // runB still cached
+        const runB = evaluationResultMolecule.get.byScenario({
+            projectId: "p1",
+            runId: "runB",
+            scenarioId: "s1",
+        })
+        assert.ok(runB, "runB cache survives runA eviction")
+        const runA = evaluationResultMolecule.get.byScenario({
+            projectId: "p1",
+            runId: "runA",
+            scenarioId: "s1",
+        })
+        assert.equal(runA, null, "runA cache cleared")
+    })
+})
+
+// =============================================================================
+// Risk 3: long-run iterations with eviction → heap stable
+// =============================================================================
+
+describe("Leak: 100 fresh-run iterations with evict-between → heap slope ~zero", () => {
+    it(
+        "heap should not grow linearly when caller dutifully evicts after each run",
+        {timeout: 60_000, skip: !hasGc ? "needs --expose-gc" : false},
+        async () => {
+            installQc()
+            const ITERATIONS = 100
+            const SCENARIOS_PER_RUN = 50
+            const WARMUP = 10
+            const SAMPLE_INTERVAL = 10
+
+            const samples: number[] = []
+
+            for (let iter = 0; iter < ITERATIONS; iter++) {
+                const runId = `run-${iter}`
+                const scenarioIds = Array.from(
+                    {length: SCENARIOS_PER_RUN},
+                    (_, i) => `s-${iter}-${i}`,
+                )
+                // Seed the cache directly (no network) — simulates the
+                // prefetch action writing back after fetching misses.
+                const qc = store.get(queryClientAtom)
+                for (const sid of scenarioIds) {
+                    qc.setQueryData(["evaluation-results", "p1", runId, sid], [
+                        {run_id: runId, scenario_id: sid, step_key: "x", status: "ok"},
+                    ] as EvaluationResult[])
+                    qc.setQueryData(["evaluation-metrics", "p1", runId, sid], [
+                        {id: sid, run_id: runId, scenario_id: sid, status: "ok"},
+                    ] as unknown as EvaluationMetric[])
+                }
+
+                // Read everything back via the molecule (exercises the cache-hit path)
+                await evaluationResultMolecule.actions.prefetchByScenarioIds({
+                    projectId: "p1",
+                    runId,
+                    scenarioIds,
+                })
+                await evaluationMetricMolecule.actions.prefetchByScenarioIds({
+                    projectId: "p1",
+                    runId,
+                    scenarioIds,
+                })
+
+                // Evict, mimicking what a well-behaved ETL caller would do
+                evaluationResultMolecule.actions.evictByRunId({projectId: "p1", runId})
+                evaluationMetricMolecule.actions.evictByRunId({projectId: "p1", runId})
+
+                if (iter >= WARMUP && iter % SAMPLE_INTERVAL === 0) {
+                    forceGc()
+                    samples.push(process.memoryUsage().heapUsed)
+                }
+            }
+
+            assert.ok(samples.length >= 5, `expected ≥5 samples, got ${samples.length}`)
+            const slopeBytesPerSample = regressionSlope(samples)
+            const slopeBytesPerIter = slopeBytesPerSample / SAMPLE_INTERVAL
+
+            // Budget: 100 KB per iteration. A real leak (e.g. holding all
+            // scenarios in heap across iterations) would be MB-scale.
+            const BUDGET_KB_PER_ITER = 100
+
+            console.log(
+                `\n  samples (MB): [${samples.map((s) => (s / 1024 / 1024).toFixed(1)).join(", ")}]`,
+            )
+            console.log(
+                `  slope: ${(slopeBytesPerIter / 1024).toFixed(2)} KB/iter (budget ${BUDGET_KB_PER_ITER} KB/iter)`,
+            )
+
+            assert.ok(
+                slopeBytesPerIter < BUDGET_KB_PER_ITER * 1024,
+                `heap grows ${(slopeBytesPerIter / 1024).toFixed(1)} KB/iter (budget ${BUDGET_KB_PER_ITER} KB). Eviction not releasing memory.`,
+            )
+        },
+    )
+
+    it(
+        "WITHOUT eviction: heap DOES grow (sanity check — proves eviction is load-bearing)",
+        {timeout: 60_000, skip: !hasGc ? "needs --expose-gc" : false},
+        async () => {
+            installQc()
+            const ITERATIONS = 50
+            const SCENARIOS_PER_RUN = 50
+
+            const baselineSize = (() => {
+                forceGc()
+                return process.memoryUsage().heapUsed
+            })()
+
+            for (let iter = 0; iter < ITERATIONS; iter++) {
+                const runId = `run-leak-${iter}`
+                const scenarioIds = Array.from(
+                    {length: SCENARIOS_PER_RUN},
+                    (_, i) => `s-${iter}-${i}`,
+                )
+                const qc = store.get(queryClientAtom)
+                for (const sid of scenarioIds) {
+                    qc.setQueryData(["evaluation-results", "p1", runId, sid], [
+                        {run_id: runId, scenario_id: sid, step_key: "x", status: "ok"},
+                    ] as EvaluationResult[])
+                }
+                await evaluationResultMolecule.actions.prefetchByScenarioIds({
+                    projectId: "p1",
+                    runId,
+                    scenarioIds,
+                })
+                // NO eviction — this is the contrast
+            }
+
+            forceGc()
+            const finalSize = process.memoryUsage().heapUsed
+            const growthMB = (finalSize - baselineSize) / 1024 / 1024
+
+            const cache = inspectCache({prefixes: ["evaluation-results"]})
+
+            // Total cache entries = ITERATIONS * SCENARIOS_PER_RUN
+            assert.equal(
+                cache.totalEntries,
+                ITERATIONS * SCENARIOS_PER_RUN,
+                "cache accumulates every entry without eviction",
+            )
+
+            console.log(
+                `\n  WITHOUT eviction: ${cache.totalEntries} cache entries, heap +${growthMB.toFixed(1)} MB`,
+            )
+
+            // Heap growth is only logged, not asserted. The hard signal is the
+            // assertion above: cache.totalEntries grew linearly. The lesson:
+            // long-run scripts MUST call evictByRunId.
+        },
+    )
+})
diff --git a/web/packages/agenta-entities/src/evaluationRun/state/__tests__/molecules.test.ts b/web/packages/agenta-entities/src/evaluationRun/state/__tests__/molecules.test.ts
new file mode 100644
index 0000000000..3188728ef8
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/state/__tests__/molecules.test.ts
@@ -0,0 +1,248 @@
+/**
+ * Unit tests for the per-scenario read-only molecules.
+ *
+ * Scope: lock in the **cache contract** — what gets read, what gets written,
+ * what `invalidate()` does. End-to-end fetch flow is exercised in the PoC
+ * against a real backend.
+ *
+ * We avoid mocking the api module (ESM bindings are read-only). Instead we
+ * exercise the cache directly via `queryClient.setQueryData` and verify the
+ * molecule reads it correctly. Network behavior is implicit: if every
+ * requested scenario is already cached, no network call is made (we assert
+ * that `prefetchByScenarioIds` resolves with `fetchMs === 0` and
+ * `cacheHits === scenarioIds.length`).
+ */
+
+import assert from "node:assert/strict"
+import {afterEach, beforeEach, describe, it} from "node:test"
+
+import {QueryClient} from "@tanstack/query-core"
+import {getDefaultStore} from "jotai/vanilla"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import type {EvaluationMetric, EvaluationResult} from "../../core"
+import {evaluationMetricMolecule} from "../metricMolecule"
+import {evaluationResultMolecule} from "../resultMolecule"
+
+const store = getDefaultStore()
+const realQueryClient = store.get(queryClientAtom)
+let testQc: QueryClient
+
+beforeEach(() => {
+    testQc = new QueryClient({
+        defaultOptions: {queries: {retry: false, gcTime: Infinity, staleTime: Infinity}},
+    })
+    store.set(queryClientAtom, testQc)
+})
+
+afterEach(() => {
+    store.set(queryClientAtom, realQueryClient)
+})
+
+function makeResult(scenarioId: string, stepKey: string, extras: Partial<EvaluationResult> = {}) {
+    return {
+        run_id: "run1",
+        scenario_id: scenarioId,
+        step_key: stepKey,
+        status: "success",
+        ...extras,
+    } as EvaluationResult
+}
+
+function makeMetric(scenarioId: string | null, extras: Partial<EvaluationMetric> = {}) {
+    return {
+        id: `m-${scenarioId ?? "agg"}`,
+        run_id: "run1",
+        scenario_id: scenarioId,
+        status: "success",
+        ...extras,
+    } as EvaluationMetric
+}
+
+// ============================================================================
+// evaluationResultMolecule
+// ============================================================================
+
+describe("evaluationResultMolecule", () => {
+    const projectId = "p1"
+    const runId = "run1"
+
+    it("get.byScenario returns null when cache empty", () => {
+        const out = evaluationResultMolecule.get.byScenario({projectId, runId, scenarioId: "s1"})
+        assert.equal(out, null)
+    })
+
+    it("get.byScenario returns cached array when populated externally", () => {
+        const rows = [makeResult("s1", "step-a")]
+        testQc.setQueryData(["evaluation-results", projectId, runId, "s1"], rows)
+        const out = evaluationResultMolecule.get.byScenario({projectId, runId, scenarioId: "s1"})
+        assert.deepEqual(out, rows)
+    })
+
+    it("get.byScenario returns empty array when cache has []", () => {
+        testQc.setQueryData(["evaluation-results", projectId, runId, "s1"], [])
+        const out = evaluationResultMolecule.get.byScenario({projectId, runId, scenarioId: "s1"})
+        assert.deepEqual(out, [])
+    })
+
+    it("prefetchByScenarioIds: empty input → no work", async () => {
+        const out = await evaluationResultMolecule.actions.prefetchByScenarioIds({
+            projectId,
+            runId,
+            scenarioIds: [],
+        })
+        assert.equal(out.cacheHits, 0)
+        assert.equal(out.cacheMisses, 0)
+        assert.equal(out.fetchMs, 0)
+        assert.equal(out.results.length, 0)
+    })
+
+    it("prefetchByScenarioIds: full cache → 100% hits, no fetch", async () => {
+        const s1Rows = [makeResult("s1", "step-a"), makeResult("s1", "step-b")]
+        const s2Rows = [makeResult("s2", "step-a")]
+        testQc.setQueryData(["evaluation-results", projectId, runId, "s1"], s1Rows)
+        testQc.setQueryData(["evaluation-results", projectId, runId, "s2"], s2Rows)
+
+        const out = await evaluationResultMolecule.actions.prefetchByScenarioIds({
+            projectId,
+            runId,
+            scenarioIds: ["s1", "s2"],
+        })
+        assert.equal(out.cacheHits, 2)
+        assert.equal(out.cacheMisses, 0)
+        assert.equal(out.fetchMs, 0, "no network when fully cached")
+        assert.equal(out.results.length, 3)
+        assert.deepEqual(out.byScenarioId.get("s1"), s1Rows)
+        assert.deepEqual(out.byScenarioId.get("s2"), s2Rows)
+    })
+
+    it("prefetchByScenarioIds: scenario with [] in cache counts as hit (not refetched)", async () => {
+        testQc.setQueryData(["evaluation-results", projectId, runId, "s1"], [])
+        const out = await evaluationResultMolecule.actions.prefetchByScenarioIds({
+            projectId,
+            runId,
+            scenarioIds: ["s1"],
+        })
+        assert.equal(out.cacheHits, 1)
+        assert.equal(out.cacheMisses, 0)
+        assert.equal(out.fetchMs, 0)
+    })
+
+    it("invalidate() drops a single scenario's cache entry", () => {
+        testQc.setQueryData(["evaluation-results", projectId, runId, "s1"], [makeResult("s1", "x")])
+        testQc.setQueryData(["evaluation-results", projectId, runId, "s2"], [makeResult("s2", "x")])
+
+        evaluationResultMolecule.actions.invalidate({projectId, runId, scenarioId: "s1"})
+
+        // s1 cleared, s2 untouched
+        assert.equal(
+            evaluationResultMolecule.get.byScenario({projectId, runId, scenarioId: "s1"}),
+            null,
+        )
+        const s2 = evaluationResultMolecule.get.byScenario({projectId, runId, scenarioId: "s2"})
+        assert.ok(Array.isArray(s2))
+        assert.equal(s2?.length, 1)
+    })
+
+    it("cache key isolates by projectId + runId", () => {
+        testQc.setQueryData(["evaluation-results", "p1", "run1", "s1"], [makeResult("s1", "x")])
+        const sameProjectDifferentRun = evaluationResultMolecule.get.byScenario({
+            projectId: "p1",
+            runId: "run2",
+            scenarioId: "s1",
+        })
+        assert.equal(sameProjectDifferentRun, null)
+
+        const differentProjectSameRun = evaluationResultMolecule.get.byScenario({
+            projectId: "p2",
+            runId: "run1",
+            scenarioId: "s1",
+        })
+        assert.equal(differentProjectSameRun, null)
+    })
+})
+
+// ============================================================================
+// evaluationMetricMolecule
+// ============================================================================
+
+describe("evaluationMetricMolecule", () => {
+    const projectId = "p1"
+    const runId = "run1"
+
+    it("get.byScenario returns null when cache empty", () => {
+        const out = evaluationMetricMolecule.get.byScenario({projectId, runId, scenarioId: "s1"})
+        assert.equal(out, null)
+    })
+
+    it("get.byScenario returns cached metrics", () => {
+        const rows = [makeMetric("s1")]
+        testQc.setQueryData(["evaluation-metrics", projectId, runId, "s1"], rows)
+        const out = evaluationMetricMolecule.get.byScenario({projectId, runId, scenarioId: "s1"})
+        assert.deepEqual(out, rows)
+    })
+
+    it("prefetchByScenarioIds: full cache → 100% hits, no fetch", async () => {
+        testQc.setQueryData(["evaluation-metrics", projectId, runId, "s1"], [makeMetric("s1")])
+        testQc.setQueryData(["evaluation-metrics", projectId, runId, "s2"], [makeMetric("s2")])
+
+        const out = await evaluationMetricMolecule.actions.prefetchByScenarioIds({
+            projectId,
+            runId,
+            scenarioIds: ["s1", "s2"],
+        })
+        assert.equal(out.cacheHits, 2)
+        assert.equal(out.cacheMisses, 0)
+        assert.equal(out.fetchMs, 0)
+        assert.equal(out.metrics.length, 2)
+    })
+
+    it("invalidate() drops a metric's cache entry", () => {
+        testQc.setQueryData(["evaluation-metrics", projectId, runId, "s1"], [makeMetric("s1")])
+        evaluationMetricMolecule.actions.invalidate({projectId, runId, scenarioId: "s1"})
+        assert.equal(
+            evaluationMetricMolecule.get.byScenario({projectId, runId, scenarioId: "s1"}),
+            null,
+        )
+    })
+
+    it("does not group run-level aggregates (scenario_id=null) under any scenario", async () => {
+        // Pre-populate the cache for s1 only; the call is a pure cache hit (no fetch).
+        testQc.setQueryData(["evaluation-metrics", projectId, runId, "s1"], [makeMetric("s1")])
+        const out = await evaluationMetricMolecule.actions.prefetchByScenarioIds({
+            projectId,
+            runId,
+            scenarioIds: ["s1"],
+        })
+        // Verify the cached s1 metric came through and is keyed properly.
+        assert.equal(out.byScenarioId.get("s1")?.length, 1)
+        assert.equal(out.byScenarioId.get(null as unknown as string), undefined)
+    })
+})
+
+// ============================================================================
+// Cache key shape — locking these in so different cache key shapes don't
+// silently fragment the cache.
+// ============================================================================
+
+describe("cache key shape (locked-in contract)", () => {
+    it("result molecule key: ['evaluation-results', projectId, runId, scenarioId]", () => {
+        testQc.setQueryData(["evaluation-results", "p", "r", "s"], [makeResult("s", "x")])
+        const out = evaluationResultMolecule.get.byScenario({
+            projectId: "p",
+            runId: "r",
+            scenarioId: "s",
+        })
+        assert.ok(out)
+    })
+
+    it("metric molecule key: ['evaluation-metrics', projectId, runId, scenarioId]", () => {
+        testQc.setQueryData(["evaluation-metrics", "p", "r", "s"], [makeMetric("s")])
+        const out = evaluationMetricMolecule.get.byScenario({
+            projectId: "p",
+            runId: "r",
+            scenarioId: "s",
+        })
+        assert.ok(out)
+    })
+})
diff --git a/web/packages/agenta-entities/src/evaluationRun/state/index.ts b/web/packages/agenta-entities/src/evaluationRun/state/index.ts
index 06d36f7f6c..48e9a75e63 100644
--- a/web/packages/agenta-entities/src/evaluationRun/state/index.ts
+++ b/web/packages/agenta-entities/src/evaluationRun/state/index.ts
@@ -5,3 +5,17 @@ export {
     scenarioStepsQueryAtomFamily,
     invalidateEvaluationRunCache,
 } from "./molecule"
+
+// Per-scenario read-only entity caches with cache-aware prefetch
+export {
+    evaluationResultMolecule,
+    type EvaluationResultMolecule,
+    type PrefetchResultsArgs,
+    type PrefetchResultsOutcome,
+} from "./resultMolecule"
+export {
+    evaluationMetricMolecule,
+    type EvaluationMetricMolecule,
+    type PrefetchMetricsArgs,
+    type PrefetchMetricsOutcome,
+} from "./metricMolecule"
diff --git a/web/packages/agenta-entities/src/evaluationRun/state/metricMolecule.ts b/web/packages/agenta-entities/src/evaluationRun/state/metricMolecule.ts
new file mode 100644
index 0000000000..b046b301d4
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/state/metricMolecule.ts
@@ -0,0 +1,194 @@
+/**
+ * evaluationMetricMolecule — minimal entity layer for per-scenario metrics.
+ *
+ * Same shape as `evaluationResultMolecule`. Metrics are read-only from the
+ * UI's perspective. Cache key: `["evaluation-metrics", projectId, runId, scenarioId]`.
+ * Value: `EvaluationMetric[]` (typically one per scenario, but the API
+ * doesn't constrain it — could be multiple).
+ *
+ * @packageDocumentation
+ */
+
+import {getDefaultStore} from "jotai/vanilla"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import {queryEvaluationMetrics} from "../api"
+import type {EvaluationMetric} from "../core"
+
+const KEY_PREFIX = "evaluation-metrics"
+
+function cacheKey(projectId: string, runId: string, scenarioId: string) {
+    return [KEY_PREFIX, projectId, runId, scenarioId] as const
+}
+
+function getQc() {
+    return getDefaultStore().get(queryClientAtom)
+}
+
+export interface PrefetchMetricsArgs {
+    projectId: string
+    runId: string
+    scenarioIds: string[]
+}
+
+export interface PrefetchMetricsOutcome {
+    metrics: EvaluationMetric[]
+    byScenarioId: Map<string, EvaluationMetric[]>
+    cacheHits: number
+    cacheMisses: number
+    fetchMs: number
+}
+
+export const evaluationMetricMolecule = {
+    get: {
+        byScenario(args: {
+            projectId: string
+            runId: string
+            scenarioId: string
+        }): EvaluationMetric[] | null {
+            try {
+                return (
+                    getQc().getQueryData<EvaluationMetric[]>(
+                        cacheKey(args.projectId, args.runId, args.scenarioId),
+                    ) ?? null
+                )
+            } catch {
+                return null
+            }
+        },
+    },
+
+    actions: {
+        async prefetchByScenarioIds(args: PrefetchMetricsArgs): Promise<PrefetchMetricsOutcome> {
+            const {projectId, runId, scenarioIds} = args
+            if (scenarioIds.length === 0) {
+                return {
+                    metrics: [],
+                    byScenarioId: new Map(),
+                    cacheHits: 0,
+                    cacheMisses: 0,
+                    fetchMs: 0,
+                }
+            }
+
+            let qc: ReturnType<typeof getQc> | null = null
+            try {
+                qc = getQc()
+            } catch { /* no queryClient available: degrade to a full fetch */ }
+
+            const byScenarioId = new Map<string, EvaluationMetric[]>()
+            const misses: string[] = []
+            let hits = 0
+
+            if (qc) {
+                for (const sid of scenarioIds) {
+                    const cached = qc.getQueryData<EvaluationMetric[]>(
+                        cacheKey(projectId, runId, sid),
+                    )
+                    if (cached !== undefined) {
+                        byScenarioId.set(sid, cached)
+                        hits++
+                    } else {
+                        misses.push(sid)
+                    }
+                }
+            } else {
+                misses.push(...scenarioIds)
+            }
+
+            let fetchMs = 0
+            if (misses.length > 0) {
+                const start = performance.now()
+                const fetched = await queryEvaluationMetrics({
+                    projectId,
+                    runId,
+                    scenarioIds: misses,
+                })
+                fetchMs = performance.now() - start
+
+                for (const m of fetched) {
+                    if (!m.scenario_id) continue // run-level aggregates have no scenario_id
+                    const arr = byScenarioId.get(m.scenario_id) ?? []
+                    arr.push(m)
+                    byScenarioId.set(m.scenario_id, arr)
+                }
+                if (qc) {
+                    for (const sid of misses) {
+                        qc.setQueryData(
+                            cacheKey(projectId, runId, sid),
+                            byScenarioId.get(sid) ?? [],
+                        )
+                    }
+                }
+            }
+
+            const flat: EvaluationMetric[] = []
+            byScenarioId.forEach((arr) => flat.push(...arr))
+
+            return {
+                metrics: flat,
+                byScenarioId,
+                cacheHits: hits,
+                cacheMisses: misses.length,
+                fetchMs,
+            }
+        },
+
+        invalidate(args: {projectId: string; runId: string; scenarioId: string}): void {
+            try {
+                getQc().removeQueries({
+                    queryKey: cacheKey(args.projectId, args.runId, args.scenarioId),
+                })
+            } catch { /* no queryClient available: nothing to invalidate */ }
+        },
+
+        /**
+         * Bulk-evict every cached metric for a run. See resultMolecule for
+         * rationale. Returns the count of removed entries.
+         */
+        evictByRunId(args: {projectId: string; runId: string}): number {
+            try {
+                const cache = getQc().getQueryCache()
+                const toRemove = cache.findAll({
+                    queryKey: [KEY_PREFIX, args.projectId, args.runId],
+                    exact: false,
+                })
+                toRemove.forEach((q) => cache.remove(q))
+                return toRemove.length
+            } catch {
+                return 0
+            }
+        },
+
+        /**
+         * Bulk-evict cached metrics for a specific set of scenarios — the
+         * per-chunk counterpart of `prefetchByScenarioIds`. See
+         * `evaluationResultMolecule.actions.evictByScenarioIds` for the
+         * rationale. Returns the count of entries removed.
+         */
+        evictByScenarioIds(args: {
+            projectId: string
+            runId: string
+            scenarioIds: string[]
+        }): number {
+            let removed = 0
+            try {
+                const qc = getQc()
+                for (const sid of args.scenarioIds) {
+                    const key = cacheKey(args.projectId, args.runId, sid)
+                    if (qc.getQueryData(key) !== undefined) {
+                        qc.removeQueries({queryKey: key, exact: true})
+                        removed++
+                    }
+                }
+            } catch {
+                // No queryClient — nothing to evict.
+            }
+            return removed
+        },
+    },
+
+    _internal: {cacheKey},
+}
+
+export type EvaluationMetricMolecule = typeof evaluationMetricMolecule
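
Review note: the hit/miss partition and write-back in `prefetchByScenarioIds` can be sketched without TanStack Query. This is a minimal illustration, not the molecule itself. It assumes a plain `Map` in place of the QueryClient cache, and `MetricRow`, `MetricCache`, `metricKeyOf`, `partitionByCache`, and `writeBackMisses` are hypothetical names introduced only for this example.

```typescript
// Sketch only: a plain Map stands in for the TanStack QueryClient cache.
// All names here are hypothetical and do not exist in the package.
interface MetricRow {
    scenario_id: string | null
    value: number
}

type MetricCache = Map<string, MetricRow[]>

// Serializes the [prefix, projectId, runId, scenarioId] tuple into a Map key.
function metricKeyOf(projectId: string, runId: string, scenarioId: string): string {
    return JSON.stringify(["evaluation-metrics", projectId, runId, scenarioId])
}

// Step 1: split the requested ids into cache hits and misses.
function partitionByCache(
    cache: MetricCache,
    projectId: string,
    runId: string,
    scenarioIds: string[],
) {
    const byScenarioId = new Map<string, MetricRow[]>()
    const misses: string[] = []
    for (const sid of scenarioIds) {
        const cached = cache.get(metricKeyOf(projectId, runId, sid))
        // `undefined` means "never fetched"; an empty array is a valid hit.
        if (cached !== undefined) byScenarioId.set(sid, cached)
        else misses.push(sid)
    }
    return {byScenarioId, misses}
}

// Steps 2-3: group fetched rows, then write a cache entry for every miss.
function writeBackMisses(
    cache: MetricCache,
    projectId: string,
    runId: string,
    misses: string[],
    fetched: MetricRow[],
    byScenarioId: Map<string, MetricRow[]>,
): void {
    for (const row of fetched) {
        if (!row.scenario_id) continue // run-level aggregate: belongs to no scenario
        const rows = byScenarioId.get(row.scenario_id) ?? []
        rows.push(row)
        byScenarioId.set(row.scenario_id, rows)
    }
    for (const sid of misses) {
        // Misses with no rows are cached as [] so they are not refetched.
        cache.set(metricKeyOf(projectId, runId, sid), byScenarioId.get(sid) ?? [])
    }
}
```

The important detail is the `!== undefined` check: a scenario cached as `[]` counts as a hit, which is what keeps in-progress scenarios from being refetched on every page load.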
diff --git a/web/packages/agenta-entities/src/evaluationRun/state/resultMolecule.ts b/web/packages/agenta-entities/src/evaluationRun/state/resultMolecule.ts
new file mode 100644
index 0000000000..2bfb57f656
--- /dev/null
+++ b/web/packages/agenta-entities/src/evaluationRun/state/resultMolecule.ts
@@ -0,0 +1,247 @@
+/**
+ * evaluationResultMolecule — minimal entity layer for evaluation results.
+ *
+ * Results are *read-only* from the UI's perspective (the user doesn't edit
+ * a result; the eval engine produces them). So this molecule's surface is
+ * tiny:
+ *
+ *   .get.byScenario(args)                   imperative cache read
+ *   .actions.prefetchByScenarioIds(args)    cache-aware bulk fetch
+ *   .actions.invalidate(args)               drop a scenario's cache entry
+ *
+ * # Cache identity
+ *
+ * Uses the shared Jotai `queryClientAtom`, same store every other molecule
+ * uses. Cache key: `["evaluation-results", projectId, runId, scenarioId]`.
+ * The value at each key is `EvaluationResult[]` (the steps for that scenario).
+ *
+ * Empty arrays are cached too. A scenario with no results yet (run still in
+ * progress) returns `[]` from cache rather than refetching every time.
+ *
+ * # Why this molecule doesn't wrap `createMolecule`
+ *
+ * Existing molecules (testcase, trace) wrap `createMolecule` which provides
+ * drafts, controllers, selection, etc. — appropriate for editable entities.
+ * Results have no edit surface, so we skip the heavy infrastructure. The
+ * shape (`.get.*`, `.actions.*`) still matches the convention so callers
+ * read consistently across molecules.
+ *
+ * @packageDocumentation
+ */
+
+import {getDefaultStore} from "jotai/vanilla"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import {queryEvaluationResults} from "../api"
+import type {EvaluationResult} from "../core"
+
+const KEY_PREFIX = "evaluation-results"
+
+function cacheKey(projectId: string, runId: string, scenarioId: string) {
+    return [KEY_PREFIX, projectId, runId, scenarioId] as const
+}
+
+function getQc() {
+    return getDefaultStore().get(queryClientAtom)
+}
+
+export interface PrefetchResultsArgs {
+    projectId: string
+    runId: string
+    scenarioIds: string[]
+}
+
+export interface PrefetchResultsOutcome {
+    /** All results, ungrouped (cached + freshly fetched). */
+    results: EvaluationResult[]
+    /** Results grouped by scenario_id. */
+    byScenarioId: Map<string, EvaluationResult[]>
+    cacheHits: number
+    cacheMisses: number
+    /** Network time for the bulk fetch; 0 if all scenarios were cached. */
+    fetchMs: number
+}
+
+export const evaluationResultMolecule = {
+    get: {
+        /**
+         * Synchronous cache lookup. Returns `null` if the scenario hasn't been
+         * prefetched yet (caller should fall back to a prefetch).
+         */
+        byScenario(args: {
+            projectId: string
+            runId: string
+            scenarioId: string
+        }): EvaluationResult[] | null {
+            try {
+                return (
+                    getQc().getQueryData<EvaluationResult[]>(
+                        cacheKey(args.projectId, args.runId, args.scenarioId),
+                    ) ?? null
+                )
+            } catch {
+                return null
+            }
+        },
+    },
+
+    actions: {
+        /**
+         * Cache-aware bulk prefetch. Steps:
+         *   1. partition input scenarioIds into hits vs misses
+         *   2. POST /evaluations/results/query with the misses only
+         *   3. group fetched rows by scenario_id
+         *   4. write cache entries for every miss (including empties)
+         *   5. return cached + fetched together
+         */
+        async prefetchByScenarioIds(args: PrefetchResultsArgs): Promise<PrefetchResultsOutcome> {
+            const {projectId, runId, scenarioIds} = args
+            if (scenarioIds.length === 0) {
+                return {
+                    results: [],
+                    byScenarioId: new Map(),
+                    cacheHits: 0,
+                    cacheMisses: 0,
+                    fetchMs: 0,
+                }
+            }
+
+            let qc: ReturnType<typeof getQc> | null = null
+            try {
+                qc = getQc()
+            } catch {
+                // No queryClient available — degrade to full fetch
+            }
+
+            const byScenarioId = new Map<string, EvaluationResult[]>()
+            const misses: string[] = []
+            let hits = 0
+
+            if (qc) {
+                for (const sid of scenarioIds) {
+                    const cached = qc.getQueryData<EvaluationResult[]>(
+                        cacheKey(projectId, runId, sid),
+                    )
+                    if (cached !== undefined) {
+                        byScenarioId.set(sid, cached)
+                        hits++
+                    } else {
+                        misses.push(sid)
+                    }
+                }
+            } else {
+                misses.push(...scenarioIds)
+            }
+
+            let fetchMs = 0
+            if (misses.length > 0) {
+                const start = performance.now()
+                const fetched = await queryEvaluationResults({
+                    projectId,
+                    runId,
+                    scenarioIds: misses,
+                })
+                fetchMs = performance.now() - start
+
+                // Group by scenario_id
+                for (const r of fetched) {
+                    const arr = byScenarioId.get(r.scenario_id) ?? []
+                    arr.push(r)
+                    byScenarioId.set(r.scenario_id, arr)
+                }
+                // Write cache for every miss — including empty arrays for
+                // scenarios with no rows yet (so we don't re-fetch them).
+                if (qc) {
+                    for (const sid of misses) {
+                        qc.setQueryData(
+                            cacheKey(projectId, runId, sid),
+                            byScenarioId.get(sid) ?? [],
+                        )
+                    }
+                }
+            }
+
+            // Flatten in Map insertion order: cached scenarios first, then fetched ones
+            const flat: EvaluationResult[] = []
+            byScenarioId.forEach((arr) => flat.push(...arr))
+
+            return {
+                results: flat,
+                byScenarioId,
+                cacheHits: hits,
+                cacheMisses: misses.length,
+                fetchMs,
+            }
+        },
+
+        /** Drop a scenario's cache entry — next read will refetch. */
+        invalidate(args: {projectId: string; runId: string; scenarioId: string}): void {
+            try {
+                getQc().removeQueries({
+                    queryKey: cacheKey(args.projectId, args.runId, args.scenarioId),
+                })
+            } catch {
+                // No queryClient
+            }
+        },
+
+        /**
+         * Bulk-evict every cached result for a run. Use this after finishing a
+         * long-running ETL pass to release memory. In a script (non-browser)
+         * context TanStack Query's default gcTime is Infinity, so unobserved
+         * entries are never garbage-collected and accumulate.
+         *
+         * Returns the number of cache entries removed.
+         */
+        evictByRunId(args: {projectId: string; runId: string}): number {
+            try {
+                // Prefix match: every key starts with `[KEY_PREFIX, projectId, runId, ...]`
+                const cache = getQc().getQueryCache()
+                const toRemove = cache.findAll({
+                    queryKey: [KEY_PREFIX, args.projectId, args.runId],
+                    exact: false,
+                })
+                toRemove.forEach((q) => cache.remove(q))
+                return toRemove.length
+            } catch {
+                return 0
+            }
+        },
+
+        /**
+         * Bulk-evict cached results for a specific set of scenarios — the
+         * per-chunk counterpart of `prefetchByScenarioIds`. An ETL
+         * chunk-release hook (see `ChunkReleaseHook`) calls this once the
+         * sink has consumed a chunk, so heap stays bounded by chunk size
+         * across an arbitrarily long scan instead of growing with the
+         * dataset.
+         *
+         * Returns the number of cache entries actually removed.
+         */
+        evictByScenarioIds(args: {
+            projectId: string
+            runId: string
+            scenarioIds: string[]
+        }): number {
+            let removed = 0
+            try {
+                const qc = getQc()
+                for (const sid of args.scenarioIds) {
+                    const key = cacheKey(args.projectId, args.runId, sid)
+                    if (qc.getQueryData(key) !== undefined) {
+                        qc.removeQueries({queryKey: key, exact: true})
+                        removed++
+                    }
+                }
+            } catch {
+                // No queryClient — nothing to evict.
+            }
+            return removed
+        },
+    },
+
+    /** Exposed for test code only — don't depend on this from app code. */
+    _internal: {cacheKey},
+}
+
+export type EvaluationResultMolecule = typeof evaluationResultMolecule
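
Review note: the two eviction granularities (`evictByScenarioIds` per chunk, `evictByRunId` per run) can be illustrated with a Map-backed stand-in. This is a rough sketch under stated assumptions, not the actual ETL hook: `resultCache`, `resultKey`, `evictChunk`, `evictRun`, and `scanInChunks` are hypothetical names, and the "sink" is reduced to a no-op.

```typescript
// Sketch only: Map-backed stand-in for the query cache. Keys are
// JSON-serialized [prefix, projectId, runId, scenarioId] tuples.
const resultCache = new Map<string, string[]>()

function resultKey(projectId: string, runId: string, scenarioId: string): string {
    return JSON.stringify(["evaluation-results", projectId, runId, scenarioId])
}

// Per-chunk eviction: remove only the ids the sink has finished with.
function evictChunk(projectId: string, runId: string, scenarioIds: string[]): number {
    let removed = 0
    for (const sid of scenarioIds) {
        if (resultCache.delete(resultKey(projectId, runId, sid))) removed++
    }
    return removed
}

// Run-level eviction: prefix match on [prefix, projectId, runId].
function evictRun(projectId: string, runId: string): number {
    let removed = 0
    for (const key of Array.from(resultCache.keys())) {
        const [prefix, p, r] = JSON.parse(key) as string[]
        if (prefix === "evaluation-results" && p === projectId && r === runId) {
            resultCache.delete(key)
            removed++
        }
    }
    return removed
}

// Hypothetical scan loop: hydrate a chunk, hand it to the sink, release it.
// Returns the highest cache entry count observed during the scan.
function scanInChunks(
    projectId: string,
    runId: string,
    allIds: string[],
    chunkSize: number,
): number {
    let peak = 0
    for (let i = 0; i < allIds.length; i += chunkSize) {
        const chunk = allIds.slice(i, i + chunkSize)
        for (const sid of chunk) {
            resultCache.set(resultKey(projectId, runId, sid), [`row-${sid}`])
        }
        peak = Math.max(peak, resultCache.size)
        evictChunk(projectId, runId, chunk) // the sink is done with this chunk
    }
    return peak
}
```

With per-chunk release the peak entry count tracks the chunk size instead of the dataset size, which is the property the `evictByScenarioIds` doc comment describes.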
diff --git a/web/packages/agenta-entities/src/shared/entityBridge.ts b/web/packages/agenta-entities/src/shared/entityBridge.ts
index ffaaf902d5..433422f74a 100644
--- a/web/packages/agenta-entities/src/shared/entityBridge.ts
+++ b/web/packages/agenta-entities/src/shared/entityBridge.ts
@@ -85,7 +85,11 @@ export interface BaseMolecule {
         update: OpaqueWritableAtom
         /** Discard entity draft */
         discard: OpaqueWritableAtom
-        [key: string]: OpaqueWritableAtom
+        // Molecules also expose non-atom helpers here (e.g. prefetchByIds:
+        // an async function, not a WritableAtom). The index signature accepts
+        // anything so the heterogeneous bag stays expressible without forcing
+        // every molecule into the OpaqueWritableAtom shape.
+        [key: string]: unknown
     }
 }
 
diff --git a/web/packages/agenta-entities/src/shared/molecule/instrumentedAtomFamily.ts b/web/packages/agenta-entities/src/shared/molecule/instrumentedAtomFamily.ts
new file mode 100644
index 0000000000..241580ea60
--- /dev/null
+++ b/web/packages/agenta-entities/src/shared/molecule/instrumentedAtomFamily.ts
@@ -0,0 +1,166 @@
+/**
+ * Instrumented atomFamily — a drop-in replacement for `jotai-family`'s
+ * `atomFamily` that tracks active params in a Set so callers can ask
+ * "how many entries does this family hold right now?"
+ *
+ * # Why this exists
+ *
+ * `atomFamily(create)` is the load-bearing mechanism for entity-keyed
+ * reactive state in this codebase. It memoizes one atom per unique param.
+ * Without `.remove(param)`, the underlying map grows monotonically — every
+ * unique id ever requested keeps an atom alive for the process lifetime.
+ *
+ * The base library exposes `.remove()` for eviction but provides no way
+ * to *inspect* current size. That makes memory diagnosis impossible:
+ * "is this family holding 50 ids or 50,000?" has no answer from outside.
+ *
+ * This wrapper closes that gap:
+ *   - Same callable API: `family(param) → Atom`
+ *   - Same `.remove(param)` semantics
+ *   - Adds `.size()` — current number of memoized params
+ *   - Adds `.params()` — iterator over the active params (for spot-checks)
+ *   - Adds `.clear()` — bulk-remove everything
+ *   - Optionally registers itself globally so a diagnostic helper can list
+ *     all families and their sizes by name
+ *
+ * # Migration
+ *
+ * For atom families you want diagnosable, replace:
+ *   import {atomFamily} from "jotai-family"
+ *   const myFamily = atomFamily((id) => atom(...))
+ *
+ * with:
+ *   import {instrumentedAtomFamily} from "../../shared/molecule/instrumentedAtomFamily"
+ *   const myFamily = instrumentedAtomFamily((id) => atom(...), {name: "myFamily"})
+ *
+ * Existing callers continue to work — the returned object is callable with
+ * the same signature and exposes `.remove()` the same way.
+ *
+ * @packageDocumentation
+ */
+
+import type {Atom} from "jotai"
+import {atomFamily as baseAtomFamily} from "jotai-family"
+
+// ============================================================================
+// Registry (module-scoped, lazy)
+// ============================================================================
+
+const registry = new Map<string, InstrumentedAtomFamily<unknown, Atom<unknown>>>()
+
+export interface AtomFamilyStats {
+    name: string
+    size: number
+}
+
+/**
+ * Snapshot of every instrumented family currently registered.
+ *
+ * Names are best-effort — caller-provided via the `name` option. Without a
+ * name, the registry stores under an auto-generated key like `family-3`,
+ * which is fine for counting but not great for spotting which family is
+ * leaking. Always pass `name` when adding new instrumented families.
+ *
+ * Results are sorted by size descending so leaks stand out first.
+ */
+export function inspectAtomFamilies(): AtomFamilyStats[] {
+    return Array.from(registry.entries())
+        .map(([name, family]) => ({name, size: family.size()}))
+        .sort((a, b) => b.size - a.size)
+}
+
+/**
+ * Bulk-clear all instrumented families. Mostly useful in tests between
+ * scenarios that need a clean slate. Don't call this in production code —
+ * it'll unsubscribe every active atom subscriber in the process.
+ */
+export function clearAllAtomFamilies(): number {
+    let removed = 0
+    for (const family of registry.values()) {
+        removed += family.size()
+        family.clear()
+    }
+    return removed
+}
+
+// ============================================================================
+// Wrapper
+// ============================================================================
+
+export interface InstrumentedAtomFamilyOptions<TParam = unknown> {
+    /**
+     * Identifier used in the diagnostic registry. Pass something stable and
+     * descriptive — e.g. `"trace.traceEntityAtomFamily"`. If omitted, an
+     * auto-generated counter is used.
+     */
+    name?: string
+    /**
+     * Skip registry registration. Use when a family is local to a function
+     * scope (e.g. inside a factory) and shouldn't pollute the global view.
+     */
+    skipRegistry?: boolean
+    /**
+     * Custom equality predicate for param deduplication. Mirrors the
+     * optional 2nd argument of `jotai-family`'s `atomFamily`. Without this,
+     * params are compared by reference identity (Object.is) which means
+     * structurally-equal-but-different-reference params would each create
+     * a separate atom (and a separate Set entry here).
+     */
+    areEqual?: (a: TParam, b: TParam) => boolean
+}
+
+export interface InstrumentedAtomFamily<TParam, TAtom> {
+    /** Get-or-create the atom for `param`. Tracks `param` in the size set. */
+    (param: TParam): TAtom
+    /** Number of tracked params (the size of the wrapper's tracking Set). */
+    size: () => number
+    /** Iterator over active params — for spot-checks during diagnostics. */
+    params: () => IterableIterator<TParam>
+    /** Drop a single param's atom. Mirrors `atomFamily.remove`. */
+    remove: (param: TParam) => void
+    /** Drop every param's atom. */
+    clear: () => void
+    /** The diagnostic name (mostly for debug logs). */
+    readonly name: string
+}
+
+let anon = 0
+
+export function instrumentedAtomFamily<TParam, TAtom extends Atom<unknown>>(
+    create: (param: TParam) => TAtom,
+    options: InstrumentedAtomFamilyOptions<TParam> = {},
+): InstrumentedAtomFamily<TParam, TAtom> {
+    const family = baseAtomFamily(create, options.areEqual)
+    // We need our own set because jotai-family doesn't expose iteration.
+    // When `areEqual` is supplied, the underlying family dedups by that
+    // predicate, but our Set still tracks by reference, so `size()` can
+    // over-count when callers pass structurally-equal params with different
+    // references. For diagnostic purposes (counting) that over-count is
+    // acceptable.
+    const params = new Set<TParam>()
+
+    const fn = ((param: TParam) => {
+        params.add(param)
+        return family(param)
+    }) as InstrumentedAtomFamily<TParam, TAtom>
+
+    const name = options.name ?? `family-${++anon}`
+    Object.defineProperty(fn, "name", {value: name, configurable: false})
+
+    fn.size = () => params.size
+    fn.params = () => params.values()
+    fn.remove = (param: TParam) => {
+        params.delete(param)
+        family.remove(param)
+    }
+    fn.clear = () => {
+        for (const p of params) family.remove(p)
+        params.clear()
+    }
+
+    if (!options.skipRegistry) {
+        registry.set(name, fn as InstrumentedAtomFamily<unknown, Atom<unknown>>)
+    }
+
+    return fn
+}
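The over-count described in the wrapper's comment can be reproduced without jotai. The sketch below is a dependency-free illustration, not the real implementation: `miniFamily` is a hypothetical stand-in for a memoizing family with an optional equality predicate, and `trackedFamily` mirrors the reference-keyed `Set` tracking used in the diff.

```typescript
// Hypothetical stand-in for a memoizing family with an optional equality
// predicate. It only mimics the contract; it is NOT jotai-family's code.
type Family<P, V> = ((param: P) => V) & {remove: (param: P) => void}

function miniFamily<P, V>(
    create: (param: P) => V,
    areEqual: (a: P, b: P) => boolean = (a, b) => a === b,
): Family<P, V> {
    const entries: {key: P; value: V}[] = []
    const fn = ((param: P) => {
        const hit = entries.find((entry) => areEqual(entry.key, param))
        if (hit) return hit.value
        const value = create(param)
        entries.push({key: param, value})
        return value
    }) as Family<P, V>
    fn.remove = (param: P) => {
        const index = entries.findIndex((entry) => areEqual(entry.key, param))
        if (index >= 0) entries.splice(index, 1)
    }
    return fn
}

// Wrapper that tracks params in a reference-keyed Set, as the diff does.
function trackedFamily<P, V>(create: (param: P) => V, areEqual?: (a: P, b: P) => boolean) {
    const family = miniFamily(create, areEqual)
    const params = new Set<P>()
    const fn = (param: P): V => {
        params.add(param)
        return family(param)
    }
    return Object.assign(fn, {
        size: () => params.size,
        clear: () => {
            params.forEach((p) => family.remove(p))
            params.clear()
        },
    })
}

const selection = trackedFamily(
    (p: {scopeId: string}) => ({scopeId: p.scopeId, keys: [] as string[]}),
    (a, b) => a.scopeId === b.scopeId,
)
const first = selection({scopeId: "run-1"})
// A fresh object literal: structurally equal, different reference.
const second = selection({scopeId: "run-1"})
// The family dedups (first === second), but the Set now holds two params.
```

Under these assumptions the family returns the same value for both calls while `size()` reports two tracked params, so `size()` is an upper bound whenever callers build params inline.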
diff --git a/web/packages/agenta-entities/src/shared/paginated/createInfiniteDatasetStore.ts b/web/packages/agenta-entities/src/shared/paginated/createInfiniteDatasetStore.ts
index c8d0b5b688..d84d522497 100644
--- a/web/packages/agenta-entities/src/shared/paginated/createInfiniteDatasetStore.ts
+++ b/web/packages/agenta-entities/src/shared/paginated/createInfiniteDatasetStore.ts
@@ -11,8 +11,9 @@ import type {Key} from "react"
 
 import type {Atom, PrimitiveAtom} from "jotai"
 import {atom, useAtom, useAtomValue} from "jotai"
-import {atomFamily} from "jotai-family"
 
+// Use the instrumented wrapper so each store can be `dispose()`-d.
+import {instrumentedAtomFamily} from "../molecule/instrumentedAtomFamily"
 import type {InfiniteTableFetchResult, InfiniteTableRowBase, WindowingState} from "../tableTypes"
 
 import {createInfiniteTableStore} from "./createInfiniteTableStore"
@@ -84,13 +85,48 @@ export interface InfiniteDatasetStore<Row extends InfiniteTableRowBase, ApiRow,
             params: ScopeParams,
         ) => [Key[], (next: Key[] | ((prev: Key[]) => Key[])) => void]
     }
+    /**
+     * Release every atomFamily entry this store owns and cascade into the
+     * inner table store. Returns the total number of entries removed.
+     */
+    dispose: () => number
+    /** Diagnostic — names + sizes for every owned family (and inner store). */
+    familySizes: () => {name: string; size: number}[]
 }
 
 export const createInfiniteDatasetStore = <Row extends InfiniteTableRowBase, ApiRow, Meta>(
     config: InfiniteDatasetStoreConfig<Row, ApiRow, Meta>,
 ): InfiniteDatasetStore<Row, ApiRow, Meta> => {
-    const selectionAtomFamily = atomFamily(
+    // Per-store family registry for dispose() / familySizes(). See
+    // createInfiniteTableStore.ts for the same pattern.
+    interface ManagedFamily {
+        clear: () => void
+        size: () => number
+        readonly name: string
+    }
+    const ownedFamilies: ManagedFamily[] = []
+    // Constrain `A extends Atom<unknown>` to match `instrumentedAtomFamily`'s
+    // signature so the atom-type generic survives the wrapper. Returning the
+    // erased `…<P, never>` shape (as the earlier version did) collapsed every
+    // `get(family(params))` call to `unknown` — that's the leak the `as never`
+    // casts were papering over.
+    const trackFamily = <P, A extends Atom<unknown>>(
+        create: (p: P) => A,
+        name: string,
+        areEqual?: (a: P, b: P) => boolean,
+    ) => {
+        const fam = instrumentedAtomFamily<P, A>(create, {
+            name,
+            skipRegistry: true,
+            areEqual,
+        })
+        ownedFamilies.push(fam as unknown as ManagedFamily)
+        return fam
+    }
+
+    const selectionAtomFamily = trackFamily(
         ({scopeId}: ScopeParams) => atom<Key[]>([]),
+        "infiniteDataset.selectionAtomFamily",
         (a, b) => a.scopeId === b.scopeId,
     )
 
@@ -190,7 +226,7 @@ export const createInfiniteDatasetStore = <Row extends InfiniteTableRowBase, Api
 
     // Create wrapper atoms that merge client rows if clientRowsAtom is provided
     // Use atomFamily to cache derived atoms by params
-    const rowsWithClientAtomFamily = atomFamily(
+    const rowsWithClientAtomFamily = trackFamily(
         (params: TablePagesParams) => {
             const baseRowsAtom = tableStore.atoms.combinedRowsAtomFamily(params)
 
@@ -247,10 +283,11 @@ export const createInfiniteDatasetStore = <Row extends InfiniteTableRowBase, Api
                 return cachedResult
             })
         },
+        "infiniteDataset.rowsWithClientAtomFamily",
         (a, b) => a.scopeId === b.scopeId && a.pageSize === b.pageSize,
     )
 
-    const paginationWithClientAtomFamily = atomFamily(
+    const paginationWithClientAtomFamily = trackFamily(
         (params: TablePagesParams) => {
             const basePaginationAtom = tableStore.atoms.paginationInfoAtomFamily(params)
             const baseRowsAtom = tableStore.atoms.combinedRowsAtomFamily(params)
@@ -288,6 +325,7 @@ export const createInfiniteDatasetStore = <Row extends InfiniteTableRowBase, Api
                 }
             })
         },
+        "infiniteDataset.paginationWithClientAtomFamily",
         (a, b) => a.scopeId === b.scopeId && a.pageSize === b.pageSize,
     )
 
@@ -317,5 +355,28 @@ export const createInfiniteDatasetStore = <Row extends InfiniteTableRowBase, Api
             usePagination,
             useRowSelection,
         },
+        // Release every atomFamily entry this store owns AND cascade into
+        // the inner table store. Returns total count removed.
+        dispose() {
+            let total = 0
+            for (const f of ownedFamilies) {
+                total += f.size()
+                f.clear()
+            }
+            if (typeof (tableStore as {dispose?: () => number}).dispose === "function") {
+                total += (tableStore as unknown as {dispose: () => number}).dispose()
+            }
+            return total
+        },
+        // Diagnostic — own + inner table store
+        familySizes() {
+            const own = ownedFamilies.map((f) => ({name: f.name, size: f.size()}))
+            const innerFn = (
+                tableStore as unknown as {
+                    familySizes?: () => {name: string; size: number}[]
+                }
+            ).familySizes
+            return typeof innerFn === "function" ? [...own, ...innerFn.call(tableStore)] : own
+        },
     }
 }
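The `dispose()` / `familySizes()` cascade above can be sketched in isolation. This is a minimal, dependency-free illustration under assumptions: `makeFamily` and `makeStore` are hypothetical helpers that only model the `ManagedFamily` contract, not the real jotai-backed stores.

```typescript
interface ManagedFamily {
    readonly name: string
    size: () => number
    clear: () => void
}

interface DisposableStore {
    dispose: () => number
    familySizes: () => {name: string; size: number}[]
}

// Hypothetical family stub that only records which keys were touched.
function makeFamily(name: string): ManagedFamily & {touch: (key: string) => void} {
    const keys = new Set<string>()
    return {
        name,
        size: () => keys.size,
        clear: () => keys.clear(),
        touch: (key: string) => {
            keys.add(key)
        },
    }
}

// A store clears its own families first, then cascades into the inner store.
function makeStore(ownedFamilies: ManagedFamily[], inner?: DisposableStore): DisposableStore {
    return {
        dispose() {
            let total = 0
            for (const family of ownedFamilies) {
                total += family.size()
                family.clear()
            }
            if (inner) total += inner.dispose()
            return total
        },
        familySizes() {
            const own = ownedFamilies.map((f) => ({name: f.name, size: f.size()}))
            return inner ? [...own, ...inner.familySizes()] : own
        },
    }
}

const pagesFamily = makeFamily("inner.pages")
const selectionFamily = makeFamily("outer.selection")
const innerStore = makeStore([pagesFamily])
const outerStore = makeStore([selectionFamily], innerStore)

// Simulate an ETL pass that rotates scopeId on every iteration.
for (const scopeId of ["scope-1", "scope-2", "scope-3"]) {
    pagesFamily.touch(scopeId)
    selectionFamily.touch(scopeId)
}

const sizesBefore = outerStore.familySizes()
const removed = outerStore.dispose()
const sizesAfter = outerStore.familySizes()
```

Returning the removed count from `dispose()` lets a caller log how much was released per pass, and reading `familySizes()` before and after gives a cheap check that nothing was missed.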
diff --git a/web/packages/agenta-entities/src/shared/paginated/createInfiniteTableStore.ts b/web/packages/agenta-entities/src/shared/paginated/createInfiniteTableStore.ts
index 1d4e2a92b2..6535fdbe6a 100644
--- a/web/packages/agenta-entities/src/shared/paginated/createInfiniteTableStore.ts
+++ b/web/packages/agenta-entities/src/shared/paginated/createInfiniteTableStore.ts
@@ -7,13 +7,18 @@
  * Copied from @agenta/ui to avoid dependency.
  */
 
-import {atom} from "jotai"
+import {atom, getDefaultStore} from "jotai"
 import type {Atom, WritableAtom} from "jotai"
-import {atomFamily} from "jotai-family"
-import {atomWithQuery} from "jotai-tanstack-query"
+import {atomWithQuery, queryClientAtom} from "jotai-tanstack-query"
 import type {AtomWithQueryResult} from "jotai-tanstack-query"
 import {v4 as uuidv4} from "uuid"
 
+// Use the instrumented wrapper so each paginated/table store can be
+// `dispose()`-d to release its atomFamily entries. `skipRegistry: true`
+// keeps these out of the global diagnostic surface (we'd otherwise get
+// one registry entry per scopeId, which gets noisy). The factory collects
+// its own families and releases them via the `dispose()` on its return shape.
+import {instrumentedAtomFamily} from "../molecule/instrumentedAtomFamily"
 import type {
     InfiniteTableFetchParams,
     InfiniteTableFetchResult,
@@ -74,6 +79,10 @@ export interface InfiniteTableStore<TableRow extends InfiniteTableRowBase, ApiRo
         ) => WritableAtom<AtomWithQueryResult<InfiniteTableFetchResult<ApiRow>>, [], void>
     }
     createInitialPage: (pageSize: number) => InfiniteTablePage
+    /** Release every atomFamily entry this store owns. Returns count removed. */
+    dispose: () => number
+    /** Diagnostic: active param counts per internal family. */
+    familySizes: () => {name: string; size: number}[]
 }
 
 interface CreateInfiniteTableStoreOptions<
@@ -162,6 +171,46 @@ export const createInfiniteTableStore = <
             return a.scopeId === b.scopeId && a.pageSize === b.pageSize
         })
 
+    // Per-store collector for atom families so dispose() can release them.
+    // Each family is name-tagged for diagnostic visibility via .name.
+    interface ManagedFamily {
+        clear: () => void
+        size: () => number
+        readonly name: string
+    }
+    const ownedFamilies: ManagedFamily[] = []
+    // Accepts both `atomFamily(create, name)` and the original
+    // `atomFamily(create, areEqual, name)` shape — see equivalent comment
+    // in createPaginatedEntityStore for why this matters (without
+    // preserving the original areEqual fns, params would dedup by
+    // reference and break memoization → pagination state loss).
+    // Constrain `A extends Atom<unknown>` to match `instrumentedAtomFamily`'s
+    // signature so the atom-type generic survives the wrapper. Returning the
+    // erased `…<P, never>` shape (as the earlier version did) collapsed every
+    // `get(family(params))` call to `unknown` — that's the leak the `as never`
+    // casts were papering over.
+    const atomFamily = <P, A extends Atom<unknown>>(
+        create: (p: P) => A,
+        areEqualOrName?: ((a: P, b: P) => boolean) | string,
+        nameArg?: string,
+    ) => {
+        let resolvedName: string | undefined
+        let resolvedAreEqual: ((a: P, b: P) => boolean) | undefined
+        if (typeof areEqualOrName === "function") {
+            resolvedAreEqual = areEqualOrName
+            resolvedName = nameArg
+        } else if (typeof areEqualOrName === "string") {
+            resolvedName = areEqualOrName
+        }
+        const fam = instrumentedAtomFamily<P, A>(create, {
+            name: resolvedName,
+            skipRegistry: true,
+            areEqual: resolvedAreEqual,
+        })
+        ownedFamilies.push(fam as unknown as ManagedFamily)
+        return fam
+    }
+
     const tableRowsQueryAtomFamily = atomFamily(
         (params: TableRowAtomKey) =>
             atomWithQuery<InfiniteTableFetchResult<ApiRow>>((get) => {
@@ -213,6 +262,7 @@ export const createInfiniteTableStore = <
                 }
             }),
         rowsKeyEquals,
+        "infiniteTable.tableRowsQueryAtomFamily",
     )
 
     const tableSkeletonRowsAtomFamily = atomFamily(
@@ -221,6 +271,7 @@ export const createInfiniteTableStore = <
                 return ensureSkeletonRows(key)
             }),
         rowsKeyEquals,
+        "infiniteTable.tableSkeletonRowsAtomFamily",
     )
 
     const tableRowsAtomFamily = atomFamily(
@@ -244,34 +295,39 @@ export const createInfiniteTableStore = <
                 })
             }),
         rowsKeyEquals,
+        "infiniteTable.tableRowsAtomFamily",
     )
 
-    const tablePagesAtomFamily = atomFamily(({scopeId, pageSize}: TablePagesKey) => {
-        const baseAtom = atom<{pages: InfiniteTablePage[]}>({
-            pages: [
-                {
-                    offset: 0,
-                    limit: pageSize,
-                    cursor: null,
-                    windowing: null,
-                },
-            ],
-        })
+    const tablePagesAtomFamily = atomFamily(
+        ({scopeId, pageSize}: TablePagesKey) => {
+            const baseAtom = atom<{pages: InfiniteTablePage[]}>({
+                pages: [
+                    {
+                        offset: 0,
+                        limit: pageSize,
+                        cursor: null,
+                        windowing: null,
+                    },
+                ],
+            })
 
-        return atom(
-            (get) => get(baseAtom),
-            (
-                get,
-                set,
-                update:
-                    | {pages: InfiniteTablePage[]}
-                    | ((prev: {pages: InfiniteTablePage[]}) => {pages: InfiniteTablePage[]}),
-            ) => {
-                const nextValue = typeof update === "function" ? update(get(baseAtom)) : update
-                set(baseAtom, nextValue)
-            },
-        )
-    }, pagesKeyEquals)
+            return atom(
+                (get) => get(baseAtom),
+                (
+                    get,
+                    set,
+                    update:
+                        | {pages: InfiniteTablePage[]}
+                        | ((prev: {pages: InfiniteTablePage[]}) => {pages: InfiniteTablePage[]}),
+                ) => {
+                    const nextValue = typeof update === "function" ? update(get(baseAtom)) : update
+                    set(baseAtom, nextValue)
+                },
+            )
+        },
+        pagesKeyEquals,
+        "infiniteTable.tablePagesAtomFamily",
+    )
 
     const tableCombinedRowsAtomFamily = atomFamily(
         ({scopeId, pageSize}: TablePagesKey) =>
@@ -287,6 +343,7 @@ export const createInfiniteTableStore = <
                 return combined
             }),
         pagesKeyEquals,
+        "infiniteTable.tableCombinedRowsAtomFamily",
     )
 
     const tablePaginationInfoAtomFamily = atomFamily(
@@ -324,6 +381,7 @@ export const createInfiniteTableStore = <
                 }
             }),
         pagesKeyEquals,
+        "infiniteTable.tablePaginationInfoAtomFamily",
     )
 
     const createInitialPage = (pageSize: number): InfiniteTablePage => ({
@@ -362,6 +420,7 @@ export const createInfiniteTableStore = <
                 })
             }),
         pagesKeyEquals,
+        "infiniteTable.tableScheduleNextPageAtomFamily",
     )
 
     return {
@@ -375,5 +434,31 @@ export const createInfiniteTableStore = <
             rowsQueryAtomFamily: tableRowsQueryAtomFamily,
         },
         createInitialPage,
+        // Release every atom family entry this store owns. Call after a
+        // long-running loop finishes (or per-iteration in an ETL pass that
+        // rotates scopeId) to avoid the linear growth that comes from
+        // jotai-family's monotonic memoization.
+        dispose() {
+            let total = 0
+            for (const f of ownedFamilies) {
+                total += f.size()
+                f.clear()
+            }
+            // Also remove the TanStack queries this store wrote. Each query is
+            // keyed by `[options.key, scopeId, ...]`, so removing by that
+            // prefix releases all of them. Without this, a long-running ETL
+            // pass accumulates ~50 KB/iter of TanStack observer state.
+            try {
+                const qc = getDefaultStore().get(queryClientAtom)
+                if (qc) qc.removeQueries({queryKey: [options.key]})
+            } catch {
+                // No queryClient available
+            }
+            return total
+        },
+        // Diagnostic: per-family active param counts for this store instance.
+        familySizes() {
+            return ownedFamilies.map((f) => ({name: f.name, size: f.size()}))
+        },
     }
 }
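The local `atomFamily` shim accepts both `(create, name)` and `(create, areEqual, name)`. The argument resolution on its own can be sketched as follows; `resolveFamilyArgs` is a hypothetical helper name used only for illustration and does not exist in the diff.

```typescript
type AreEqual<P> = (a: P, b: P) => boolean

interface ResolvedFamilyArgs<P> {
    areEqual: AreEqual<P> | undefined
    name: string | undefined
}

// Hypothetical helper: the second positional argument is either the
// equality predicate (original call shape) or the diagnostic name
// (shape added by the migration).
function resolveFamilyArgs<P>(
    areEqualOrName?: AreEqual<P> | string,
    nameArg?: string,
): ResolvedFamilyArgs<P> {
    if (typeof areEqualOrName === "function") {
        return {areEqual: areEqualOrName, name: nameArg}
    }
    if (typeof areEqualOrName === "string") {
        return {areEqual: undefined, name: areEqualOrName}
    }
    return {areEqual: undefined, name: undefined}
}

interface PagesKey {
    scopeId: string
    pageSize: number
}
const pagesKeyEquals: AreEqual<PagesKey> = (a, b) =>
    a.scopeId === b.scopeId && a.pageSize === b.pageSize

const withPredicate = resolveFamilyArgs(pagesKeyEquals, "infiniteTable.tablePagesAtomFamily")
const nameOnly = resolveFamilyArgs<PagesKey>("infiniteTable.tablePagesAtomFamily")
const bare = resolveFamilyArgs<PagesKey>()
```

Keeping the predicate in its original position means existing call sites only gain a trailing name argument, so the equality function cannot be dropped by accident during the migration.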
diff --git a/web/packages/agenta-entities/src/shared/paginated/createPaginatedEntityStore.ts b/web/packages/agenta-entities/src/shared/paginated/createPaginatedEntityStore.ts
index 4822a0016b..6d8cca2c68 100644
--- a/web/packages/agenta-entities/src/shared/paginated/createPaginatedEntityStore.ts
+++ b/web/packages/agenta-entities/src/shared/paginated/createPaginatedEntityStore.ts
@@ -10,8 +10,10 @@ import type {Key} from "react"
 import type {Atom, PrimitiveAtom, WritableAtom} from "jotai"
 import {atom} from "jotai"
 import {getDefaultStore} from "jotai"
-import {atomFamily} from "jotai-family"
 
+// Use the instrumented wrapper so each store can be `dispose()`-d.
+// See createInfiniteTableStore.ts for rationale.
+import {instrumentedAtomFamily} from "../molecule/instrumentedAtomFamily"
 import type {InfiniteTableFetchResult, InfiniteTableRowBase, WindowingState} from "../tableTypes"
 
 import {createSimpleTableStore} from "./createSimpleTableStore"
@@ -309,6 +311,19 @@ export interface PaginatedEntityStore<
          */
         refresh: WritableAtom<number, [], void>
     }
+
+    /**
+     * Release every atomFamily entry this store + its underlying table store
+     * own. Returns the total count of params removed. Call after a long-running
+     * ETL pass to release accumulated closures from rotated scopeIds.
+     */
+    dispose: () => number
+
+    /**
+     * Diagnostic: per-family active param counts for this store instance.
+     * Includes both this store's families and the inner table store's.
+     */
+    familySizes: () => {name: string; size: number}[]
 }
 
 // ============================================================================
@@ -396,6 +411,49 @@ export function createPaginatedEntityStore<
     // Helper to create params key for atomFamily
     const paramsKey = (params: PaginatedControllerParams) => `${params.scopeId}:${params.pageSize}`
 
+    // Per-store family registry — dispose() iterates this to release every
+    // entry. See createInfiniteTableStore.ts for the same pattern.
+    interface ManagedFamily {
+        clear: () => void
+        size: () => number
+        readonly name: string
+    }
+    const ownedFamilies: ManagedFamily[] = []
+    // Accept both call shapes to be drop-in compatible with jotai-family:
+    //   atomFamily(create, name)                — added by our migration
+    //   atomFamily(create, areEqual, name)      — preserves original equality fn
+    // The original jotai-family signature is (create, areEqual?). If we
+    // dropped areEqual during the migration, param objects would be compared
+    // by reference identity instead of structural equality, so every call
+    // would create a fresh atom and break memoization (visible as
+    // pagination state being lost between chunks).
+    // Constrain `A extends Atom<unknown>` to match `instrumentedAtomFamily`'s
+    // signature so the atom-type generic survives the wrapper. Returning the
+    // erased `…<P, never>` shape (as the earlier version did) collapsed every
+    // `get(family(params))` call to `unknown` — that's the leak the `as never`
+    // casts were papering over.
+    const atomFamily = <P, A extends Atom<unknown>>(
+        create: (p: P) => A,
+        areEqualOrName?: ((a: P, b: P) => boolean) | string,
+        nameArg?: string,
+    ) => {
+        let resolvedName: string | undefined
+        let resolvedAreEqual: ((a: P, b: P) => boolean) | undefined
+        if (typeof areEqualOrName === "function") {
+            resolvedAreEqual = areEqualOrName
+            resolvedName = nameArg
+        } else if (typeof areEqualOrName === "string") {
+            resolvedName = areEqualOrName
+        }
+        const fam = instrumentedAtomFamily<P, A>(create, {
+            name: resolvedName,
+            skipRegistry: true,
+            areEqual: resolvedAreEqual,
+        })
+        ownedFamilies.push(fam as unknown as ManagedFamily)
+        return fam
+    }
+
     // Rows selector atom family
     const rowsAtomFamily = atomFamily(
         (params: PaginatedControllerParams) =>
@@ -404,6 +462,7 @@ export function createPaginatedEntityStore<
                 return get(rowsAtom)
             }),
         (a, b) => paramsKey(a) === paramsKey(b),
+        "paginatedEntity.rowsAtomFamily",
     )
 
     // Pagination state selector atom family
@@ -414,6 +473,7 @@ export function createPaginatedEntityStore<
                 return get(paginationAtom)
             }),
         (a, b) => paramsKey(a) === paramsKey(b),
+        "paginatedEntity.paginationAtomFamily",
     )
 
     // Selection atom family (uses underlying store's selection)
@@ -421,6 +481,7 @@ export function createPaginatedEntityStore<
         (params: PaginatedControllerParams) =>
             datasetStore.atoms.selectionAtom({scopeId: params.scopeId}),
         (a, b) => a.scopeId === b.scopeId,
+        "paginatedEntity.selectionAtomFamily",
     )
 
     // Combined state atom family (rows + pagination) - read-only
@@ -437,6 +498,7 @@ export function createPaginatedEntityStore<
                 }
             }),
         (a, b) => paramsKey(a) === paramsKey(b),
+        "paginatedEntity.stateAtomFamily",
     )
 
     // List counts atom family - unified count summary
@@ -487,6 +549,7 @@ export function createPaginatedEntityStore<
                 }
             }),
         (a, b) => paramsKey(a) === paramsKey(b),
+        "paginatedEntity.listCountsAtomFamily",
     )
 
     // Controller atom family - combines all state + dispatch
@@ -546,6 +609,7 @@ export function createPaginatedEntityStore<
                 },
             ),
         (a, b) => paramsKey(a) === paramsKey(b),
+        "paginatedEntity.controllerAtomFamily",
     )
 
     return {
@@ -566,6 +630,36 @@ export function createPaginatedEntityStore<
         actions: {
             refresh: refreshAtom,
         },
+
+        // Release every atomFamily entry this store + its underlying
+        // infiniteTableStore own. After a long-running ETL pass that rotates
+        // scopeId per iteration, call dispose() to release the accumulated
+        // closures — otherwise heap grows ~50 KB per iteration from the
+        // 13 internal atom families (6 here + 7 in createInfiniteTableStore).
+        dispose() {
+            let total = 0
+            for (const f of ownedFamilies) {
+                total += f.size()
+                f.clear()
+            }
+            // Cascade into the table store
+            const inner = (datasetStore as unknown as {dispose?: () => number})?.dispose
+            if (typeof inner === "function") {
+                total += inner.call(datasetStore)
+            }
+            return total
+        },
+
+        // Diagnostic: per-family active param counts for this store instance.
+        familySizes() {
+            const own = ownedFamilies.map((f) => ({name: f.name, size: f.size()}))
+            const inner = (
+                datasetStore as unknown as {
+                    familySizes?: () => {name: string; size: number}[]
+                }
+            )?.familySizes
+            return typeof inner === "function" ? [...own, ...inner.call(datasetStore)] : own
+        },
     }
 }
 
diff --git a/web/packages/agenta-entities/src/simpleQueue/etl/addMatchingTracesToQueue.ts b/web/packages/agenta-entities/src/simpleQueue/etl/addMatchingTracesToQueue.ts
new file mode 100644
index 0000000000..f00b3c7136
--- /dev/null
+++ b/web/packages/agenta-entities/src/simpleQueue/etl/addMatchingTracesToQueue.ts
@@ -0,0 +1,185 @@
+/**
+ * addAllMatchingTracesToQueue — the batch-add-to-queue ETL pipeline.
+ *
+ * Composes the generic ETL primitives — `makeSourceFromCursorFetch` →
+ * `makeUniqueKeyTransform` → `makeBufferedBatchSink` — driven by `runLoop`.
+ * Pages every trace matching a filter, dedups by `trace_id`, and flushes
+ * trace ids to an annotation queue.
+ *
+ * Transport is injected: `fetchPage` (the trace-query adapter, owned by the
+ * observability layer) and `addTraces` (the queue mutation). The pipeline
+ * itself depends only on `@agenta/entities/etl` — pure and unit-testable.
+ *
+ * See docs/designs/etl-batch-add-traces.md.
+ *
+ * @packageDocumentation
+ */
+
+import {
+    makeBufferedBatchSink,
+    makeSourceFromCursorFetch,
+    makeUniqueKeyTransform,
+    runLoop,
+    type CursorPage,
+    type LoopResult,
+} from "../../etl"
+
+/** Minimal shape the pipeline reads from a scanned trace row. */
+export interface ScannedTrace {
+    trace_id?: string
+}
+
+/** One page of traces matching the filter. */
+export type TracePage = CursorPage<ScannedTrace>
+
+/** Fetches one page of matching traces; `cursor` is `null` for the first page. */
+export type TracePageFetcher = (cursor: string | null, signal: AbortSignal) => Promise<TracePage>
+
+/** Queue mutation — resolves to the queue id on success, `null` on failure. */
+export type AddTracesToQueue = (queueId: string, traceIds: string[]) => Promise<string | null>
+
+export interface AddMatchingTracesProgress {
+    /** Trace rows scanned so far. */
+    scanned: number
+    /** Unique trace ids submitted to the queue so far. */
+    queued: number
+}
+
+export interface AddMatchingTracesResult {
+    /** Unique trace ids submitted to the queue. */
+    queued: number
+    /** Trace rows scanned. */
+    scanned: number
+    /** How the run ended. */
+    stoppedBy: "done" | "cancelled" | "cap"
+}
+
+export interface AddMatchingTracesOptions {
+    /** Fetches one page of traces matching the filter. */
+    fetchPage: TracePageFetcher
+    /** Queue mutation that adds a batch of trace ids. */
+    addTraces: AddTracesToQueue
+    /** Target queue. Must be a `kind = traces` queue. */
+    queueId: string
+    /** Abort signal — cancels the scan and stops further flushes. */
+    signal?: AbortSignal
+    /** Trace ids per flush. Default 250. */
+    batchSize?: number
+    /** Delay between page requests, ms. Default 100. */
+    pageDelayMs?: number
+    /** Trace ids already in the queue — excluded from the add. */
+    excludeTraceIds?: ReadonlySet<string>
+    /**
+     * Hard cap on unique trace ids queued in one run. The scan stops once
+     * this many matching ids have been collected. Default `DEFAULT_MAX_ITEMS` (1000).
+     */
+    maxItems?: number
+    /** Live progress callback, fired after each page and once at the end. */
+    onProgress?: (progress: AddMatchingTracesProgress) => void
+}
+
+/**
+ * Hard cap on unique trace ids queued in one run. Keeps a single batch-add
+ * bounded — to queue more, narrow the filter and run again.
+ */
+export const DEFAULT_MAX_ITEMS = 1_000
+
+/**
+ * Scan every trace matching the filter and add its id to an annotation queue.
+ * Resolves when the scan completes, is cancelled, or hits the cap. Throws
+ * `BatchFlushError` (from `@agenta/entities/etl`, re-exported by this
+ * module's index) if a flush fails mid-run.
+ */
+export const addAllMatchingTracesToQueue = async ({
+    fetchPage,
+    addTraces,
+    queueId,
+    signal,
+    batchSize,
+    pageDelayMs,
+    excludeTraceIds,
+    maxItems = DEFAULT_MAX_ITEMS,
+    onProgress,
+}: AddMatchingTracesOptions): Promise<AddMatchingTracesResult> => {
+    const source = makeSourceFromCursorFetch<ScannedTrace>({fetchPage, pageDelayMs})
+    const transform = makeUniqueKeyTransform<ScannedTrace>({
+        selectKey: (trace) => trace.trace_id,
+        exclude: excludeTraceIds,
+        // The transform stops emitting past the cap, so the queued count
+        // never overshoots `maxItems` even on a large final page.
+        limit: maxItems,
+    })
+    const sinkHandle = makeBufferedBatchSink<string>({
+        batchSize,
+        signal,
+        flush: async (batch) => {
+            const result = await addTraces(queueId, batch)
+            // The queue mutation signals a handled failure with `null`.
+            if (result == null) {
+                throw new Error(`addTraces returned null for queue ${queueId}`)
+            }
+        },
+    })
+
+    const gen = runLoop<ScannedTrace, string>(
+        source,
+        [transform],
+        sinkHandle.sink,
+        undefined,
+        signal,
+    )
+
+    let scanned = 0
+    let cappedOut = false
+
+    try {
+        while (true) {
+            const step = await gen.next()
+            if (step.done) {
+                scanned = step.value.scanned
+                break
+            }
+            scanned = step.value.scanned
+            // `Progress.loaded` is flushed-so-far; the final partial batch is
+            // reconciled from the sink handle after the loop.
+            onProgress?.({scanned, queued: step.value.loaded})
+
+            // Cap on matched (deduplicated) ids, not rows scanned — the
+            // transform's own `limit` already stops it emitting past the cap,
+            // so this just ends the scan early instead of paging on uselessly.
+            // A `cursor` of null means the source is already exhausted: that
+            // is a clean `done`, not a cap, even if `matched` landed exactly
+            // on the limit.
+            if (step.value.matched >= maxItems && step.value.cursor !== null) {
+                cappedOut = true
+                // Stop the generator — its `finally` runs `sink.finalize`,
+                // flushing the buffered remainder.
+                await gen.return({
+                    scanned,
+                    matched: step.value.matched,
+                    loaded: sinkHandle.getFlushedCount(),
+                    cursor: null,
+                    done: true,
+                } satisfies LoopResult)
+                break
+            }
+        }
+    } catch (err) {
+        // An abort surfaces as a thrown error from an in-flight request — if
+        // the caller cancelled, treat any failure as cancellation.
+        if (signal?.aborted) {
+            return {queued: sinkHandle.getFlushedCount(), scanned, stoppedBy: "cancelled"}
+        }
+        // BatchFlushError (and anything else) surfaces to the caller.
+        throw err
+    }
+
+    const queued = sinkHandle.getFlushedCount()
+    onProgress?.({scanned, queued})
+
+    return {
+        queued,
+        scanned,
+        stoppedBy: cappedOut ? "cap" : signal?.aborted ? "cancelled" : "done",
+    }
+}
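The dedup, exclude, cap, and batch rules that the pipeline composes can be checked with a synchronous stand-in. The sketch below is a simplification under stated assumptions: `planBatches` is a hypothetical helper, pages are passed as plain arrays (the real `CursorPage` shape is not shown in this diff), and "no further page" models a `null` cursor. It does not use `runLoop` or the real ETL primitives.

```typescript
interface ScannedTrace {
    trace_id?: string
}

interface BatchPlan {
    batches: string[][]
    scanned: number
    stoppedBy: "done" | "cap"
}

// Hypothetical synchronous model of the pipeline's rules.
function planBatches(
    pages: ScannedTrace[][],
    opts: {batchSize: number; maxItems: number; exclude?: ReadonlySet<string>},
): BatchPlan {
    const seen = new Set<string>()
    const batches: string[][] = []
    let buffer: string[] = []
    let scanned = 0
    let stoppedBy: "done" | "cap" = "done"
    let pageIndex = 0

    for (const page of pages) {
        for (const trace of page) {
            scanned += 1
            const id = trace.trace_id
            // Skip rows without an id, duplicates, and ids already queued.
            if (!id || seen.has(id) || opts.exclude?.has(id)) continue
            // Past the cap nothing more is emitted, even on a large page.
            if (seen.size >= opts.maxItems) continue
            seen.add(id)
            buffer.push(id)
            if (buffer.length >= opts.batchSize) {
                batches.push(buffer)
                buffer = []
            }
        }
        pageIndex += 1
        const hasMorePages = pageIndex < pages.length
        // Reaching the cap only counts as "cap" if the source is not exhausted.
        if (seen.size >= opts.maxItems && hasMorePages) {
            stoppedBy = "cap"
            break
        }
    }
    // Flush the partial final batch, as the sink's finalize step would.
    if (buffer.length > 0) batches.push(buffer)
    return {batches, scanned, stoppedBy}
}

const tracePages: ScannedTrace[][] = [
    [{trace_id: "a"}, {trace_id: "b"}, {trace_id: "a"}],
    [{trace_id: "c"}, {}, {trace_id: "d"}],
]
const alreadyQueued = new Set(["b"])

const full = planBatches(tracePages, {batchSize: 2, maxItems: 10, exclude: alreadyQueued})
const capped = planBatches(tracePages, {batchSize: 2, maxItems: 1, exclude: alreadyQueued})
const exact = planBatches(tracePages, {batchSize: 2, maxItems: 3, exclude: alreadyQueued})
```

In this model `exact` lands on the limit at the last page and still ends as `"done"`, which matches the intent of the `cursor !== null` check in the loop above.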
diff --git a/web/packages/agenta-entities/src/simpleQueue/etl/index.ts b/web/packages/agenta-entities/src/simpleQueue/etl/index.ts
new file mode 100644
index 0000000000..cef6596f56
--- /dev/null
+++ b/web/packages/agenta-entities/src/simpleQueue/etl/index.ts
@@ -0,0 +1,23 @@
+/**
+ * @agenta/entities/simpleQueue/etl
+ *
+ * The batch-add-to-queue ETL pipeline — scans every trace matching a filter
+ * and adds the deduplicated trace ids to an annotation queue. Composes the
+ * generic ETL primitives from `@agenta/entities/etl`.
+ *
+ * @packageDocumentation
+ */
+
+export {addAllMatchingTracesToQueue, DEFAULT_MAX_ITEMS} from "./addMatchingTracesToQueue"
+export type {
+    AddMatchingTracesOptions,
+    AddMatchingTracesProgress,
+    AddMatchingTracesResult,
+    AddTracesToQueue,
+    ScannedTrace,
+    TracePage,
+    TracePageFetcher,
+} from "./addMatchingTracesToQueue"
+
+// Re-exported so consumers have one import surface for the feature.
+export {BatchFlushError} from "../../etl"
diff --git a/web/packages/agenta-entities/src/testcase/api/api.ts b/web/packages/agenta-entities/src/testcase/api/api.ts
index ba2df27025..b7e6ef7c4e 100644
--- a/web/packages/agenta-entities/src/testcase/api/api.ts
+++ b/web/packages/agenta-entities/src/testcase/api/api.ts
@@ -9,7 +9,11 @@ import {axios, getAgentaApiUrl} from "@agenta/shared/api"
 import {getDefaultStore} from "jotai/vanilla"
 import {queryClientAtom} from "jotai-tanstack-query"
 
-import {safeParseWithLogging} from "../../shared"
+// Import from the pure zodSchema source rather than the shared barrel. The
+// shared barrel transitively re-exports paginated/table helpers that depend on
+// agenta-ui (CSS modules), which breaks Node-side execution (scripts, tests,
+// ETL adapters). The api layer must stay Node-safe.
+import {safeParseWithLogging} from "../../shared/utils/zodSchema"
 import {testcasesResponseSchema, type Testcase, type TestcasesResponse} from "../core"
 import type {
     TestcaseDetailParams,
@@ -98,6 +102,19 @@ export async function fetchTestcasesBatch(
             for (const testcase of validatedResponse.testcases) {
                 results.set(testcase.id, testcase)
             }
+            // Populate the TanStack cache so subsequent reads via
+            // `testcaseMolecule.get.data(id)` and `prefetchTestcasesByIds`
+            // hit the cache. Matches the cache-write behaviour of
+            // `fetchTestcasesPage`, so the two batch fetchers are now consistent.
+            try {
+                const store = getDefaultStore()
+                const queryClient = store.get(queryClientAtom)
+                for (const tc of validatedResponse.testcases) {
+                    queryClient.setQueryData(["testcase", projectId, tc.id], tc)
+                }
+            } catch {
+                // Silently ignore if the query client is not available (SSR, scripts).
+            }
         }
     } catch (error) {
         console.error("[fetchTestcasesBatch] Failed to fetch testcases:", error)
diff --git a/web/packages/agenta-entities/src/testcase/core/schema.ts b/web/packages/agenta-entities/src/testcase/core/schema.ts
index 4b0c14c5de..7469c12343 100644
--- a/web/packages/agenta-entities/src/testcase/core/schema.ts
+++ b/web/packages/agenta-entities/src/testcase/core/schema.ts
@@ -21,12 +21,13 @@
 
 import {z} from "zod"
 
+// See testcase/api/api.ts for rationale — the shared barrel pulls in CSS deps.
 import {
     createEntitySchemaSet,
     COMMON_SERVER_FIELDS,
     jsonValueSchema,
     safeParseWithLogging,
-} from "../../shared"
+} from "../../shared/utils/zodSchema"
 
 // ============================================================================
 // HELPER SCHEMAS
diff --git a/web/packages/agenta-entities/src/testcase/index.ts b/web/packages/agenta-entities/src/testcase/index.ts
index 2f995c1d79..db5361b873 100644
--- a/web/packages/agenta-entities/src/testcase/index.ts
+++ b/web/packages/agenta-entities/src/testcase/index.ts
@@ -130,3 +130,10 @@ export {testcasePaginatedStore} from "./state"
 export {testcaseDataController} from "./state"
 
 export type {TestcaseTableRow, TestcasePaginatedMeta, TestcaseDataConfig} from "./state"
+
+/**
+ * Cache-aware bulk prefetch for testcases by ID list. Used by the ETL
+ * hydrate pipeline and downstream cell renderers. Writes results to the
+ * shared TanStack cache at `["testcase", projectId, testcaseId]`.
+ */
+export {prefetchTestcasesByIds} from "./state"
diff --git a/web/packages/agenta-entities/src/testcase/state/index.ts b/web/packages/agenta-entities/src/testcase/state/index.ts
index 6b86bfd7cf..2ca3b386fd 100644
--- a/web/packages/agenta-entities/src/testcase/state/index.ts
+++ b/web/packages/agenta-entities/src/testcase/state/index.ts
@@ -8,6 +8,14 @@
 // Molecule (primary API)
 export {testcaseMolecule, type TestcaseMolecule, type CreateTestcasesOptions} from "./molecule"
 
+// Cache-aware bulk prefetch (reads TanStack cache, fetches misses)
+export {
+    prefetchTestcasesByIds,
+    invalidateTestcase,
+    type PrefetchTestcasesArgs,
+    type PrefetchTestcasesOutcome,
+} from "./prefetch"
+
 // Store atoms (for advanced use cases)
 export {
     // Context
diff --git a/web/packages/agenta-entities/src/testcase/state/molecule.ts b/web/packages/agenta-entities/src/testcase/state/molecule.ts
index f77163510b..64e604c69a 100644
--- a/web/packages/agenta-entities/src/testcase/state/molecule.ts
+++ b/web/packages/agenta-entities/src/testcase/state/molecule.ts
@@ -30,16 +30,22 @@ import {
     getItemsAtPath,
     type DataPath,
 } from "@agenta/shared/utils"
+import type {PathItem} from "@agenta/shared/utils"
 import {atom} from "jotai"
 import {getDefaultStore} from "jotai/vanilla"
 import {atomFamily} from "jotai-family"
 
-import {createMolecule, extendMolecule, createControllerAtomFamily} from "../../shared"
-import type {StoreOptions, PathItem, LoadableRow, LoadableColumn} from "../../shared"
+// Deep-import (shared/molecule, shared/entityBridge): the shared barrel pulls in CSS deps.
+import type {LoadableRow, LoadableColumn} from "../../shared/entityBridge"
+import {createControllerAtomFamily} from "../../shared/molecule/createControllerAtomFamily"
+import {createMolecule} from "../../shared/molecule/createMolecule"
+import {extendMolecule} from "../../shared/molecule/extendMolecule"
+import type {StoreOptions} from "../../shared/molecule/types"
 import type {Column, Testcase} from "../core"
 import {createLocalTestcase} from "../core"
 
 import {testcasesRevisionIdAtom, initializeEmptyRevisionAtom} from "./paginatedStore"
+import {evictTestcasesByIds, prefetchTestcasesByIds} from "./prefetch"
 import {
     // Query and entity atoms
     testcaseQueryAtomFamily,
@@ -892,6 +898,25 @@ export const testcaseMolecule = {
         discardSelectionDraft: discardSelectionDraftAtom,
         /** Initialize empty revision with default testcase (for "create from scratch" flow) */
         initializeEmptyRevision: initializeEmptyRevisionAtom,
+        /**
+         * Cache-aware bulk prefetch by testcase ID list. Same shape as
+         * `evaluationResultMolecule.actions.prefetchByScenarioIds` and
+         * `evaluationMetricMolecule.actions.prefetchByScenarioIds` —
+         * makes the prefetch surface symmetric across the 4 ETL-hydrated
+         * entity types. Writes to the shared TanStack cache at
+         * `["testcase", projectId, testcaseId]`.
+         *
+         * Wraps the existing `prefetchTestcasesByIds` standalone function
+         * (kept for backwards compatibility with non-molecule consumers).
+         */
+        prefetchByIds: prefetchTestcasesByIds,
+        /**
+         * Cache-aware bulk eviction by testcase ID list — the per-chunk
+         * counterpart of `prefetchByIds`. An ETL chunk-release hook calls
+         * this once the sink has consumed a chunk so heap stays bounded
+         * by chunk size. Wraps `evictTestcasesByIds`.
+         */
+        evictByIds: evictTestcasesByIds,
     },
 
     /**
diff --git a/web/packages/agenta-entities/src/testcase/state/prefetch.ts b/web/packages/agenta-entities/src/testcase/state/prefetch.ts
new file mode 100644
index 0000000000..390965ff3c
--- /dev/null
+++ b/web/packages/agenta-entities/src/testcase/state/prefetch.ts
@@ -0,0 +1,138 @@
+/**
+ * Cache-aware bulk-prefetch for testcases.
+ *
+ * The `fetchTestcasesBatch` API function writes to the shared TanStack
+ * cache (`["testcase", projectId, id]`) but never reads it, so repeated
+ * calls refetch ids that are already cached. This wrapper closes that gap:
+ *
+ *   1. Read each requested id from the cache
+ *   2. Partition into hits vs misses
+ *   3. Bulk-fetch ONLY the misses
+ *   4. Merge cached + fetched and return
+ *
+ * Co-existence with `fetchTestcasesBatch` is safe — it writes the same cache
+ * keys after the network call, so newly-fetched rows land in the cache for
+ * the next reader regardless of which path called it.
+ *
+ * @packageDocumentation
+ */
+
+import {getDefaultStore} from "jotai/vanilla"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import {fetchTestcasesBatch} from "../api"
+import type {Testcase} from "../core"
+
+function cacheKey(projectId: string, id: string) {
+    return ["testcase", projectId, id] as const
+}
+
+function getQc() {
+    return getDefaultStore().get(queryClientAtom)
+}
+
+export interface PrefetchTestcasesArgs {
+    projectId: string
+    testcaseIds: string[]
+}
+
+export interface PrefetchTestcasesOutcome {
+    /** All testcases, keyed by id. Cached entries are merged with freshly fetched ones. */
+    testcases: Map<string, Testcase>
+    cacheHits: number
+    cacheMisses: number
+    fetchMs: number
+}
+
+export async function prefetchTestcasesByIds(
+    args: PrefetchTestcasesArgs,
+): Promise<PrefetchTestcasesOutcome> {
+    const {projectId, testcaseIds} = args
+
+    if (testcaseIds.length === 0) {
+        return {testcases: new Map(), cacheHits: 0, cacheMisses: 0, fetchMs: 0}
+    }
+
+    let qc: ReturnType<typeof getQc> | null = null
+    try {
+        qc = getQc()
+    } catch {
+        // No Jotai store — degrade to full fetch
+    }
+
+    const out = new Map<string, Testcase>()
+    const misses: string[] = []
+
+    if (qc) {
+        for (const id of testcaseIds) {
+            const cached = qc.getQueryData<Testcase>(cacheKey(projectId, id))
+            if (cached) {
+                out.set(id, cached)
+            } else {
+                misses.push(id)
+            }
+        }
+    } else {
+        misses.push(...testcaseIds)
+    }
+
+    let fetchMs = 0
+    if (misses.length > 0) {
+        const start = performance.now()
+        const fetched = await fetchTestcasesBatch({projectId, testcaseIds: misses})
+        fetchMs = performance.now() - start
+        fetched.forEach((tc, id) => out.set(id, tc))
+        // fetchTestcasesBatch already writes to TanStack cache, so no extra work here.
+    }
+
+    return {
+        testcases: out,
+        cacheHits: testcaseIds.length - misses.length,
+        cacheMisses: misses.length,
+        fetchMs,
+    }
+}
+
+/**
+ * Remove a single testcase's cache entry so the next read refetches it.
+ */
+export function invalidateTestcase({
+    projectId,
+    testcaseId,
+}: {
+    projectId: string
+    testcaseId: string
+}) {
+    try {
+        getQc().removeQueries({queryKey: cacheKey(projectId, testcaseId)})
+    } catch {} // No queryClient (SSR, scripts): nothing to remove.
+}
+
+/**
+ * Bulk-evict testcase cache entries — the per-chunk counterpart of
+ * `prefetchTestcasesByIds`. An ETL chunk-release hook calls this once a
+ * chunk is consumed so heap stays bounded by chunk size, not dataset
+ * size. Returns the number of entries actually removed.
+ */
+export function evictTestcasesByIds({
+    projectId,
+    testcaseIds,
+}: {
+    projectId: string
+    testcaseIds: string[]
+}): number {
+    let removed = 0
+    try {
+        const qc = getQc()
+        for (const id of testcaseIds) {
+            const key = cacheKey(projectId, id)
+            if (qc.getQueryData(key) !== undefined) {
+                qc.removeQueries({queryKey: key, exact: true})
+                removed++
+            }
+        }
+    } catch {
+        // No queryClient — nothing to evict.
+    }
+    return removed
+}
diff --git a/web/packages/agenta-entities/src/testcase/state/store.ts b/web/packages/agenta-entities/src/testcase/state/store.ts
index 90c22ac711..38e07d65b1 100644
--- a/web/packages/agenta-entities/src/testcase/state/store.ts
+++ b/web/packages/agenta-entities/src/testcase/state/store.ts
@@ -25,7 +25,11 @@ import {atomFamily} from "jotai-family"
 import {atomWithQuery, queryClientAtom} from "jotai-tanstack-query"
 import get from "lodash/get"
 
-import {createEntityDraftState, normalizeValueForComparison} from "../../shared"
+// Deep-import — shared/index barrel pulls in @agenta/ui CSS modules.
+import {
+    createEntityDraftState,
+    normalizeValueForComparison,
+} from "../../shared/molecule/createEntityDraftState"
 import {pendingColumnOpsAtomFamily} from "../../testset/state/revisionTableState"
 import {testcaseSchema, SYSTEM_FIELDS, type Testcase} from "../core"
 
diff --git a/web/packages/agenta-entities/src/trace/api/api.ts b/web/packages/agenta-entities/src/trace/api/api.ts
index 4c3efba766..0028971021 100644
--- a/web/packages/agenta-entities/src/trace/api/api.ts
+++ b/web/packages/agenta-entities/src/trace/api/api.ts
@@ -16,7 +16,8 @@
 
 import {axios, getAgentaApiUrl} from "@agenta/shared/api"
 
-import {safeParseWithLogging} from "../../shared"
+// See testcase/api/api.ts for rationale — the shared barrel pulls in CSS deps.
+import {safeParseWithLogging} from "../../shared/utils/zodSchema"
 import {
     spansResponseSchema,
     tracesResponseSchema,
diff --git a/web/packages/agenta-entities/src/trace/core/schema.ts b/web/packages/agenta-entities/src/trace/core/schema.ts
index ac18593fba..efaa4cc5cc 100644
--- a/web/packages/agenta-entities/src/trace/core/schema.ts
+++ b/web/packages/agenta-entities/src/trace/core/schema.ts
@@ -20,7 +20,12 @@
 
 import {z} from "zod"
 
-import {timestampFieldsSchema, auditFieldsSchema, safeParseWithLogging} from "../../shared"
+// See testcase/api/api.ts for rationale — the shared barrel pulls in CSS deps.
+import {
+    timestampFieldsSchema,
+    auditFieldsSchema,
+    safeParseWithLogging,
+} from "../../shared/utils/zodSchema"
 
 // --- ENUMS -------------------------------------------------------------------
 
diff --git a/web/packages/agenta-entities/src/trace/etl/adaptivePacing.ts b/web/packages/agenta-entities/src/trace/etl/adaptivePacing.ts
new file mode 100644
index 0000000000..9a3844ad10
--- /dev/null
+++ b/web/packages/agenta-entities/src/trace/etl/adaptivePacing.ts
@@ -0,0 +1,66 @@
+/**
+ * Adaptive request pacing — choose how long to sleep before the next page
+ * request based on the throttle bucket's headroom.
+ *
+ * The frontend can read `X-RateLimit-Remaining` / `X-RateLimit-Limit` on
+ * every successful response (the backend's throttling middleware emits both
+ * for any tenant scoped to a plan). This module turns that signal into a
+ * delay: short when the bucket has burst capacity left, long when it's
+ * drained toward the sustained refill rate.
+ *
+ * Bucket-aware, not tier-aware. The same logic works across free / pro /
+ * business / enterprise — the server tells us how full the bucket is, we
+ * don't need to know the plan.
+ *
+ * @packageDocumentation
+ */
+
+/**
+ * Latest throttle bucket reading. `null` when the backend didn't emit the
+ * corresponding header (OSS deployments without EE throttling, the very
+ * first request before any header has been seen, …).
+ */
+export interface AdaptivePacingRateLimit {
+    /** `X-RateLimit-Remaining` — tokens left in the bucket. */
+    remaining: number | null
+    /** `X-RateLimit-Limit` — bucket capacity. */
+    limit: number | null
+}
+
+/** Floor delay (ms) when the bucket looks full or unknown. */
+export const ADAPTIVE_FLOOR_DELAY_MS = 100
+/**
+ * Ceiling delay (ms) when the bucket is drained — paced at EE's TRACING_SLOW
+ * sustained refill rate of 1 token/second (the same across free / pro /
+ * business / enterprise plans), so the scan can't outrun the refill.
+ */
+export const ADAPTIVE_CEILING_DELAY_MS = 1_000
+/**
+ * Below this fill ratio the delay starts ramping up. Above it the floor
+ * wins. Keeps the scan running at full speed while there's meaningful
+ * burst headroom.
+ */
+export const ADAPTIVE_RAMP_START_FILL = 0.5
+
+/**
+ * Pick the next page-fetch delay (ms) given the latest bucket state.
+ *
+ * - `fill ≥ RAMP_START_FILL` → FLOOR  (run fast, burst capacity available)
+ * - `0 < fill < RAMP_START_FILL` → linear ramp toward CEILING
+ * - `fill ≤ 0` → CEILING  (paced at the sustained refill rate)
+ * - headers unavailable → FLOOR  (the 429-retry wrapper is the safety net)
+ */
+export const computeAdaptivePageDelayMs = (rateLimit: AdaptivePacingRateLimit): number => {
+    const {remaining, limit} = rateLimit
+    if (remaining == null || limit == null || limit <= 0) return ADAPTIVE_FLOOR_DELAY_MS
+
+    const fill = remaining / limit
+    if (fill >= ADAPTIVE_RAMP_START_FILL) return ADAPTIVE_FLOOR_DELAY_MS
+    if (fill <= 0) return ADAPTIVE_CEILING_DELAY_MS
+
+    // Linear ramp from FLOOR at `fill = RAMP_START_FILL` to CEILING at `fill = 0`.
+    const t = 1 - fill / ADAPTIVE_RAMP_START_FILL
+    const delay =
+        ADAPTIVE_FLOOR_DELAY_MS + t * (ADAPTIVE_CEILING_DELAY_MS - ADAPTIVE_FLOOR_DELAY_MS)
+    return Math.round(delay)
+}
diff --git a/web/packages/agenta-entities/src/trace/etl/exportMatchingTraces.ts b/web/packages/agenta-entities/src/trace/etl/exportMatchingTraces.ts
new file mode 100644
index 0000000000..cbabc0c5d7
--- /dev/null
+++ b/web/packages/agenta-entities/src/trace/etl/exportMatchingTraces.ts
@@ -0,0 +1,239 @@
+/**
+ * exportMatchingTraces — the bulk-trace export ETL pipeline.
+ *
+ * Composes the generic ETL primitives — `makeSourceFromCursorFetch` →
+ * flattenSpanTrees (transform) → dedupAndCap (transform) →
+ * `makeBufferedBatchSink` — driven by `runLoop`. Pages every trace matching a
+ * filter, flattens each tree into the flat list of its descendant spans,
+ * dedups by row key, caps at `maxRows`, and flushes deduplicated rows in
+ * fixed-size batches to a caller-provided `flushBatch` transport.
+ *
+ * Format-agnostic by design: the caller's `flushBatch` does the CSV/JSON/...
+ * encoding and accumulation. Mirrors `addAllMatchingTracesToQueue` — both
+ * pipelines share the same Source contract; the difference is what the sink
+ * does with each batch.
+ *
+ * @packageDocumentation
+ */
+
+import {
+    makeBufferedBatchSink,
+    makeSourceFromCursorFetch,
+    runLoop,
+    type Chunk,
+    type CursorPage,
+    type LoopResult,
+    type Transform,
+} from "../../etl"
+
+/**
+ * Minimum shape the pipeline reads from each scanned row. The actual rows
+ * passed through to `flushBatch` are whatever `T` the caller chose — the
+ * pipeline only ever reads `trace_id`, `span_id`, `parent_id`, `start_time`,
+ * and `children` here.
+ */
+export interface ScannedExportRow {
+    trace_id?: string
+    span_id?: string
+    parent_id?: string
+    start_time?: string | number
+    children?: readonly ScannedExportRow[]
+}
+
+/** One page of root traces matching the filter. */
+export type ExportTracePage<T extends ScannedExportRow> = CursorPage<T>
+
+/** Fetches one page of matching traces; `cursor` is `null` for the first page. */
+export type ExportTracePageFetcher<T extends ScannedExportRow> = (
+    cursor: string | null,
+    signal: AbortSignal,
+) => Promise<ExportTracePage<T>>
+
+/** A single batch of deduplicated rows handed to the caller's transport. */
+export type FlushBatch<T extends ScannedExportRow> = (rows: T[]) => Promise<void>
+
+export interface ExportMatchingTracesProgress {
+    /** Root traces yielded by the source so far (counts tree roots, not spans). */
+    scanned: number
+    /**
+     * Unique rows the pipeline has produced for the sink so far (post-
+     * dedup, post-cap). Reflects the user-visible "rows exported"
+     * regardless of the sink's batching cadence — the buffered batch sink
+     * holds rows until a full batch flushes, so reporting the flushed
+     * count instead leaves the UI stuck at 0 for the first 500 rows of
+     * every run.
+     */
+    rows: number
+}
+
+export interface ExportMatchingTracesResult {
+    /** Unique rows successfully flushed. */
+    rowCount: number
+    /** Root traces scanned. */
+    scanned: number
+    /** True iff the scan stopped because `maxRows` was reached. */
+    limitReached: boolean
+    /** How the run ended. */
+    stoppedBy: "done" | "cancelled" | "limit"
+}
+
+export interface ExportMatchingTracesOptions<T extends ScannedExportRow> {
+    /** Fetches one page of matching traces. */
+    fetchPage: ExportTracePageFetcher<T>
+    /** Transport that consumes one batch of deduplicated rows. */
+    flushBatch: FlushBatch<T>
+    /**
+     * Dedup key per row. Defaults to `trace_id:span_id`, falling back to
+     * `trace_id:parent_id:start_time` when one of the ids is missing.
+     */
+    selectKey?: (trace: T) => string
+    /** Abort signal — cancels the scan and stops further flushes. */
+    signal?: AbortSignal
+    /** Delay between page requests, ms. Default 100. */
+    pageDelayMs?: number
+    /** Rows per `flushBatch` call. Default 500. */
+    batchSize?: number
+    /** Hard cap on unique rows emitted to `flushBatch`. Default 20 000. */
+    maxRows?: number
+    /** Live progress callback, fired after each page and once at the end. */
+    onProgress?: (progress: ExportMatchingTracesProgress) => void
+}
+
+/** Hard cap on unique rows in one export. */
+export const DEFAULT_MAX_ROWS = 20_000
+/** Rows per `flushBatch` call. */
+export const DEFAULT_BATCH_SIZE = 500
+
+const defaultSelectKey = (trace: ScannedExportRow): string => {
+    if (trace.trace_id && trace.span_id) {
+        return `${trace.trace_id}:${trace.span_id}`
+    }
+    return `${trace.trace_id ?? ""}:${trace.parent_id ?? ""}:${trace.start_time ?? ""}`
+}
+
+const collectDescendants = <T extends ScannedExportRow>(node: T, out: T[]): void => {
+    out.push(node)
+    const children = node.children
+    if (children && Array.isArray(children)) {
+        for (const child of children) collectDescendants(child as T, out)
+    }
+}
+
+/** Expand each tree root into a flat list: the root itself plus all its descendant spans. */
+const makeFlattenSpanTreesTransform = <T extends ScannedExportRow>(): Transform<T, T> => {
+    return (chunk: Chunk<T>): Chunk<T> => {
+        const flat: T[] = []
+        for (const root of chunk.items) collectDescendants(root, flat)
+        return {items: flat, cursor: chunk.cursor, meta: chunk.meta}
+    }
+}
+
+/**
+ * Row-preserving dedup with an emission cap. Mirrors `makeUniqueKeyTransform`
+ * but emits the original rows (not just the keys) so the downstream sink can
+ * format them. Inline here because no other pipeline needs row-preserving
+ * dedup today — promote to a shared primitive when a second caller shows up.
+ */
+const makeDedupAndCapTransform = <T extends ScannedExportRow>(
+    selectKey: (t: T) => string,
+    maxRows: number,
+): Transform<T, T> => {
+    const seen = new Set<string>()
+    let emitted = 0
+    return (chunk: Chunk<T>): Chunk<T> => {
+        const out: T[] = []
+        for (const item of chunk.items) {
+            if (emitted >= maxRows) break
+            const key = selectKey(item)
+            if (seen.has(key)) continue
+            seen.add(key)
+            out.push(item)
+            emitted += 1
+        }
+        return {items: out, cursor: chunk.cursor, meta: chunk.meta}
+    }
+}
+
+/**
+ * Scan every trace matching the filter, flatten each tree, dedup, cap, and
+ * flush rows in fixed-size batches to the caller's transport. Resolves when
+ * the scan completes, is cancelled, or hits the row cap. Throws
+ * `BatchFlushError` (re-exported from `@agenta/entities/etl`) if a flush
+ * fails mid-run.
+ */
+export const exportMatchingTraces = async <T extends ScannedExportRow>({
+    fetchPage,
+    flushBatch,
+    selectKey = defaultSelectKey as (t: T) => string,
+    signal,
+    pageDelayMs,
+    batchSize = DEFAULT_BATCH_SIZE,
+    maxRows = DEFAULT_MAX_ROWS,
+    onProgress,
+}: ExportMatchingTracesOptions<T>): Promise<ExportMatchingTracesResult> => {
+    const source = makeSourceFromCursorFetch<T>({fetchPage, pageDelayMs})
+    const flatten = makeFlattenSpanTreesTransform<T>()
+    const dedupCap = makeDedupAndCapTransform<T>(selectKey, maxRows)
+    const sinkHandle = makeBufferedBatchSink<T>({batchSize, signal, flush: flushBatch})
+
+    const gen = runLoop<T, T>(source, [flatten, dedupCap], sinkHandle.sink, undefined, signal)
+
+    let scanned = 0
+    let limitReached = false
+
+    try {
+        while (true) {
+            const step = await gen.next()
+            if (step.done) {
+                scanned = step.value.scanned
+                break
+            }
+            scanned = step.value.scanned
+            // Report `matched` (post-dedup, post-cap) so the UI sees
+            // immediate progress per page — `loaded` lags by the buffered
+            // batch, which makes the toast appear stuck at 0 until the
+            // first full batch flushes.
+            onProgress?.({scanned, rows: step.value.matched})
+
+            // Cap on matched (deduplicated) rows, not raw spans. The transform
+            // already stops emitting past the cap; this just ends the scan
+            // early so the next page isn't fetched uselessly. A `cursor` of
+            // null means the source is already exhausted — a clean `done`,
+            // not a cap, even if `matched` lands exactly on the limit.
+            if (step.value.matched >= maxRows && step.value.cursor !== null) {
+                limitReached = true
+                await gen.return({
+                    scanned,
+                    matched: step.value.matched,
+                    loaded: sinkHandle.getFlushedCount(),
+                    cursor: null,
+                    done: true,
+                } satisfies LoopResult)
+                break
+            }
+        }
+    } catch (err) {
+        // An abort surfaces as a thrown error from an in-flight request —
+        // when the caller cancelled, treat any failure as cancellation.
+        if (signal?.aborted) {
+            return {
+                rowCount: sinkHandle.getFlushedCount(),
+                scanned,
+                limitReached: false,
+                stoppedBy: "cancelled",
+            }
+        }
+        // BatchFlushError (and anything else) surfaces to the caller.
+        throw err
+    }
+
+    const rowCount = sinkHandle.getFlushedCount()
+    onProgress?.({scanned, rows: rowCount})
+
+    return {
+        rowCount,
+        scanned,
+        limitReached,
+        stoppedBy: limitReached ? "limit" : signal?.aborted ? "cancelled" : "done",
+    }
+}
diff --git a/web/packages/agenta-entities/src/trace/etl/index.ts b/web/packages/agenta-entities/src/trace/etl/index.ts
new file mode 100644
index 0000000000..1284abd688
--- /dev/null
+++ b/web/packages/agenta-entities/src/trace/etl/index.ts
@@ -0,0 +1,39 @@
+/**
+ * @agenta/entities/trace/etl
+ *
+ * Bulk-trace export ETL pipelines — pages every trace matching a filter,
+ * flattens each tree, dedups, caps, and flushes rows in fixed-size batches to
+ * a caller-provided transport. Composes the generic primitives from
+ * `@agenta/entities/etl`. Format-agnostic — the caller's `flushBatch` does
+ * the CSV/JSON/... encoding.
+ *
+ * @packageDocumentation
+ */
+
+export {DEFAULT_BATCH_SIZE, DEFAULT_MAX_ROWS, exportMatchingTraces} from "./exportMatchingTraces"
+export {
+    ADAPTIVE_CEILING_DELAY_MS,
+    ADAPTIVE_FLOOR_DELAY_MS,
+    ADAPTIVE_RAMP_START_FILL,
+    computeAdaptivePageDelayMs,
+} from "./adaptivePacing"
+export type {AdaptivePacingRateLimit} from "./adaptivePacing"
+export {
+    inferQueueMaxFromPlan,
+    QUEUE_MAX_BUSINESS,
+    QUEUE_MAX_ENTERPRISE,
+    QUEUE_MAX_HOBBY,
+    QUEUE_MAX_PRO,
+} from "./tierQueueCap"
+export type {
+    ExportMatchingTracesOptions,
+    ExportMatchingTracesProgress,
+    ExportMatchingTracesResult,
+    ExportTracePage,
+    ExportTracePageFetcher,
+    FlushBatch,
+    ScannedExportRow,
+} from "./exportMatchingTraces"
+
+// Re-exported so consumers have one import surface for the feature.
+export {BatchFlushError} from "../../etl"
diff --git a/web/packages/agenta-entities/src/trace/etl/tierQueueCap.ts b/web/packages/agenta-entities/src/trace/etl/tierQueueCap.ts
new file mode 100644
index 0000000000..c1a8188c8c
--- /dev/null
+++ b/web/packages/agenta-entities/src/trace/etl/tierQueueCap.ts
@@ -0,0 +1,49 @@
+/**
+ * inferQueueMaxFromPlan — per-plan cap on how many traces a single
+ * "add all matching traces to annotation queue" run will batch.
+ *
+ * Annotation queues are for human review, so the cap reflects "how many
+ * traces is a realistic batch on this tier" (bigger plans → bigger teams
+ * → bigger batches), not "how many the system can technically handle".
+ * The system limit is separately enforced by the export pipeline's
+ * 20k-row safety ceiling.
+ *
+ * The frontend already fetches the plan slug via the billing subscription
+ * query — this module is the pure mapping from that slug to the cap, so
+ * it can be unit-tested without any React / store coupling.
+ *
+ * @packageDocumentation
+ */
+
+/** Conservative cap for hobby / free / unknown / pre-billing-load states. */
+export const QUEUE_MAX_HOBBY = 1_000
+/** Pro tier cap. */
+export const QUEUE_MAX_PRO = 2_500
+/** Business tier cap. */
+export const QUEUE_MAX_BUSINESS = 10_000
+/** Enterprise tier cap — matches the bulk-export ceiling. */
+export const QUEUE_MAX_ENTERPRISE = 20_000
+
+/**
+ * Map a subscription plan slug to its per-run queue-batch cap.
+ *
+ * Plan slugs come from `/billing/subscription` and look like
+ * `cloud_v0_pro`, `cloud_v0_business`, `cloud_v0_agenta_ai`,
+ * `self_hosted_enterprise`, … — see `DefaultPlan` in
+ * `api/ee/src/core/entitlements/types.py`.
+ *
+ * Unknown / undefined / OSS-without-billing slugs return the hobby cap —
+ * a safe default that matches the historical behavior.
+ */
+export const inferQueueMaxFromPlan = (plan: string | null | undefined): number => {
+    if (!plan || typeof plan !== "string") return QUEUE_MAX_HOBBY
+    const slug = plan.toLowerCase()
+    // Enterprise (Agenta-managed or self-hosted) gets the highest cap.
+    if (slug.includes("enterprise") || slug.includes("agenta_ai")) {
+        return QUEUE_MAX_ENTERPRISE
+    }
+    if (slug.includes("business")) return QUEUE_MAX_BUSINESS
+    if (slug.includes("pro")) return QUEUE_MAX_PRO
+    // Everything else (hobby, unknown, …) gets the conservative default.
+    return QUEUE_MAX_HOBBY
+}
diff --git a/web/packages/agenta-entities/src/trace/index.ts b/web/packages/agenta-entities/src/trace/index.ts
index eb67094610..6be72564cd 100644
--- a/web/packages/agenta-entities/src/trace/index.ts
+++ b/web/packages/agenta-entities/src/trace/index.ts
@@ -180,4 +180,7 @@ export {
     // Error classes
     SpanNotFoundError,
     TraceNotFoundError,
+    // Cache-aware bulk prefetch (ETL hydrate path). Writes results to the
+    // shared TanStack cache at `["trace-entity", projectId, traceId]`.
+    prefetchTracesByIds,
 } from "./state"
diff --git a/web/packages/agenta-entities/src/trace/state/index.ts b/web/packages/agenta-entities/src/trace/state/index.ts
index 14d826f2d4..5dae76d98d 100644
--- a/web/packages/agenta-entities/src/trace/state/index.ts
+++ b/web/packages/agenta-entities/src/trace/state/index.ts
@@ -29,3 +29,14 @@ export {
     // Span query atom (used internally by molecule)
     spanQueryAtomFamily,
 } from "./store"
+
+// ============================================================================
+// PREFETCH (cache-aware bulk)
+// ============================================================================
+
+export {
+    prefetchTracesByIds,
+    invalidateTrace,
+    type PrefetchTracesArgs,
+    type PrefetchTracesOutcome,
+} from "./prefetch"
diff --git a/web/packages/agenta-entities/src/trace/state/molecule.ts b/web/packages/agenta-entities/src/trace/state/molecule.ts
index a072f6b4cd..21043633de 100644
--- a/web/packages/agenta-entities/src/trace/state/molecule.ts
+++ b/web/packages/agenta-entities/src/trace/state/molecule.ts
@@ -47,18 +47,20 @@ import {atom} from "jotai"
 import {getDefaultStore} from "jotai/vanilla"
 import {atomFamily} from "jotai-family"
 
-import {
-    createMolecule,
-    extendMolecule,
-    normalizeValueForComparison,
-    createControllerAtomFamily,
-    type AtomFamily,
-    type StoreOptions,
-    type FlexibleWritableAtomFamily,
-} from "../../shared"
+// Deep-import from shared/molecule to bypass the contaminated shared barrel.
+import {createControllerAtomFamily} from "../../shared/molecule/createControllerAtomFamily"
+import {normalizeValueForComparison} from "../../shared/molecule/createEntityDraftState"
+import {createMolecule} from "../../shared/molecule/createMolecule"
+import {extendMolecule} from "../../shared/molecule/extendMolecule"
+import type {
+    AtomFamily,
+    StoreOptions,
+    FlexibleWritableAtomFamily,
+} from "../../shared/molecule/types"
 import type {TraceSpan} from "../core"
 import {extractAgData, extractInputs, extractOutputs} from "../utils"
 
+import {evictTracesByIds, prefetchTracesByIds} from "./prefetch"
 import {spanQueryAtomFamily} from "./store"
 
 // ============================================================================
@@ -462,6 +464,29 @@ export const traceSpanMolecule = {
          */
         dataAtom: localDataAtomFamily,
     },
+
+    /**
+     * Bulk actions on the trace cache (the ["trace-entity", projectId, traceId]
+     * shared TanStack slot). Symmetric with the other ETL-hydrated entities:
+     *
+     *   evaluationResultMolecule.actions.prefetchByScenarioIds
+     *   evaluationMetricMolecule.actions.prefetchByScenarioIds
+     *   testcaseMolecule.actions.prefetchByIds
+     *   traceSpanMolecule.actions.prefetchByIds  ← here
+     *
+     * The standalone `prefetchTracesByIds` export stays for backwards
+     * compatibility; this is the convention-aligned entry point.
+     */
+    actions: {
+        prefetchByIds: prefetchTracesByIds,
+        /**
+         * Cache-aware bulk eviction by trace ID list — the per-chunk
+         * counterpart of `prefetchByIds`. An ETL chunk-release hook calls
+         * this once the sink has consumed a chunk so heap stays bounded
+         * by chunk size. Wraps `evictTracesByIds`.
+         */
+        evictByIds: evictTracesByIds,
+    },
 }
 
 // ============================================================================
diff --git a/web/packages/agenta-entities/src/trace/state/prefetch.ts b/web/packages/agenta-entities/src/trace/state/prefetch.ts
new file mode 100644
index 0000000000..30d3603216
--- /dev/null
+++ b/web/packages/agenta-entities/src/trace/state/prefetch.ts
@@ -0,0 +1,199 @@
+/**
+ * Cache-aware bulk-prefetch for traces.
+ *
+ * Composes two layers:
+ *
+ *   1. **TanStack Query cache** at `["trace-entity", projectId, traceId]`,
+ *      shared with `traceEntityAtomFamily(traceId)` so a trace the user has
+ *      already viewed is not refetched here.
+ *
+ *   2. **`fetchAllPreviewTraces`** (in ../api), called once per prefetch
+ *      with a `trace_id IN [...]` filter. `traceBatchFetcher` (in ./store)
+ *      is deliberately bypassed; see the comment at the call site.
+ *
+ * Flow per call:
+ *   1. Read each requested traceId from the TanStack cache.
+ *   2. For misses, issue one bulk `fetchAllPreviewTraces` call covering
+ *      every missing id.
+ *   3. Write the resulting envelopes back to the cache so future readers
+ *      (including React subscribers via `traceEntityAtomFamily`) see them.
+ *
+ * The network filter uses canonical (dash-stripped) IDs, but the cache
+ * stores entries by **dashed** trace_id. This action takes dashed IDs
+ * (as they appear in `result.trace_id`) and returns a Map keyed by dashed
+ * trace_id.
+ *
+ * @packageDocumentation
+ */
+
+import {getDefaultStore} from "jotai/vanilla"
+import {queryClientAtom} from "jotai-tanstack-query"
+
+import {fetchAllPreviewTraces} from "../api"
+import type {TracesApiResponse} from "../core"
+
+function cacheKey(projectId: string, traceId: string) {
+    return ["trace-entity", projectId, traceId] as const
+}
+
+function getQc() {
+    return getDefaultStore().get(queryClientAtom)
+}
+
+export interface PrefetchTracesArgs {
+    projectId: string
+    /** Dashed trace_ids as they appear in `result.trace_id`. */
+    traceIds: string[]
+}
+
+export interface PrefetchTracesOutcome {
+    /** Trace envelopes keyed by dashed trace_id. */
+    traces: Map<string, TracesApiResponse>
+    cacheHits: number
+    cacheMisses: number
+    /** Wall-clock ms for the bulk fetch of cache misses (one round trip); 0 on all hits. */
+    fetchMs: number
+}
+
+export async function prefetchTracesByIds(
+    args: PrefetchTracesArgs,
+): Promise<PrefetchTracesOutcome> {
+    const {projectId, traceIds} = args
+
+    if (traceIds.length === 0) {
+        return {traces: new Map(), cacheHits: 0, cacheMisses: 0, fetchMs: 0}
+    }
+
+    let qc: ReturnType<typeof getQc> | null = null
+    try {
+        qc = getQc()
+    } catch {
+        // No Jotai store available — fall through to fetch-everything
+    }
+
+    const out = new Map<string, TracesApiResponse>()
+    const misses: string[] = []
+
+    if (qc) {
+        for (const tid of traceIds) {
+            const cached = qc.getQueryData<TracesApiResponse>(cacheKey(projectId, tid))
+            if (cached) {
+                out.set(tid, cached)
+            } else {
+                misses.push(tid)
+            }
+        }
+    } else {
+        misses.push(...traceIds)
+    }
+
+    let fetchMs = 0
+    if (misses.length > 0) {
+        const start = performance.now()
+
+        // Bulk-fetch all misses in ONE network call.
+        //
+        // We deliberately do NOT route through `traceBatchFetcher` here. That
+        // fetcher exists to *coalesce* concurrent per-id calls (e.g. many
+        // React components calling `traceEntityAtomFamily(id)` in the same
+        // microtask) into a single bulk request, with `maxBatchSize: 50`
+        // splitting larger batches into multiple network calls. For
+        // already-bulk inputs (our case), that splitting becomes a regression:
+        // 100 trace_ids → 2 round trips instead of 1.
+        //
+        // Calling `fetchAllPreviewTraces` directly with an `IN` filter on all
+        // ids gives us a single round trip regardless of input size. We still
+        // write each result to the shared `["trace-entity", projectId, traceId]`
+        // cache key, so atom subscribers using `traceEntityAtomFamily` see the
+        // same data the batch-fetcher path would have produced.
+        const canonicalIds = misses.map((id) => id.replace(/-/g, ""))
+        try {
+            const data = await fetchAllPreviewTraces(
+                {
+                    focus: "trace",
+                    format: "agenta",
+                    filter: JSON.stringify({
+                        conditions: [{field: "trace_id", operator: "in", value: canonicalIds}],
+                    }),
+                },
+                "",
+                projectId,
+            )
+
+            const tracesObj = (data as {traces?: Record<string, unknown>} | null)?.traces ?? {}
+
+            // Rekey by dashed trace_id (the value callers see in
+            // `result.trace_id`) and populate cache.
+            misses.forEach((traceId, idx) => {
+                const canon = canonicalIds[idx]
+                const traceData = tracesObj[canon]
+                if (traceData) {
+                    // Match `fetchPreviewTrace` envelope shape so atom
+                    // subscribers parse it consistently.
+                    const envelope = {
+                        count: 1,
+                        traces: {[canon]: traceData},
+                    } as unknown as TracesApiResponse
+                    out.set(traceId, envelope)
+                    if (qc) qc.setQueryData(cacheKey(projectId, traceId), envelope)
+                } else if (qc) {
+                    // Negative cache: trace absent from the response (likely not yet ingested).
+                    qc.setQueryData(cacheKey(projectId, traceId), null)
+                }
+            })
+        } catch (e) {
+            // On error, leave cache untouched. Caller can decide to retry.
+            console.warn(
+                `[prefetchTracesByIds] bulk fetch failed: ${e instanceof Error ? e.message : e}`,
+            )
+        }
+
+        fetchMs = performance.now() - start
+    }
+
+    return {
+        traces: out,
+        cacheHits: traceIds.length - misses.length,
+        cacheMisses: misses.length,
+        fetchMs,
+    }
+}
+
+export function invalidateTrace({projectId, traceId}: {projectId: string; traceId: string}) {
+    try {
+        getQc().removeQueries({queryKey: cacheKey(projectId, traceId)})
+    } catch {}
+}
+
+/**
+ * Bulk-evict trace cache entries — the per-chunk counterpart of
+ * `prefetchTracesByIds`. An ETL chunk-release hook calls this once a
+ * chunk is consumed so heap stays bounded by chunk size, not dataset
+ * size. Takes dashed trace_ids (as they appear in `result.trace_id`).
+ * Returns the number of entries removed — includes negative-cache
+ * (`null`) entries written for traces not yet ingested.
+ */
+export function evictTracesByIds({
+    projectId,
+    traceIds,
+}: {
+    projectId: string
+    traceIds: string[]
+}): number {
+    let removed = 0
+    try {
+        const qc = getQc()
+        for (const tid of traceIds) {
+            const key = cacheKey(projectId, tid)
+            // `!== undefined` (not truthiness) so negative-cache `null`
+            // entries are evicted too.
+            if (qc.getQueryData(key) !== undefined) {
+                qc.removeQueries({queryKey: key, exact: true})
+                removed++
+            }
+        }
+    } catch {
+        // No queryClient — nothing to evict.
+    }
+    return removed
+}
diff --git a/web/packages/agenta-entities/src/trace/state/store.ts b/web/packages/agenta-entities/src/trace/state/store.ts
index 24d95ac1b7..50013ca910 100644
--- a/web/packages/agenta-entities/src/trace/state/store.ts
+++ b/web/packages/agenta-entities/src/trace/state/store.ts
@@ -21,10 +21,18 @@
 import {projectIdAtom, sessionAtom} from "@agenta/shared/state"
 import {createBatchFetcher} from "@agenta/shared/utils"
 import {atom, getDefaultStore} from "jotai"
-import {atomFamily} from "jotai-family"
 import {atomWithQuery, queryClientAtom} from "jotai-tanstack-query"
 
-import {createEntityDraftState, normalizeValueForComparison} from "../../shared"
+// Deep imports — the shared/index barrel re-exports React components
+// (UserAuthorLabel) and CSS-coupled paginated table helpers, which break
+// Node-side execution. Pure molecule utilities are Node-safe.
+// instrumentedAtomFamily wraps jotai-family's atomFamily with size/clear
+// visibility — see ../../shared/molecule/instrumentedAtomFamily.
+import {
+    createEntityDraftState,
+    normalizeValueForComparison,
+} from "../../shared/molecule/createEntityDraftState"
+import {instrumentedAtomFamily} from "../../shared/molecule/instrumentedAtomFamily"
 import {fetchAllPreviewTraces} from "../api"
 import {isSpansResponse} from "../api/helpers"
 import type {SpanRequest, TraceRequest, TracesApiResponse} from "../core"
@@ -150,9 +158,14 @@ const spanBatchFetcher = createBatchFetcher<
 
 /**
  * Batch fetcher that combines concurrent trace requests into a single API call
- * Uses the /tracing/spans/query endpoint with trace_id IN filter
+ * Uses the /tracing/spans/query endpoint with trace_id IN filter.
+ *
+ * Exported for entity-layer helpers that need per-id coalescing. For
+ * already-bulk id lists, prefer the higher-level `prefetchTracesByIds()`
+ * action: it reads and writes the same TanStack cache slot but issues one
+ * bulk request directly instead of routing through this fetcher.
  */
-const traceBatchFetcher = createBatchFetcher<
+export const traceBatchFetcher = createBatchFetcher<
     TraceRequest,
     TracesApiResponse | null,
     Map<string, TracesApiResponse | null>
@@ -395,42 +408,44 @@ export class TraceNotFoundError extends Error {
  *
  * This provides the "server state" for each span entity.
  */
-export const spanQueryAtomFamily = atomFamily((spanId: string) =>
-    atomWithQuery((get) => {
-        const projectId = get(projectIdAtom)
-        const queryClient = get(queryClientAtom)
-
-        // Try to find in any cached trace data
-        const cachedData = spanId ? findSpanInCache(queryClient, spanId) : undefined
-
-        return {
-            queryKey: ["span", projectId, spanId],
-            queryFn: async (): Promise<TraceSpan | null> => {
-                if (!projectId || !spanId) return null
-                const result = await spanBatchFetcher({projectId, spanId})
-                // Throw if not found - triggers retry (span may not be ingested yet)
-                if (!result) {
-                    throw new SpanNotFoundError(spanId)
-                }
-                return result
-            },
-            // Use cached data as initial data - prevents fetch if already in cache
-            initialData: cachedData ?? undefined,
-            // Always fetch if we have projectId and spanId (cache redirect handles deduplication)
-            enabled: Boolean(get(sessionAtom) && projectId && spanId),
-            staleTime: 60_000, // 1 minute
-            gcTime: 5 * 60_000, // 5 minutes
-            // Retry configuration for spans not yet ingested
-            retry: (failureCount, error) => {
-                // Only retry SpanNotFoundError, not other errors
-                if (error instanceof SpanNotFoundError && failureCount < 3) {
-                    return true
-                }
-                return false
-            },
-            retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 8000), // 1s, 2s, 4s
-        }
-    }),
+export const spanQueryAtomFamily = instrumentedAtomFamily(
+    (spanId: string) =>
+        atomWithQuery((get) => {
+            const projectId = get(projectIdAtom)
+            const queryClient = get(queryClientAtom)
+
+            // Try to find in any cached trace data
+            const cachedData = spanId ? findSpanInCache(queryClient, spanId) : undefined
+
+            return {
+                queryKey: ["span", projectId, spanId],
+                queryFn: async (): Promise<TraceSpan | null> => {
+                    if (!projectId || !spanId) return null
+                    const result = await spanBatchFetcher({projectId, spanId})
+                    // Throw if not found - triggers retry (span may not be ingested yet)
+                    if (!result) {
+                        throw new SpanNotFoundError(spanId)
+                    }
+                    return result
+                },
+                // Use cached data as initial data - prevents fetch if already in cache
+                initialData: cachedData ?? undefined,
+                // Always fetch if we have projectId and spanId (cache redirect handles deduplication)
+                enabled: Boolean(get(sessionAtom) && projectId && spanId),
+                staleTime: 60_000, // 1 minute
+                gcTime: 5 * 60_000, // 5 minutes
+                // Retry configuration for spans not yet ingested
+                retry: (failureCount, error) => {
+                    // Only retry SpanNotFoundError, not other errors
+                    if (error instanceof SpanNotFoundError && failureCount < 3) {
+                        return true
+                    }
+                    return false
+                },
+                retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 8000), // 1s, 2s, 4s
+            }
+        }),
+    {name: "trace.spanQueryAtomFamily"},
 )
 
 // ============================================================================
@@ -490,24 +505,26 @@ export const updateTraceSpanAtom = traceSpanDraftState.updateAtom
  *
  * Equivalent to testcaseEntityAtomFamily pattern
  */
-export const traceSpanEntityAtomFamily = atomFamily((spanId: string) =>
-    atom((get): TraceSpan | null => {
-        // Use query atom directly as single source of truth for server data
-        const queryState = get(spanQueryAtomFamily(spanId))
-        const serverState = queryState.data ?? null
-        const draftAttrs = get(traceSpanDraftAtomFamily(spanId))
-
-        if (draftAttrs && serverState) {
-            // Merge draft attributes into server state
-            return {
-                ...serverState,
-                attributes: {...serverState.attributes, ...draftAttrs},
+export const traceSpanEntityAtomFamily = instrumentedAtomFamily(
+    (spanId: string) =>
+        atom((get): TraceSpan | null => {
+            // Use query atom directly as single source of truth for server data
+            const queryState = get(spanQueryAtomFamily(spanId))
+            const serverState = queryState.data ?? null
+            const draftAttrs = get(traceSpanDraftAtomFamily(spanId))
+
+            if (draftAttrs && serverState) {
+                // Merge draft attributes into server state
+                return {
+                    ...serverState,
+                    attributes: {...serverState.attributes, ...draftAttrs},
+                }
             }
-        }
 
-        // Return server state (or null if not loaded)
-        return serverState
-    }),
+            // Return server state (or null if not loaded)
+            return serverState
+        }),
+    {name: "trace.traceSpanEntityAtomFamily"},
 )
 
 // ============================================================================
@@ -518,33 +535,39 @@ export const traceSpanEntityAtomFamily = atomFamily((spanId: string) =>
  * Atom family to extract inputs from a span by ID
  * Usage: const inputs = useAtomValue(spanInputsAtomFamily(spanId))
  */
-export const spanInputsAtomFamily = atomFamily((spanId: string) =>
-    atom((get) => {
-        const span = get(traceSpanEntityAtomFamily(spanId))
-        return extractInputs(span)
-    }),
+export const spanInputsAtomFamily = instrumentedAtomFamily(
+    (spanId: string) =>
+        atom((get) => {
+            const span = get(traceSpanEntityAtomFamily(spanId))
+            return extractInputs(span)
+        }),
+    {name: "trace.spanInputsAtomFamily"},
 )
 
 /**
  * Atom family to extract outputs from a span by ID
  * Usage: const outputs = useAtomValue(spanOutputsAtomFamily(spanId))
  */
-export const spanOutputsAtomFamily = atomFamily((spanId: string) =>
-    atom((get) => {
-        const span = get(traceSpanEntityAtomFamily(spanId))
-        return extractOutputs(span)
-    }),
+export const spanOutputsAtomFamily = instrumentedAtomFamily(
+    (spanId: string) =>
+        atom((get) => {
+            const span = get(traceSpanEntityAtomFamily(spanId))
+            return extractOutputs(span)
+        }),
+    {name: "trace.spanOutputsAtomFamily"},
 )
 
 /**
  * Atom family to extract all ag.data from a span by ID
  * Usage: const agData = useAtomValue(spanAgDataAtomFamily(spanId))
  */
-export const spanAgDataAtomFamily = atomFamily((spanId: string) =>
-    atom((get) => {
-        const span = get(traceSpanEntityAtomFamily(spanId))
-        return extractAgData(span)
-    }),
+export const spanAgDataAtomFamily = instrumentedAtomFamily(
+    (spanId: string) =>
+        atom((get) => {
+            const span = get(traceSpanEntityAtomFamily(spanId))
+            return extractAgData(span)
+        }),
+    {name: "trace.spanAgDataAtomFamily"},
 )
 
 // ============================================================================
@@ -564,56 +587,62 @@ export const spanAgDataAtomFamily = atomFamily((spanId: string) =>
  * Uses batch fetching to combine multiple concurrent trace requests into a single API call.
  * Usage: const traceQuery = useAtomValue(traceEntityAtomFamily(traceId))
  */
-export const traceEntityAtomFamily = atomFamily((traceId: string | null) =>
-    atomWithQuery((get) => {
-        const projectId = get(projectIdAtom)
-        const queryClient = get(queryClientAtom)
-
-        return {
-            queryKey: ["trace-entity", projectId, traceId ?? "none"],
-            enabled: Boolean(get(sessionAtom) && traceId && projectId),
-            staleTime: 60_000,
-            gcTime: 5 * 60_000,
-            refetchOnWindowFocus: false,
-            structuralSharing: true,
-            queryFn: async () => {
-                if (!traceId || !projectId) return null
-
-                // Use batch fetcher to combine concurrent trace requests
-                // Returns the same format as fetchPreviewTrace: { traces: { [traceId]: { spans: {...} } } }
-                const response = await traceBatchFetcher({projectId, traceId})
-
-                // Throw if not found - triggers retry (trace may not be ingested yet)
-                if (!response || !response.traces || Object.keys(response.traces).length === 0) {
-                    throw new TraceNotFoundError(traceId)
-                }
+export const traceEntityAtomFamily = instrumentedAtomFamily(
+    (traceId: string | null) =>
+        atomWithQuery((get) => {
+            const projectId = get(projectIdAtom)
+            const queryClient = get(queryClientAtom)
 
-                // Extract all spans from the trace response and populate query cache
-                Object.values(response.traces).forEach((traceEntry) => {
-                    if (traceEntry?.spans) {
-                        Object.values(traceEntry.spans).forEach((spanData) => {
-                            const span = traceSpanSchema.safeParse(spanData)
-                            if (span.success) {
-                                const queryKey = ["span", projectId, span.data.span_id]
-                                queryClient.setQueryData(queryKey, span.data)
-                            }
-                        })
+            return {
+                queryKey: ["trace-entity", projectId, traceId ?? "none"],
+                enabled: Boolean(get(sessionAtom) && traceId && projectId),
+                staleTime: 60_000,
+                gcTime: 5 * 60_000,
+                refetchOnWindowFocus: false,
+                structuralSharing: true,
+                queryFn: async () => {
+                    if (!traceId || !projectId) return null
+
+                    // Use batch fetcher to combine concurrent trace requests
+                    // Returns the same format as fetchPreviewTrace: { traces: { [traceId]: { spans: {...} } } }
+                    const response = await traceBatchFetcher({projectId, traceId})
+
+                    // Throw if not found - triggers retry (trace may not be ingested yet)
+                    if (
+                        !response ||
+                        !response.traces ||
+                        Object.keys(response.traces).length === 0
+                    ) {
+                        throw new TraceNotFoundError(traceId)
                     }
-                })
-
-                return response
-            },
-            // Retry configuration for traces not yet ingested
-            retry: (failureCount, error) => {
-                // Only retry TraceNotFoundError, not other errors
-                if (error instanceof TraceNotFoundError && failureCount < 5) {
-                    return true
-                }
-                return false
-            },
-            retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 10000), // 1s, 2s, 4s, 8s, 10s
-        }
-    }),
+
+                    // Extract all spans from the trace response and populate query cache
+                    Object.values(response.traces).forEach((traceEntry) => {
+                        if (traceEntry?.spans) {
+                            Object.values(traceEntry.spans).forEach((spanData) => {
+                                const span = traceSpanSchema.safeParse(spanData)
+                                if (span.success) {
+                                    const queryKey = ["span", projectId, span.data.span_id]
+                                    queryClient.setQueryData(queryKey, span.data)
+                                }
+                            })
+                        }
+                    })
+
+                    return response
+                },
+                // Retry configuration for traces not yet ingested
+                retry: (failureCount, error) => {
+                    // Only retry TraceNotFoundError, not other errors
+                    if (error instanceof TraceNotFoundError && failureCount < 5) {
+                        return true
+                    }
+                    return false
+                },
+                retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 10000), // 1s, 2s, 4s, 8s, 10s
+            }
+        }),
+    {name: "trace.traceEntityAtomFamily"},
 )
 
 // ============================================================================
@@ -662,12 +691,14 @@ const findRootSpanFromResponse = (
  *
  * Usage: const rootSpan = useAtomValue(traceRootSpanAtomFamily(traceId))
  */
-export const traceRootSpanAtomFamily = atomFamily((traceId: string | null) =>
-    atom((get): TraceSpan | null => {
-        if (!traceId) return null
-        const traceQuery = get(traceEntityAtomFamily(traceId))
-        return findRootSpanFromResponse(traceQuery.data, traceId)
-    }),
+export const traceRootSpanAtomFamily = instrumentedAtomFamily(
+    (traceId: string | null) =>
+        atom((get): TraceSpan | null => {
+            if (!traceId) return null
+            const traceQuery = get(traceEntityAtomFamily(traceId))
+            return findRootSpanFromResponse(traceQuery.data, traceId)
+        }),
+    {name: "trace.traceRootSpanAtomFamily"},
 )
 
 /**
@@ -675,11 +706,13 @@ export const traceRootSpanAtomFamily = atomFamily((traceId: string | null) =>
  *
  * Usage: const inputs = useAtomValue(traceInputsAtomFamily(traceId))
  */
-export const traceInputsAtomFamily = atomFamily((traceId: string | null) =>
-    atom((get) => {
-        const rootSpan = get(traceRootSpanAtomFamily(traceId))
-        return extractInputs(rootSpan)
-    }),
+export const traceInputsAtomFamily = instrumentedAtomFamily(
+    (traceId: string | null) =>
+        atom((get) => {
+            const rootSpan = get(traceRootSpanAtomFamily(traceId))
+            return extractInputs(rootSpan)
+        }),
+    {name: "trace.traceInputsAtomFamily"},
 )
 
 /**
@@ -687,9 +720,11 @@ export const traceInputsAtomFamily = atomFamily((traceId: string | null) =>
  *
  * Usage: const outputs = useAtomValue(traceOutputsAtomFamily(traceId))
  */
-export const traceOutputsAtomFamily = atomFamily((traceId: string | null) =>
-    atom((get) => {
-        const rootSpan = get(traceRootSpanAtomFamily(traceId))
-        return extractOutputs(rootSpan)
-    }),
+export const traceOutputsAtomFamily = instrumentedAtomFamily(
+    (traceId: string | null) =>
+        atom((get) => {
+            const rootSpan = get(traceRootSpanAtomFamily(traceId))
+            return extractOutputs(rootSpan)
+        }),
+    {name: "trace.traceOutputsAtomFamily"},
 )
diff --git a/web/packages/agenta-entities/src/workflow/core/__tests__/schemaType.test.ts b/web/packages/agenta-entities/src/workflow/core/__tests__/schemaType.test.ts
new file mode 100644
index 0000000000..426a3cb856
--- /dev/null
+++ b/web/packages/agenta-entities/src/workflow/core/__tests__/schemaType.test.ts
@@ -0,0 +1,42 @@
+/**
+ * resolveSchemaType — regression guard for nullable evaluator output
+ * types.
+ *
+ * An evaluator output property declared nullable (`type: ["boolean",
+ * "null"]` or `anyOf: [{type: "boolean"}, {type: "null"}]`) must still
+ * resolve to its primitive type — otherwise the scenario filter bar
+ * mistypes a boolean field and offers numeric operators.
+ */
+
+import assert from "node:assert/strict"
+import {describe, it} from "node:test"
+
+import {resolveSchemaType} from "../schemaType"
+
+describe("resolveSchemaType", () => {
+    it("returns a plain string type", () => {
+        assert.equal(resolveSchemaType({type: "boolean"}), "boolean")
+        assert.equal(resolveSchemaType({type: "number"}), "number")
+    })
+
+    it("treats a bare 'null' type as no type", () => {
+        assert.equal(resolveSchemaType({type: "null"}), undefined)
+    })
+
+    it("unwraps a nullable array type — first non-null entry", () => {
+        assert.equal(resolveSchemaType({type: ["boolean", "null"]}), "boolean")
+        assert.equal(resolveSchemaType({type: ["null", "number"]}), "number")
+    })
+
+    it("unwraps a nullable anyOf / oneOf union", () => {
+        assert.equal(resolveSchemaType({anyOf: [{type: "boolean"}, {type: "null"}]}), "boolean")
+        assert.equal(resolveSchemaType({oneOf: [{type: "null"}, {type: "string"}]}), "string")
+    })
+
+    it("returns undefined when no type is resolvable", () => {
+        assert.equal(resolveSchemaType({}), undefined)
+        assert.equal(resolveSchemaType(null), undefined)
+        assert.equal(resolveSchemaType(undefined), undefined)
+        assert.equal(resolveSchemaType({anyOf: [{type: "null"}]}), undefined)
+    })
+})
diff --git a/web/packages/agenta-entities/src/workflow/core/evaluatorResolution.ts b/web/packages/agenta-entities/src/workflow/core/evaluatorResolution.ts
index 99474b7c7c..76c44cb3dd 100644
--- a/web/packages/agenta-entities/src/workflow/core/evaluatorResolution.ts
+++ b/web/packages/agenta-entities/src/workflow/core/evaluatorResolution.ts
@@ -14,6 +14,7 @@
 import {resolveSchemaRef} from "../../runnable/portHelpers"
 
 import {resolveOutputSchema, resolveOutputSchemaProperties} from "./schema"
+import {resolveSchemaType} from "./schemaType"
 
 // ============================================================================
 // TYPES
@@ -135,7 +136,7 @@ export const extractMetrics = (evaluator: {
             kind: "metric" as const,
             path: key,
             stepKey: evaluator.slug || evaluator.id || "metric",
-            metricType: typeof schema?.type === "string" ? schema.type : METRIC_TYPE_FALLBACK,
+            metricType: resolveSchemaType(schema) ?? METRIC_TYPE_FALLBACK,
             displayLabel: typeof schema?.title === "string" ? schema.title : titleize(key),
             description: typeof schema?.description === "string" ? schema.description : undefined,
         }
diff --git a/web/packages/agenta-entities/src/workflow/core/schemaType.ts b/web/packages/agenta-entities/src/workflow/core/schemaType.ts
new file mode 100644
index 0000000000..b50e37204d
--- /dev/null
+++ b/web/packages/agenta-entities/src/workflow/core/schemaType.ts
@@ -0,0 +1,39 @@
+/**
+ * resolveSchemaType — resolve a JSON-schema node's primitive type.
+ *
+ * Tolerates the nullable encodings an evaluator output schema may use:
+ *   - `type: "boolean"`                                    — plain
+ *   - `type: ["boolean", "null"]`                          — array / nullable
+ *   - `anyOf | oneOf: [{type: "boolean"}, {type: "null"}]` — union / nullable
+ *
+ * Returns the first non-`"null"` type found, or `undefined` when none is
+ * resolvable.
+ *
+ * Kept in its own dependency-free module so it is unit-testable without
+ * pulling in the evaluator-resolution import graph.
+ *
+ * @packageDocumentation
+ */
+
+export const resolveSchemaType = (
+    schema: Record<string, unknown> | null | undefined,
+): string | undefined => {
+    if (!schema || typeof schema !== "object") return undefined
+
+    const type = schema.type
+    if (typeof type === "string") return type === "null" ? undefined : type
+    if (Array.isArray(type)) {
+        const first = type.find((t) => typeof t === "string" && t !== "null")
+        if (typeof first === "string") return first
+    }
+
+    for (const key of ["anyOf", "oneOf"] as const) {
+        const branches = schema[key]
+        if (!Array.isArray(branches)) continue
+        for (const branch of branches) {
+            const resolved = resolveSchemaType(branch as Record<string, unknown>)
+            if (resolved) return resolved
+        }
+    }
+    return undefined
+}
diff --git a/web/packages/agenta-entities/tests/unit/adaptive-pacing.test.ts b/web/packages/agenta-entities/tests/unit/adaptive-pacing.test.ts
new file mode 100644
index 0000000000..c13be823dd
--- /dev/null
+++ b/web/packages/agenta-entities/tests/unit/adaptive-pacing.test.ts
@@ -0,0 +1,107 @@
+/**
+ * Unit tests for the adaptive page-fetch delay used by the bulk-trace export.
+ *
+ * Pure function — covers the bucket-fill ramp from "full" (floor delay) to
+ * "empty" (sustained refill rate). The same logic runs across every EE plan
+ * tier because the input is the live `X-RateLimit-*` reading, not the plan.
+ */
+
+import {describe, expect, it} from "vitest"
+
+import {
+    ADAPTIVE_CEILING_DELAY_MS,
+    ADAPTIVE_FLOOR_DELAY_MS,
+    ADAPTIVE_RAMP_START_FILL,
+    computeAdaptivePageDelayMs,
+} from "../../src/trace/etl"
+
+describe("computeAdaptivePageDelayMs", () => {
+    it("returns the floor delay when the bucket is full", () => {
+        expect(computeAdaptivePageDelayMs({remaining: 120, limit: 120})).toBe(
+            ADAPTIVE_FLOOR_DELAY_MS,
+        )
+    })
+
+    it("returns the floor delay when fill is above the ramp threshold", () => {
+        // EE Free plan has TRACING_SLOW capacity = 120. With 100 remaining
+        // (fill ≈ 0.83) we're still in burst-OK territory.
+        expect(computeAdaptivePageDelayMs({remaining: 100, limit: 120})).toBe(
+            ADAPTIVE_FLOOR_DELAY_MS,
+        )
+    })
+
+    it("returns the floor delay when fill equals the ramp threshold", () => {
+        expect(
+            computeAdaptivePageDelayMs({
+                remaining: Math.round(120 * ADAPTIVE_RAMP_START_FILL),
+                limit: 120,
+            }),
+        ).toBe(ADAPTIVE_FLOOR_DELAY_MS)
+    })
+
+    it("ramps the delay up as the bucket drains below the threshold", () => {
+        const halfDrained = computeAdaptivePageDelayMs({
+            remaining: Math.round(120 * (ADAPTIVE_RAMP_START_FILL / 2)),
+            limit: 120,
+        })
+        // Halfway down the ramp — delay should sit roughly in the middle of
+        // the floor → ceiling range.
+        const midpoint = (ADAPTIVE_FLOOR_DELAY_MS + ADAPTIVE_CEILING_DELAY_MS) / 2
+        expect(halfDrained).toBeGreaterThan(ADAPTIVE_FLOOR_DELAY_MS)
+        expect(halfDrained).toBeLessThan(ADAPTIVE_CEILING_DELAY_MS)
+        expect(Math.abs(halfDrained - midpoint)).toBeLessThan(50)
+    })
+
+    it("returns the ceiling delay when the bucket is empty", () => {
+        expect(computeAdaptivePageDelayMs({remaining: 0, limit: 120})).toBe(
+            ADAPTIVE_CEILING_DELAY_MS,
+        )
+    })
+
+    it("returns the ceiling delay when remaining is negative (clock skew)", () => {
+        expect(computeAdaptivePageDelayMs({remaining: -1, limit: 120})).toBe(
+            ADAPTIVE_CEILING_DELAY_MS,
+        )
+    })
+
+    it("returns the floor delay when headers are unavailable", () => {
+        // OSS deployments without EE throttling, or the first call before
+        // any header has been seen.
+        expect(computeAdaptivePageDelayMs({remaining: null, limit: null})).toBe(
+            ADAPTIVE_FLOOR_DELAY_MS,
+        )
+    })
+
+    it("returns the floor delay when limit is 0 (avoid divide-by-zero)", () => {
+        expect(computeAdaptivePageDelayMs({remaining: 0, limit: 0})).toBe(ADAPTIVE_FLOOR_DELAY_MS)
+    })
+
+    it("returns the floor delay when only `remaining` is reported", () => {
+        expect(computeAdaptivePageDelayMs({remaining: 50, limit: null})).toBe(
+            ADAPTIVE_FLOOR_DELAY_MS,
+        )
+    })
+
+    it("never decreases the delay as remaining tokens drop", () => {
+        // The ramp must never reverse — a draining bucket only slows down,
+        // never speeds up. Cross-tier check using EE Business capacity.
+        const limit = 1800
+        const samples = [limit, 1500, 1000, 900, 500, 200, 100, 0].map((remaining) =>
+            computeAdaptivePageDelayMs({remaining, limit}),
+        )
+        for (let i = 1; i < samples.length; i++) {
+            expect(samples[i]).toBeGreaterThanOrEqual(samples[i - 1])
+        }
+        expect(samples[0]).toBe(ADAPTIVE_FLOOR_DELAY_MS)
+        expect(samples[samples.length - 1]).toBe(ADAPTIVE_CEILING_DELAY_MS)
+    })
+
+    it("produces the same delay across tiers at equal fill ratios", () => {
+        // EE Free has capacity 120, Business has 1800. At the same fill
+        // ratio, the delay should be identical — the algorithm is
+        // bucket-aware, not tier-aware.
+        const free = computeAdaptivePageDelayMs({remaining: 30, limit: 120}) // fill 0.25
+        const business = computeAdaptivePageDelayMs({remaining: 450, limit: 1800}) // fill 0.25
+        expect(free).toBe(business)
+    })
+})
diff --git a/web/packages/agenta-entities/tests/unit/add-matching-traces-to-queue.test.ts b/web/packages/agenta-entities/tests/unit/add-matching-traces-to-queue.test.ts
new file mode 100644
index 0000000000..72469c2aaa
--- /dev/null
+++ b/web/packages/agenta-entities/tests/unit/add-matching-traces-to-queue.test.ts
@@ -0,0 +1,185 @@
+/**
+ * Unit tests for `addAllMatchingTracesToQueue` — the batch-add-to-queue ETL
+ * pipeline composition in `@agenta/entities/simpleQueue/etl`.
+ *
+ * Both transports (`fetchPage`, `addTraces`) are injected, so the whole
+ * pipeline is exercised here against fakes — no network or molecule coupling.
+ */
+
+import {describe, expect, it} from "vitest"
+
+import {
+    addAllMatchingTracesToQueue,
+    BatchFlushError,
+    type TracePage,
+    type TracePageFetcher,
+} from "../../src/simpleQueue/etl"
+
+/** Build a `fetchPage` that walks a fixed list of pages. */
+const pagesFetcher = (pages: TracePage[]): TracePageFetcher => {
+    let i = 0
+    return async () => pages[i++] ?? {rows: [], nextCursor: null}
+}
+
+describe("addAllMatchingTracesToQueue", () => {
+    it("scans, dedups by trace_id, flushes in batches, reports a done result", async () => {
+        const flushed: string[][] = []
+        const result = await addAllMatchingTracesToQueue({
+            fetchPage: pagesFetcher([
+                {rows: [{trace_id: "t1"}, {trace_id: "t2"}], nextCursor: "c1"},
+                {rows: [{trace_id: "t2"}, {trace_id: "t3"}], nextCursor: null},
+            ]),
+            addTraces: async (_queueId, ids) => {
+                flushed.push(ids)
+                return "queue-1"
+            },
+            queueId: "queue-1",
+            batchSize: 2,
+            pageDelayMs: 0,
+        })
+
+        expect(result.stoppedBy).toBe("done")
+        expect(result.queued).toBe(3) // t2 deduped across pages
+        expect(flushed).toEqual([["t1", "t2"], ["t3"]])
+    })
+
+    it("reports nothing queued when no trace matches", async () => {
+        const result = await addAllMatchingTracesToQueue({
+            fetchPage: pagesFetcher([{rows: [], nextCursor: null}]),
+            addTraces: async () => "q",
+            queueId: "q",
+            pageDelayMs: 0,
+        })
+
+        expect(result.queued).toBe(0)
+        expect(result.stoppedBy).toBe("done")
+    })
+
+    it("excludes already-queued trace ids", async () => {
+        const flushed: string[][] = []
+        const result = await addAllMatchingTracesToQueue({
+            fetchPage: pagesFetcher([
+                {
+                    rows: [{trace_id: "t1"}, {trace_id: "t2"}, {trace_id: "t3"}],
+                    nextCursor: null,
+                },
+            ]),
+            addTraces: async (_queueId, ids) => {
+                flushed.push(ids)
+                return "q"
+            },
+            queueId: "q",
+            excludeTraceIds: new Set(["t2"]),
+            batchSize: 10,
+            pageDelayMs: 0,
+        })
+
+        expect(result.queued).toBe(2)
+        expect(flushed).toEqual([["t1", "t3"]])
+    })
+
+    it("emits progress for each page and a final reconciliation", async () => {
+        const progress: {scanned: number; queued: number}[] = []
+        await addAllMatchingTracesToQueue({
+            fetchPage: pagesFetcher([
+                {rows: [{trace_id: "t1"}], nextCursor: "c1"},
+                {rows: [{trace_id: "t2"}], nextCursor: null},
+            ]),
+            addTraces: async () => "q",
+            queueId: "q",
+            batchSize: 1,
+            pageDelayMs: 0,
+            onProgress: (p) => progress.push({...p}),
+        })
+
+        expect(progress.length).toBeGreaterThanOrEqual(2)
+        expect(progress[progress.length - 1].queued).toBe(2)
+    })
+
+    it("stops at the maxItems cap and queues exactly maxItems", async () => {
+        let page = 0
+        const flushed: string[][] = []
+        const result = await addAllMatchingTracesToQueue({
+            // Pages never run out — only the cap stops the scan.
+            fetchPage: async () => {
+                const base = page++ * 100
+                return {
+                    rows: Array.from({length: 100}, (_, i) => ({trace_id: `t${base + i}`})),
+                    nextCursor: `c${page}`,
+                }
+            },
+            addTraces: async (_queueId, ids) => {
+                flushed.push(ids)
+                return "q"
+            },
+            queueId: "q",
+            maxItems: 250,
+            batchSize: 1000,
+            pageDelayMs: 0,
+        })
+
+        expect(result.stoppedBy).toBe("cap")
+        // The transform's `limit` makes the cap exact — never an overshoot.
+        expect(result.queued).toBe(250)
+        expect(flushed.flat()).toHaveLength(250)
+        expect(result.scanned).toBeGreaterThanOrEqual(250)
+    })
+
+    it("reports `done`, not `cap`, when the source exhausts exactly at the cap", async () => {
+        const result = await addAllMatchingTracesToQueue({
+            fetchPage: pagesFetcher([
+                {
+                    rows: Array.from({length: 250}, (_, i) => ({trace_id: `t${i}`})),
+                    nextCursor: null,
+                },
+            ]),
+            addTraces: async () => "q",
+            queueId: "q",
+            maxItems: 250,
+            batchSize: 1000,
+            pageDelayMs: 0,
+        })
+
+        // Everything matching was queued — the limit was reached but nothing
+        // was truncated, so this is a clean completion.
+        expect(result.stoppedBy).toBe("done")
+        expect(result.queued).toBe(250)
+    })
+
+    it("surfaces a flush failure as BatchFlushError", async () => {
+        let caught: unknown
+        try {
+            await addAllMatchingTracesToQueue({
+                fetchPage: pagesFetcher([
+                    {rows: [{trace_id: "t1"}, {trace_id: "t2"}], nextCursor: null},
+                ]),
+                addTraces: async () => null, // null return simulates a handled (non-throwing) failure
+                queueId: "q",
+                batchSize: 2,
+                pageDelayMs: 0,
+            })
+        } catch (err) {
+            caught = err
+        }
+        expect(caught).toBeInstanceOf(BatchFlushError)
+    })
+
+    it("returns a cancelled result when aborted mid-scan", async () => {
+        const controller = new AbortController()
+        let calls = 0
+        const result = await addAllMatchingTracesToQueue({
+            fetchPage: async () => {
+                calls++
+                if (calls === 2) controller.abort()
+                return {rows: [{trace_id: `t${calls}`}], nextCursor: `c${calls}`}
+            },
+            addTraces: async () => "q",
+            queueId: "q",
+            signal: controller.signal,
+            batchSize: 100,
+            pageDelayMs: 0,
+        })
+
+        expect(result.stoppedBy).toBe("cancelled")
+    })
+})
diff --git a/web/packages/agenta-entities/tests/unit/etl-primitives.test.ts b/web/packages/agenta-entities/tests/unit/etl-primitives.test.ts
new file mode 100644
index 0000000000..6940ebbf9a
--- /dev/null
+++ b/web/packages/agenta-entities/tests/unit/etl-primitives.test.ts
@@ -0,0 +1,291 @@
+/**
+ * Unit tests for the generic ETL primitives in `@agenta/entities/etl`:
+ * makeSourceFromCursorFetch, makeBufferedBatchSink, makeUniqueKeyTransform.
+ *
+ * All three are dependency-injected and pure — exercised here against fakes,
+ * no network or entity coupling.
+ */
+
+import {describe, expect, it} from "vitest"
+
+import {
+    BatchFlushError,
+    makeBufferedBatchSink,
+    makeSourceFromCursorFetch,
+    makeUniqueKeyTransform,
+    type Chunk,
+    type CursorPage,
+    type Source,
+} from "../../src/etl"
+
+// ── makeSourceFromCursorFetch ───────────────────────────────────────────────
+
+const collectChunks = async <T>(
+    source: Source<T, undefined>,
+    signal: AbortSignal,
+): Promise<Chunk<T>[]> => {
+    const chunks: Chunk<T>[] = []
+    for await (const chunk of source.extract(undefined, signal)) chunks.push(chunk)
+    return chunks
+}
+
+describe("makeSourceFromCursorFetch", () => {
+    it("yields one chunk per page, advances the cursor, ends with cursor null", async () => {
+        const pages: CursorPage<string>[] = [
+            {rows: ["a", "b"], nextCursor: "c1"},
+            {rows: ["c", "d"], nextCursor: "c2"},
+            {rows: ["e"], nextCursor: null},
+        ]
+        const seenCursors: (string | null)[] = []
+        let i = 0
+        const source = makeSourceFromCursorFetch<string>({
+            pageDelayMs: 0,
+            fetchPage: async (cursor) => {
+                seenCursors.push(cursor)
+                return pages[i++]
+            },
+        })
+
+        const chunks = await collectChunks(source, new AbortController().signal)
+
+        expect(chunks).toEqual([
+            {items: ["a", "b"], cursor: "c1"},
+            {items: ["c", "d"], cursor: "c2"},
+            {items: ["e"], cursor: null},
+        ])
+        // First page is fetched with cursor null; later pages advance.
+        expect(seenCursors).toEqual([null, "c1", "c2"])
+    })
+
+    it("stops via the empty-page no-progress guard", async () => {
+        let calls = 0
+        const source = makeSourceFromCursorFetch<string>({
+            pageDelayMs: 0,
+            maxEmptyPages: 3,
+            // Cursor advances each page (so the stuck-cursor guard doesn't
+            // fire) but rows stay empty — exercises the empty-page guard.
+            fetchPage: async () => {
+                calls++
+                return {rows: [], nextCursor: `c${calls}`}
+            },
+        })
+
+        const chunks = await collectChunks(source, new AbortController().signal)
+
+        expect(calls).toBe(3)
+        expect(chunks).toHaveLength(3)
+        expect(chunks[2].cursor).toBeNull()
+    })
+
+    it("stops when the server returns the same cursor we just used (stuck-cursor guard)", async () => {
+        // Real-world trigger: every row on the page shares a `start_time`
+        // the backend's strict-less-than filter can't bump past, so each
+        // fetch returns the same rows + same cursor. The empty-page guard
+        // never fires because `rows.length > 0`, but the loop can never
+        // make forward progress — the stuck-cursor check ends it cleanly.
+        let calls = 0
+        const seenCursors: (string | null)[] = []
+        const source = makeSourceFromCursorFetch<string>({
+            pageDelayMs: 0,
+            fetchPage: async (cursor) => {
+                calls++
+                seenCursors.push(cursor)
+                return {rows: ["a", "b"], nextCursor: "stuck-timestamp"}
+            },
+        })
+
+        const chunks = await collectChunks(source, new AbortController().signal)
+
+        // First fetch with cursor=null gets nextCursor="stuck-timestamp"
+        // (no comparison possible). Second fetch with cursor="stuck-timestamp"
+        // gets the SAME nextCursor back → loop ends.
+        expect(calls).toBe(2)
+        expect(seenCursors).toEqual([null, "stuck-timestamp"])
+        expect(chunks).toHaveLength(2)
+        expect(chunks[1].cursor).toBeNull() // last chunk is closed off
+    })
+
+    it("stops scanning once the signal is aborted", async () => {
+        const controller = new AbortController()
+        let calls = 0
+        const source = makeSourceFromCursorFetch<string>({
+            pageDelayMs: 0,
+            fetchPage: async () => {
+                calls++
+                if (calls === 2) controller.abort()
+                return {rows: ["x"], nextCursor: "next"}
+            },
+        })
+
+        await collectChunks(source, controller.signal)
+
+        // Page 3 is never fetched — the loop checks the signal first.
+        expect(calls).toBe(2)
+    })
+})
+
+// ── makeBufferedBatchSink ───────────────────────────────────────────────────
+
+const chunkOf = <T>(items: T[]): Chunk<T> => ({items, cursor: null})
+
+describe("makeBufferedBatchSink", () => {
+    it("flushes every full batch; finalize flushes the remainder", async () => {
+        const flushed: string[][] = []
+        const {sink, getFlushedCount} = makeBufferedBatchSink<string>({
+            batchSize: 2,
+            flush: async (batch) => {
+                flushed.push(batch)
+            },
+        })
+
+        await sink.load(chunkOf(["a", "b", "c"]))
+        expect(flushed).toEqual([["a", "b"]]) // [c] buffered
+
+        await sink.load(chunkOf(["d"]))
+        expect(flushed).toEqual([
+            ["a", "b"],
+            ["c", "d"],
+        ])
+
+        await sink.load(chunkOf(["e"]))
+        await sink.finalize?.()
+        expect(flushed).toEqual([["a", "b"], ["c", "d"], ["e"]])
+        expect(getFlushedCount()).toBe(5)
+    })
+
+    it("finalize drops the buffer when aborted, so the flushed count stays a multiple of batchSize", async () => {
+        const controller = new AbortController()
+        const flushed: string[][] = []
+        const {sink, getFlushedCount} = makeBufferedBatchSink<string>({
+            batchSize: 10,
+            signal: controller.signal,
+            flush: async (batch) => {
+                flushed.push(batch)
+            },
+        })
+
+        await sink.load(chunkOf(["a", "b", "c"])) // no full batch — all buffered
+        controller.abort()
+        await sink.finalize?.()
+
+        expect(flushed).toEqual([])
+        expect(getFlushedCount()).toBe(0)
+    })
+
+    it("a failed flush throws BatchFlushError carrying the flushed-so-far count", async () => {
+        const {sink, getFlushedCount} = makeBufferedBatchSink<string>({
+            batchSize: 2,
+            flush: async (batch) => {
+                if (batch[0] === "c") throw new Error("boom")
+            },
+        })
+
+        await sink.load(chunkOf(["a", "b"]))
+        expect(getFlushedCount()).toBe(2)
+
+        let caught: unknown
+        try {
+            await sink.load(chunkOf(["c", "d"]))
+        } catch (err) {
+            caught = err
+        }
+        expect(caught).toBeInstanceOf(BatchFlushError)
+        expect((caught as BatchFlushError).flushedCount).toBe(2)
+        expect((caught as BatchFlushError).failedCount).toBe(2)
+    })
+
+    it("finalize drops a non-empty buffer after a prior failed flush", async () => {
+        const flushed: string[][] = []
+        const {sink} = makeBufferedBatchSink<string>({
+            batchSize: 2,
+            flush: async (batch) => {
+                flushed.push([...batch])
+                throw new Error("boom")
+            },
+        })
+
+        let caught: unknown
+        try {
+            // [a,b] is flushed (and fails); [c] stays buffered.
+            await sink.load(chunkOf(["a", "b", "c"]))
+        } catch (err) {
+            caught = err
+        }
+        expect(caught).toBeInstanceOf(BatchFlushError)
+        expect(flushed).toEqual([["a", "b"]])
+
+        await sink.finalize?.()
+        // The sink has already errored, so finalize drops [c] without a second flush attempt.
+        expect(flushed).toEqual([["a", "b"]])
+    })
+})
+
+// ── makeUniqueKeyTransform ──────────────────────────────────────────────────
+
+describe("makeUniqueKeyTransform", () => {
+    it("dedups keys across chunk boundaries", async () => {
+        const transform = makeUniqueKeyTransform<{id: string}>({selectKey: (row) => row.id})
+
+        const first = await transform({items: [{id: "a"}, {id: "b"}, {id: "a"}], cursor: "x"})
+        const second = await transform({items: [{id: "b"}, {id: "c"}], cursor: null})
+
+        expect(first.items).toEqual(["a", "b"])
+        expect(first.cursor).toBe("x")
+        expect(second.items).toEqual(["c"]) // "b" was already seen in the first chunk
+    })
+
+    it("honors the exclude set — excluded keys counted as seen, never emitted", async () => {
+        const transform = makeUniqueKeyTransform<{id: string}>({
+            selectKey: (row) => row.id,
+            exclude: new Set(["b"]),
+        })
+
+        const out = await transform({items: [{id: "a"}, {id: "b"}, {id: "c"}], cursor: null})
+
+        expect(out.items).toEqual(["a", "c"])
+    })
+
+    it("skips rows with no key", async () => {
+        const transform = makeUniqueKeyTransform<{id?: string}>({selectKey: (row) => row.id})
+
+        const out = await transform({
+            items: [{id: "a"}, {}, {id: undefined}, {id: ""}],
+            cursor: null,
+        })
+
+        expect(out.items).toEqual(["a"])
+    })
+
+    it("caps total emitted keys at the limit, across chunk boundaries", async () => {
+        const transform = makeUniqueKeyTransform<{id: string}>({
+            selectKey: (row) => row.id,
+            limit: 3,
+        })
+
+        const first = await transform({items: [{id: "a"}, {id: "b"}], cursor: "x"})
+        const second = await transform({
+            items: [{id: "c"}, {id: "d"}, {id: "e"}],
+            cursor: null,
+        })
+
+        expect(first.items).toEqual(["a", "b"])
+        // Only "c" fits under the limit of 3 — "d" and "e" are dropped.
+        expect(second.items).toEqual(["c"])
+    })
+
+    it("does not count excluded keys toward the limit", async () => {
+        const transform = makeUniqueKeyTransform<{id: string}>({
+            selectKey: (row) => row.id,
+            exclude: new Set(["a", "b"]),
+            limit: 2,
+        })
+
+        const out = await transform({
+            items: [{id: "a"}, {id: "b"}, {id: "c"}, {id: "d"}, {id: "e"}],
+            cursor: null,
+        })
+
+        // a, b excluded (never emitted, never counted); c, d fill the limit.
+        expect(out.items).toEqual(["c", "d"])
+    })
+})
diff --git a/web/packages/agenta-entities/tests/unit/export-matching-traces.test.ts b/web/packages/agenta-entities/tests/unit/export-matching-traces.test.ts
new file mode 100644
index 0000000000..c5b1cb9d24
--- /dev/null
+++ b/web/packages/agenta-entities/tests/unit/export-matching-traces.test.ts
@@ -0,0 +1,366 @@
+/**
+ * Unit tests for `exportMatchingTraces` — the bulk-trace export ETL pipeline
+ * composition in `@agenta/entities/trace/etl`.
+ *
+ * Both transports (`fetchPage`, `flushBatch`) are injected, so the whole
+ * pipeline is exercised here against fakes — no network, no CSV encoding,
+ * no oss coupling.
+ */
+
+import {describe, expect, it} from "vitest"
+
+import {
+    BatchFlushError,
+    exportMatchingTraces,
+    type ExportTracePage,
+    type ExportTracePageFetcher,
+    type ScannedExportRow,
+} from "../../src/trace/etl"
+
+interface FakeSpan extends ScannedExportRow {
+    trace_id: string
+    span_id: string
+    name?: string
+    children?: FakeSpan[]
+}
+
+const span = (trace_id: string, span_id: string, extras: Partial<FakeSpan> = {}): FakeSpan => ({
+    trace_id,
+    span_id,
+    ...extras,
+})
+
+/** Build a `fetchPage` that walks a fixed list of pages. */
+const pagesFetcher = (pages: ExportTracePage<FakeSpan>[]): ExportTracePageFetcher<FakeSpan> => {
+    let i = 0
+    return async () => pages[i++] ?? {rows: [], nextCursor: null}
+}
+
+describe("exportMatchingTraces", () => {
+    it("flattens span trees, dedups by trace+span, and flushes in batches", async () => {
+        const flushed: FakeSpan[][] = []
+        const result = await exportMatchingTraces<FakeSpan>({
+            fetchPage: pagesFetcher([
+                {
+                    rows: [
+                        span("t1", "s1", {children: [span("t1", "s2"), span("t1", "s3")]}),
+                        span("t2", "s4"),
+                    ],
+                    nextCursor: null,
+                },
+            ]),
+            flushBatch: async (batch) => {
+                flushed.push(batch)
+            },
+            batchSize: 2,
+            pageDelayMs: 0,
+        })
+
+        expect(result.stoppedBy).toBe("done")
+        expect(result.rowCount).toBe(4)
+        // 4 flattened spans across 2 batches of size 2.
+        expect(flushed.map((b) => b.map((r) => r.span_id))).toEqual([
+            ["s1", "s2"],
+            ["s3", "s4"],
+        ])
+    })
+
+    it("dedups identical rows across page boundaries", async () => {
+        const flushed: FakeSpan[][] = []
+        const result = await exportMatchingTraces<FakeSpan>({
+            fetchPage: pagesFetcher([
+                {rows: [span("t1", "s1"), span("t2", "s2")], nextCursor: "c1"},
+                {rows: [span("t2", "s2"), span("t3", "s3")], nextCursor: null},
+            ]),
+            flushBatch: async (batch) => {
+                flushed.push(batch)
+            },
+            batchSize: 10,
+            pageDelayMs: 0,
+        })
+
+        expect(result.rowCount).toBe(3)
+        expect(flushed.flat().map((r) => r.span_id)).toEqual(["s1", "s2", "s3"])
+    })
+
+    it("reports rowCount=0 and stoppedBy=done when no traces match", async () => {
+        const result = await exportMatchingTraces<FakeSpan>({
+            fetchPage: pagesFetcher([{rows: [], nextCursor: null}]),
+            flushBatch: async () => {},
+            pageDelayMs: 0,
+        })
+
+        expect(result.rowCount).toBe(0)
+        expect(result.stoppedBy).toBe("done")
+        expect(result.limitReached).toBe(false)
+    })
+
+    it("stops at the maxRows cap and flushes exactly maxRows", async () => {
+        let page = 0
+        const flushed: FakeSpan[][] = []
+        const result = await exportMatchingTraces<FakeSpan>({
+            // Pages never run out — only the cap stops the scan.
+            fetchPage: async () => {
+                const base = page++ * 100
+                return {
+                    rows: Array.from({length: 100}, (_, i) => span(`t${base + i}`, `s${base + i}`)),
+                    nextCursor: `c${page}`,
+                }
+            },
+            flushBatch: async (batch) => {
+                flushed.push(batch)
+            },
+            maxRows: 250,
+            batchSize: 1000,
+            pageDelayMs: 0,
+        })
+
+        expect(result.stoppedBy).toBe("limit")
+        expect(result.limitReached).toBe(true)
+        // The transform's `limit` makes the cap exact — never an overshoot.
+        expect(result.rowCount).toBe(250)
+        expect(flushed.flat()).toHaveLength(250)
+        expect(result.scanned).toBeGreaterThanOrEqual(250)
+    })
+
+    it("reports done (not limit) when the source exhausts exactly at the cap", async () => {
+        const result = await exportMatchingTraces<FakeSpan>({
+            fetchPage: pagesFetcher([
+                {
+                    rows: Array.from({length: 250}, (_, i) => span(`t${i}`, `s${i}`)),
+                    nextCursor: null,
+                },
+            ]),
+            flushBatch: async () => {},
+            maxRows: 250,
+            batchSize: 1000,
+            pageDelayMs: 0,
+        })
+
+        // Everything matching was flushed — the limit was reached but nothing
+        // was truncated, so this is a clean completion.
+        expect(result.stoppedBy).toBe("done")
+        expect(result.limitReached).toBe(false)
+        expect(result.rowCount).toBe(250)
+    })
+
+    it("recurses into nested children when flattening", async () => {
+        const flushed: FakeSpan[][] = []
+        await exportMatchingTraces<FakeSpan>({
+            fetchPage: pagesFetcher([
+                {
+                    rows: [
+                        span("t1", "root", {
+                            children: [
+                                span("t1", "child", {
+                                    children: [span("t1", "grandchild")],
+                                }),
+                            ],
+                        }),
+                    ],
+                    nextCursor: null,
+                },
+            ]),
+            flushBatch: async (batch) => {
+                flushed.push(batch)
+            },
+            batchSize: 100,
+            pageDelayMs: 0,
+        })
+
+        expect(flushed.flat().map((r) => r.span_id)).toEqual(["root", "child", "grandchild"])
+    })
+
+    it("emits progress for each page and a final reconciliation", async () => {
+        const progress: {scanned: number; rows: number}[] = []
+        await exportMatchingTraces<FakeSpan>({
+            fetchPage: pagesFetcher([
+                {rows: [span("t1", "s1")], nextCursor: "c1"},
+                {rows: [span("t2", "s2")], nextCursor: null},
+            ]),
+            flushBatch: async () => {},
+            batchSize: 1,
+            pageDelayMs: 0,
+            onProgress: (p) => progress.push({...p}),
+        })
+
+        expect(progress.length).toBeGreaterThanOrEqual(2)
+        expect(progress[progress.length - 1].rows).toBe(2)
+    })
+
+    it("surfaces a flush failure as BatchFlushError", async () => {
+        let caught: unknown
+        try {
+            await exportMatchingTraces<FakeSpan>({
+                fetchPage: pagesFetcher([
+                    {rows: [span("t1", "s1"), span("t2", "s2")], nextCursor: null},
+                ]),
+                flushBatch: async () => {
+                    throw new Error("simulated flush failure")
+                },
+                batchSize: 2,
+                pageDelayMs: 0,
+            })
+        } catch (err) {
+            caught = err
+        }
+        expect(caught).toBeInstanceOf(BatchFlushError)
+    })
+
+    it("returns a cancelled result when aborted mid-scan", async () => {
+        const controller = new AbortController()
+        let calls = 0
+        const result = await exportMatchingTraces<FakeSpan>({
+            fetchPage: async () => {
+                calls++
+                if (calls === 2) controller.abort()
+                return {rows: [span(`t${calls}`, `s${calls}`)], nextCursor: `c${calls}`}
+            },
+            flushBatch: async () => {},
+            signal: controller.signal,
+            batchSize: 100,
+            pageDelayMs: 0,
+        })
+
+        expect(result.stoppedBy).toBe("cancelled")
+        expect(result.limitReached).toBe(false)
+    })
+
+    it("accepts a fetchPage that retries internally (rate-limit-wrapper contract)", async () => {
+        // Demonstrates the integration contract: callers may wrap their
+        // transport with retry logic (e.g. `withRateLimitRetry` in oss) so a
+        // transient failure pauses and re-fetches before resolving. The
+        // pipeline must accept the eventual successful resolution without
+        // double-counting rows or breaking pagination.
+        // Counts every call to the raw transport (the failed one included).
+        let attempts = 0
+        const fetchPage: ExportTracePageFetcher<FakeSpan> = async () => {
+            attempts += 1
+            if (attempts === 1) {
+                throw Object.assign(new Error("Rate limit exceeded"), {status: 429})
+            }
+            return {
+                rows: [span("t1", "s1"), span("t2", "s2")],
+                nextCursor: null,
+            }
+        }
+
+        const flushed: FakeSpan[][] = []
+        // Wrap the transport with a one-shot retry that swallows the first 429,
+        // the same shape `withRateLimitRetry` produces in production.
+        const fetchPageWithRetry: ExportTracePageFetcher<FakeSpan> = async (cursor, signal) => {
+            try {
+                return await fetchPage(cursor, signal)
+            } catch (err) {
+                if ((err as {status?: number}).status !== 429) throw err
+                return await fetchPage(cursor, signal)
+            }
+        }
+
+        const result = await exportMatchingTraces<FakeSpan>({
+            fetchPage: fetchPageWithRetry,
+            flushBatch: async (batch) => {
+                flushed.push(batch)
+            },
+            batchSize: 10,
+            pageDelayMs: 0,
+        })
+
+        expect(result.rowCount).toBe(2)
+        expect(result.stoppedBy).toBe("done")
+        expect(flushed.flat().map((r) => r.span_id)).toEqual(["s1", "s2"])
+        expect(attempts).toBe(2) // 429 then success
+    })
+
+    it("terminates cleanly when the server returns a stuck cursor (regression for the 'Exporting 0 rows' hang)", async () => {
+        // Reproduces the bug QA hit: every fetched page contains the same
+        // rows (cursor never advances). Without the source's stuck-cursor
+        // guard, the dedup transform filters every page after the first to
+        // empty, the empty-page guard never fires (raw rows.length > 0), and
+        // the scan loops forever while the UI shows "Exporting 0 rows".
+        let calls = 0
+        const flushed: FakeSpan[][] = []
+        const result = await exportMatchingTraces<FakeSpan>({
+            fetchPage: async () => {
+                calls++
+                return {
+                    rows: [span("t1", "s1"), span("t1", "s2")],
+                    nextCursor: "stuck-timestamp",
+                }
+            },
+            flushBatch: async (batch) => {
+                flushed.push(batch)
+            },
+            batchSize: 1000,
+            pageDelayMs: 0,
+        })
+
+        // First fetch (cursor=null) reads the page. Second fetch
+        // (cursor="stuck-timestamp") returns the same cursor → source
+        // ends the stream. Pipeline finalizes with whatever made it past
+        // dedup from the first page.
+        expect(calls).toBe(2)
+        expect(result.stoppedBy).toBe("done")
+        expect(result.rowCount).toBe(2) // s1, s2 from first page; second page deduped out
+        expect(flushed.flat().map((r) => r.span_id)).toEqual(["s1", "s2"])
+    })
+
+    it("progress reports matched rows immediately (regression for 'stuck at 0' during scan)", async () => {
+        // The buffered batch sink only flushes when a batch fills (default
+        // 500). Reporting the flushed count makes the toast stay at 0
+        // until the first full batch lands — confusing for users who see
+        // "Exporting 0 rows" while pages are clearly being fetched.
+        const progressRows: number[] = []
+        await exportMatchingTraces<FakeSpan>({
+            fetchPage: pagesFetcher([
+                {
+                    rows: Array.from({length: 50}, (_, i) => span(`t${i}`, `s${i}`)),
+                    nextCursor: "c1",
+                },
+                {
+                    rows: Array.from({length: 50}, (_, i) => span(`t${50 + i}`, `s${50 + i}`)),
+                    nextCursor: null,
+                },
+            ]),
+            flushBatch: async () => {},
+            // Batch size larger than the total — nothing actually flushes
+            // until finalize. Progress must still report per-page rows.
+            batchSize: 1000,
+            pageDelayMs: 0,
+            onProgress: (p) => progressRows.push(p.rows),
+        })
+
+        // Two per-page progress events + one final reconciliation. Each
+        // per-page value is the cumulative post-dedup matched count, never 0.
+        expect(progressRows.length).toBeGreaterThanOrEqual(2)
+        expect(progressRows[0]).toBe(50)
+        expect(progressRows[1]).toBe(100)
+        expect(progressRows[progressRows.length - 1]).toBe(100)
+    })
+
+    it("uses a custom selectKey when provided", async () => {
+        const flushed: FakeSpan[][] = []
+        const result = await exportMatchingTraces<FakeSpan>({
+            fetchPage: pagesFetcher([
+                {
+                    rows: [
+                        span("t1", "s1", {name: "duplicate-name"}),
+                        span("t2", "s2", {name: "duplicate-name"}),
+                        span("t3", "s3", {name: "unique-name"}),
+                    ],
+                    nextCursor: null,
+                },
+            ]),
+            flushBatch: async (batch) => {
+                flushed.push(batch)
+            },
+            // Dedup by name instead of trace+span — drops the second
+            // "duplicate-name" row.
+            selectKey: (row) => row.name ?? `${row.trace_id}:${row.span_id}`,
+            batchSize: 10,
+            pageDelayMs: 0,
+        })
+
+        expect(result.rowCount).toBe(2)
+        expect(flushed.flat().map((r) => r.span_id)).toEqual(["s1", "s3"])
+    })
+})
diff --git a/web/packages/agenta-entities/tests/unit/tier-queue-cap.test.ts b/web/packages/agenta-entities/tests/unit/tier-queue-cap.test.ts
new file mode 100644
index 0000000000..47ddfba0c7
--- /dev/null
+++ b/web/packages/agenta-entities/tests/unit/tier-queue-cap.test.ts
@@ -0,0 +1,70 @@
+/**
+ * Unit tests for `inferQueueMaxFromPlan` — the plan-slug → queue-batch-cap
+ * mapping in `@agenta/entities/trace/etl`.
+ *
+ * Pure function. The frontend reads the subscription plan slug from
+ * `currentSubscriptionQueryAtom.data?.plan`; this module owns the
+ * tier-aware ceiling on how many traces a single batch-add-to-queue run
+ * processes.
+ */
+
+import {describe, expect, it} from "vitest"
+
+import {
+    inferQueueMaxFromPlan,
+    QUEUE_MAX_BUSINESS,
+    QUEUE_MAX_ENTERPRISE,
+    QUEUE_MAX_HOBBY,
+    QUEUE_MAX_PRO,
+} from "../../src/trace/etl"
+
+describe("inferQueueMaxFromPlan", () => {
+    it("returns the hobby cap for the hobby slug", () => {
+        expect(inferQueueMaxFromPlan("cloud_v0_hobby")).toBe(QUEUE_MAX_HOBBY)
+    })
+
+    it("returns the pro cap for the pro slug", () => {
+        expect(inferQueueMaxFromPlan("cloud_v0_pro")).toBe(QUEUE_MAX_PRO)
+    })
+
+    it("returns the business cap for the business slug", () => {
+        expect(inferQueueMaxFromPlan("cloud_v0_business")).toBe(QUEUE_MAX_BUSINESS)
+    })
+
+    it("returns the enterprise cap for the Agenta-managed enterprise slug", () => {
+        expect(inferQueueMaxFromPlan("cloud_v0_agenta_ai")).toBe(QUEUE_MAX_ENTERPRISE)
+    })
+
+    it("returns the enterprise cap for the self-hosted enterprise slug", () => {
+        expect(inferQueueMaxFromPlan("self_hosted_enterprise")).toBe(QUEUE_MAX_ENTERPRISE)
+    })
+
+    it("returns the hobby cap for null / undefined plans (OSS / pre-billing-load)", () => {
+        expect(inferQueueMaxFromPlan(null)).toBe(QUEUE_MAX_HOBBY)
+        expect(inferQueueMaxFromPlan(undefined)).toBe(QUEUE_MAX_HOBBY)
+    })
+
+    it("returns the hobby cap for unknown plan slugs", () => {
+        expect(inferQueueMaxFromPlan("not_a_real_plan")).toBe(QUEUE_MAX_HOBBY)
+        expect(inferQueueMaxFromPlan("")).toBe(QUEUE_MAX_HOBBY)
+    })
+
+    it("is case-insensitive", () => {
+        expect(inferQueueMaxFromPlan("CLOUD_V0_PRO")).toBe(QUEUE_MAX_PRO)
+        expect(inferQueueMaxFromPlan("Cloud_V0_Business")).toBe(QUEUE_MAX_BUSINESS)
+    })
+
+    it("the cap order respects the tier hierarchy", () => {
+        // Sanity check on the constants themselves so a typo can't silently
+        // demote a higher tier below a lower one.
+        expect(QUEUE_MAX_HOBBY).toBeLessThan(QUEUE_MAX_PRO)
+        expect(QUEUE_MAX_PRO).toBeLessThan(QUEUE_MAX_BUSINESS)
+        expect(QUEUE_MAX_BUSINESS).toBeLessThan(QUEUE_MAX_ENTERPRISE)
+    })
+
+    it("the enterprise cap matches the bulk-export ceiling so both pipelines agree at the top tier", () => {
+        // Pinned to a literal that should mirror the export's MAX_ROWS, so
+        // a future bump must touch both. MAX_ROWS itself is not imported here.
+        expect(QUEUE_MAX_ENTERPRISE).toBe(20_000)
+    })
+})
diff --git a/web/packages/agenta-entities/tsconfig.json b/web/packages/agenta-entities/tsconfig.json
index a372768828..600613942b 100644
--- a/web/packages/agenta-entities/tsconfig.json
+++ b/web/packages/agenta-entities/tsconfig.json
@@ -6,5 +6,9 @@
         "tsBuildInfoFile": ".tsbuildinfo"
     },
     "include": ["src/**/*.ts", "src/**/*.tsx", "../css-modules.d.ts"],
-    "exclude": ["node_modules", "dist"]
+    "exclude": [
+        "node_modules",
+        "dist",
+        "src/**/__tests__/**"
+    ]
 }