[FE Feat] Evaluation scenario filtering #4405
…wed)
Phased plan to productionize the EtlPocScenarios PoC into the real EvalRunDetails scenarios table: thin store, schema columns, self-hydrating cells, ETL filtering, comparison, live updates; retire the PoC.
Eng review: 5 outside-voice gaps folded in (non-terminal rendering unbuilt, useEtlColumns drops "other" columns, perf premise unmeasured → perf gate, T5 comparison is a build, CSV export path missed).
Design review: focused on interaction states + filter UX (5/10 → 9/10).
Reading evaluationPreviewTableStore.ts confirmed it is already a thin store:
PreviewTableRow carries only identity + testcaseId + status + scenarioIndex
+ comparison fields (no column data), and already does per-eval-type window
order. T1 ("promote a thin store") was lateral churn — re-implementing an
existing store. Phase 1 collapses to the coupled T2+T3 column+cell swap
against the existing store. Confirms the eng-review outside voice's finding.
Reading Table.tsx showed the CSV export path (exportResolveValue, columnLookupMap, loadAllPagesBeforeExport) is keyed off columnResult column ids, which differ from useEtlColumns keys. Phase 1 swaps display columns only; usePreviewColumns/columnResult stay alive for export until Phase 3 (T5). The "other"-column un-drop ripples into ColumnLeaf, EtlResolvedCell, and useCellMaterialization.
Phase 1 (T2+T3) of eval-scenarios-table-integration. The eval run scenarios table now derives its schema columns from the run graph and renders cells that self-hydrate from molecule caches, replacing the backend-metadata + per-cell-fetch display path.
- groupRunColumns: pure mapping-grouping in @agenta/entities etl; keeps "other"-kind columns (the PoC dropped them)
- EvalRunDetails/etl/: useEtlColumns, useHydrateScenarios, run-aware useCellMaterialization, useScopeChangeEviction, EtlColumnHeader, EtlResolvedCell
- EtlResolvedCell renders pending/running/failed scenarios distinctly and distinguishes slice-not-hydrated from scenario-not-run
- Table.tsx swaps only the display columns; usePreviewColumns / columnResult stay alive for the CSV export path (Phase 3)
- column-parity regression test for groupRunColumns
Phase 2 filtering ships multi-condition AND/OR from day 1, not the PoC's single predicate. Records the decision, generalises the planned predicate type to a flat condition group, and marks T2+T3 as shipped.
…rSchema
Pre-stages the Phase 2 (T4) filtering core: pure logic, decoupled from the D5 perf gate that gates T4 wiring.
- PredicateGroup (flat AND/OR, one nesting level) + RowFilter type; evaluateRowPredicate / evaluatePredicateGroup / evaluateRowFilter / matchesRowFilter row-level evaluators; makePredicateGroupFilter pipeline transform. makeRowPredicateFilter left intact.
- predicateToEntitySlices accepts a PredicateGroup; the slice set is the union of every condition's slices (AND vs OR does not change the fetch).
- buildFilterSchema: derives filterable fields + type-matched operators from the run schema, with a resolveValueType refinement seam.
- 25 unit tests (predicateGroup, filterSchema).
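The flat group evaluation described above can be sketched as follows. This is a minimal illustration, not the shipped code: the `Condition` and `PredicateGroup` shapes, the operator names, `evaluateGroup`, and the `sliceOf` callback in `slicesForGroup` are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real PredicateGroup / RowFilter types may differ.
type Operator = "eq" | "neq" | "gt" | "lt";
type Scalar = string | number | boolean;

interface Condition {
  column: string;
  operator: Operator;
  value: Scalar;
}

interface PredicateGroup {
  op: "and" | "or"; // one op for the whole group (flat, one nesting level)
  conditions: Condition[];
}

type Row = Record<string, Scalar | undefined>;

function evaluateCondition(row: Row, c: Condition): boolean {
  const cell = row[c.column];
  if (cell === undefined) return false; // assumption: an unknown value never matches
  switch (c.operator) {
    case "eq":
      return cell === c.value;
    case "neq":
      return cell !== c.value;
    case "gt":
      return typeof cell === "number" && typeof c.value === "number" && cell > c.value;
    case "lt":
      return typeof cell === "number" && typeof c.value === "number" && cell < c.value;
  }
}

function evaluateGroup(row: Row, group: PredicateGroup): boolean {
  if (group.conditions.length === 0) return true; // assumption: empty filter matches all rows
  return group.op === "and"
    ? group.conditions.every((c) => evaluateCondition(row, c))
    : group.conditions.some((c) => evaluateCondition(row, c));
}

// The data to fetch is the union over all conditions, whatever the group op is.
function slicesForGroup(group: PredicateGroup, sliceOf: (column: string) => string): Set<string> {
  return new Set(group.conditions.map((c) => sliceOf(c.column)));
}
```

Because both AND and OR need every referenced value before a row can be decided, `slicesForGroup` ignores `group.op` entirely.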
Two regressions from the Phase 1 ETL column swap, both because ETL columns derive from the run's raw mappings, not the curated backend column set:
- `testcase_dedup_id` (any `*_dedup_id` column) was rendered. These are internal dedup keys, not user-facing; `groupRunColumns` now drops them, matching the backend-metadata column path.
- the static invocation-metrics group (cost / duration / tokens) disappeared. `useEtlColumns` now skips metrics-kind groups and `Table.tsx` keeps the production metric group(s), rendered by the existing metric cell.
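A rough sketch of the two column rules above, under assumed shapes: `RunColumn`, the `"metrics"` kind value, and the function names are hypothetical and only illustrate the filtering logic, not where it lives in `groupRunColumns` or `useEtlColumns`.

```typescript
// Hypothetical column shape; the real run-graph column type may differ.
interface RunColumn {
  name: string;
  kind: string; // e.g. "testset", "evaluator", "metrics", "other" (assumed values)
}

// Internal dedup keys follow the `*_dedup_id` naming pattern.
function isInternalDedupColumn(name: string): boolean {
  return name.endsWith("_dedup_id");
}

// Columns the ETL display path renders: no dedup keys, no metrics-kind
// columns (those stay with the existing production metric group).
function etlDisplayColumns(columns: RunColumn[]): RunColumn[] {
  return columns.filter((c) => !isInternalDedupColumn(c.name) && c.kind !== "metrics");
}
```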
Wires Phase 2 filtering into the real scenarios table (multi-condition AND/OR, decision D8), not the /etl-poc page.
- ScenarioFilterBar: a filter bar in the table header with column / operator / value per condition, AND/OR join, add / remove / clear. Columns and type-matched operators come from buildFilterSchema; a name heuristic refines value types for the common cases.
- scenarioFilterState: per-run PredicateGroup atom; half-built conditions are dropped from the evaluated filter.
- useScenarioFilter: filters the base rows via evaluateRowFilter against molecule-cache-resolved columns, with a viewport-fill loop so a strict filter still fills the viewport; unhydrated rows stay visible until known; the confirmed-match count gates the loop.
- Table.tsx: feeds filtered base rows into the merge, drives predicate-aware hydration, remounts the table on filter change, guards cell renders against the transient undefined record.
Only evaluator-output and metric columns are offered in the filter bar for v1 — testset (input) and application (output) columns are withheld behind a UI allowlist (FILTERABLE_COLUMN_KINDS). The filter engine and buildFilterSchema still support every column kind, so enabling the rest later is a one-line flip — no structural change.
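The allowlist idea can be illustrated like this. Only the constant name `FILTERABLE_COLUMN_KINDS` comes from the commit message; the kind strings, the `FilterField` shape, and `offeredFields` are assumptions for the sketch.

```typescript
// Assumed kind names; the real column kinds may be spelled differently.
type ColumnKind = "testset" | "application" | "evaluator" | "metric";

interface FilterField {
  key: string;
  kind: ColumnKind;
}

// UI-level allowlist: the engine supports every kind, the bar offers a subset.
const FILTERABLE_COLUMN_KINDS: ReadonlySet<ColumnKind> = new Set<ColumnKind>(["evaluator", "metric"]);

function offeredFields(schemaFields: FilterField[]): FilterField[] {
  return schemaFields.filter((f) => FILTERABLE_COLUMN_KINDS.has(f.kind));
}
```

Widening the filter bar later would then mean adding kinds to the set, with no change to the evaluation code.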
The filter bar typed columns with a name heuristic, so a boolean evaluator output (e.g. an LLM-judge field) was offered numeric comparators and a numeric value input. Column value types now come from the evaluator output schema: the columnResult column `metricType` — itself read from each output property's JSON-schema `type` by `extractMetrics` — maps to the filter value type via `buildColumnValueTypeResolver`. A boolean column now gets only equals / not-equals and a true/false input; a numeric one gets the comparators. The name heuristic is removed.
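As a sketch of the type-to-operator mapping described above: the function names, operator identifiers, and the string operators are assumptions, and `metricType` is taken to hold JSON-schema type names as the commit message states.

```typescript
type FilterValueType = "boolean" | "number" | "string";

// Map a JSON-schema type name (as carried by metricType) to a filter value type.
function toFilterValueType(metricType: string | undefined): FilterValueType {
  if (metricType === "boolean") return "boolean";
  if (metricType === "number" || metricType === "integer") return "number";
  return "string"; // assumption: anything else is treated as text
}

// Operators offered per value type. The identifiers are illustrative.
function operatorsFor(valueType: FilterValueType): string[] {
  switch (valueType) {
    case "boolean":
      return ["eq", "neq"]; // equals / not-equals only
    case "number":
      return ["eq", "neq", "gt", "gte", "lt", "lte"];
    case "string":
      return ["eq", "neq", "contains"]; // "contains" is an assumed text operator
  }
}
```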
extractMetrics typed an output property only when `schema.type` was a
plain string, so a nullable field (`type: ["boolean", "null"]` or
`anyOf: [{type: "boolean"}, {type: "null"}]`) fell back to "string" —
which made the scenario filter bar offer a boolean field a text input
instead of true/false.
resolveSchemaType (new, dependency-free module) unwraps array and
anyOf/oneOf nullable encodings to the first non-"null" type.
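A minimal sketch of such an unwrapping function, assuming a reduced schema shape (`JsonSchemaLike` is hypothetical, and the real `resolveSchemaType` may handle more cases):

```typescript
// Reduced schema shape for illustration only.
interface JsonSchemaLike {
  type?: string | string[];
  anyOf?: JsonSchemaLike[];
  oneOf?: JsonSchemaLike[];
}

function resolveSchemaType(schema: JsonSchemaLike): string | undefined {
  // Plain string type: use it directly.
  if (typeof schema.type === "string") return schema.type;
  // Array encoding, e.g. ["boolean", "null"]: first non-"null" entry.
  if (Array.isArray(schema.type)) return schema.type.find((t) => t !== "null");
  // anyOf / oneOf encoding: first variant that resolves to a non-"null" type.
  for (const variant of schema.anyOf ?? schema.oneOf ?? []) {
    const resolved = resolveSchemaType(variant);
    if (resolved !== undefined && resolved !== "null") return resolved;
  }
  return undefined;
}
```

With this, both nullable encodings quoted above resolve to `"boolean"` instead of falling back to `"string"`.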
The filtered row list kept unhydrated rows "visible until known", so it grew and shrank as hydration revealed non-matches — flickering the table between rows and the empty state until a page finished materializing. filteredBaseRows now holds confirmed matches only: a row appears once it is hydrated AND matches, so the list only ever grows during a scan. Rows materialize as their data arrives, with no show-then-drop. The full empty + loading overlay shows only until the first match lands; after that a "N matches · scanning…" indicator in the filter bar covers the rest of the scan.
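The "confirmed matches only" rule can be sketched as below. `Hydration`, `PendingRow`, and `confirmedMatches` are hypothetical names; the point is only that an unhydrated row is never included, so the list cannot shrink as data arrives.

```typescript
// A row's data is either not loaded yet or loaded with a record (assumed shape).
type Hydration<T> = { hydrated: false } | { hydrated: true; record: T };

interface PendingRow<T> {
  id: string;
  state: Hydration<T>;
}

// Keep a row only once it is hydrated AND matches the filter.
function confirmedMatches<T>(rows: PendingRow<T>[], matches: (record: T) => boolean): string[] {
  return rows
    .filter((row) => {
      const state = row.state;
      return state.hydrated && matches(state.record);
    })
    .map((row) => row.id);
}
```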
The always-visible inline filter conditions wrapped and grew the header row, pushing the table down — worse with every condition added. Following the observability `Filters` pattern, the conditions now live in a popover behind a compact "Filters" button (with an active-condition count badge); the strip itself is a single fixed-height row. Edits are staged in a draft and committed on Apply — Cancel/outside-click discards them — so the table is not re-scanned on every keystroke.
Replaces the top-level "Match All/Any" segmented control with a row-level connector: the first row reads "Where", every later row shows a borderless ghost-style And/Or select. The group has a single op (flat group, D8), so the connectors stay in sync — toggling any one sets the group op.
The "Where" label and the And/Or connector select sat directly in the flex row with their own width classes, which resolved to different widths — so the first row's Column select didn't line up with the rows below. Both connectors now sit inside one shared fixed-width slot, so every row's Column select starts at the same x.
The 64px connector slot truncated the default-size And/Or select to "A…". Widened to 80px so the connector shows in full.
The scanning indicator gated on `hasMore`, so it stayed on forever once the viewport-fill loop had stopped (it stops at the match target) — the dataset still "has more" pages, but the loop is idle. useScenarioFilter now exposes `isFilling` (the loop still intends to load pages). scanInProgress = isFilling OR a page fetch / hydrate batch in flight — so "scanning" turns off once enough matches are found and nothing is in flight, and reappears only while a (scroll-triggered) load is actually running.
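The indicator logic reads roughly as follows. The state shapes and the way `isFilling` is derived (more pages exist and the match target is not yet reached) are assumptions based on the description, not the hook's actual internals.

```typescript
interface FillState {
  hasMore: boolean; // the dataset has further pages
  confirmedMatches: number;
  matchTarget: number; // the loop stops once this many matches are found
}

interface InFlight {
  pageFetch: boolean;
  hydrateBatch: boolean;
}

// Assumed derivation: the loop still intends to load pages.
function isFilling(fill: FillState): boolean {
  return fill.hasMore && fill.confirmedMatches < fill.matchTarget;
}

// "Scanning" shows while the loop is active or any load is actually running.
function scanInProgress(fill: FillState, inFlight: InFlight): boolean {
  return isFilling(fill) || inFlight.pageFetch || inFlight.hydrateBatch;
}
```

Gating on `hasMore` alone would keep the indicator on in the idle case where the target is reached but more pages exist.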
The filter sat on its own strip below the run header, taking a second line. It now renders inline in the "Evaluations:" header row (Scenarios tab only) — one line for everything. The filter bar is now self-contained: given a runId it derives the run schema, column value types, and live scan status from atoms. The scenarios table publishes its scan status (match count + scanning) via scenarioFilterStatusAtomFamily; the header's filter bar reads it.
The filter is now a compact icon-only funnel button with a condition-
count badge, placed in the header's right group just before "Compare"
(was a text "Filters" button + inline match-count on the left).
The match-count / scanning indicator moves into the popover header
("N matches · scanning…" next to the title), so the closed affordance
stays minimal.
The /etl-poc test page proved the run-graph + molecule-cache strategy that Phase 1 + 2 shipped into the real scenarios table. Production has its own copies of the ported hooks, so the PoC is now dead test-page code.
- delete components/EtlPocScenarios/ and the /etl-poc routes (oss + ee)
- drop the /etl-poc branch from the Layout human-eval route check
- mark T4 + T7 done in the design doc; Phase 3 (T5/T6/T8) remains
The filter engine already supported in/nin; the popover now exposes them. A list operator shows a tag-style multi-value input (comma / Enter to add entries), numeric columns coerce entries to numbers, and a condition with an empty list counts as incomplete. Switching a condition between a scalar and a list operator resets its value to the right shape.
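A small sketch of the list-value handling described above. The function names are hypothetical, and silently dropping entries that do not parse as numbers is an assumption; the real input may reject them differently.

```typescript
// Turn raw tag entries into the value list for an in / nin condition.
function parseListEntries(raw: string[], numeric: boolean): (string | number)[] {
  const values: (string | number)[] = [];
  for (const entry of raw) {
    const trimmed = entry.trim();
    if (trimmed === "") continue; // ignore blank tags
    if (numeric) {
      const n = Number(trimmed);
      if (Number.isNaN(n)) continue; // assumption: unparsable entries are dropped
      values.push(n);
    } else {
      values.push(trimmed);
    }
  }
  return values;
}

// A list condition with no values is incomplete and is left out of the filter.
function isListConditionComplete(values: (string | number)[]): boolean {
  return values.length > 0;
}
```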
Comparison interleaves each matched base row with its compare-run counterparts joined by testcase_id (mergedRows). But with a filter active the viewport-fill loop only scans the base run, and a strict filter means the table may never scroll — so compare runs stayed on page 1 and the join missed most counterparts. While a filter is active, eagerly load every compare-run page so the testcase-id join can resolve every matched base row's counterpart.
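The join can be sketched as below, which also shows why partially loaded compare runs lose counterparts. `ScenarioRow` and `mergeRows` are hypothetical names for the illustration; only the testcase-id join itself comes from the description.

```typescript
interface ScenarioRow {
  runId: string;
  testcaseId: string;
  scenarioId: string;
}

// Interleave each base row with its counterpart from every compare run,
// joined on testcaseId. Counterparts on unloaded pages are simply missing.
function mergeRows(baseRows: ScenarioRow[], compareRuns: ScenarioRow[][]): ScenarioRow[] {
  const indexes = compareRuns.map(
    (rows) => new Map<string, ScenarioRow>(rows.map((r): [string, ScenarioRow] => [r.testcaseId, r])),
  );
  const merged: ScenarioRow[] = [];
  for (const base of baseRows) {
    merged.push(base);
    for (const index of indexes) {
      const counterpart = index.get(base.testcaseId);
      if (counterpart) merged.push(counterpart);
    }
  }
  return merged;
}
```

If a compare run has only its first page loaded, any matched base row whose testcase sits on a later page gets no counterpart, which is the gap the eager load closes.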
While a run is still executing, periodically refetch the loaded scenario pages (to refresh row statuses) and evict + re-prefetch the results / metrics molecule caches of running or just-finished scenarios. Without this a scenario that completes after the table loaded keeps a stale empty cache and a stale "running" status, so its cells show the running indicator forever. The loop runs one final pass when the run reaches a terminal status, then stops.
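One way to pick which scenarios need their caches evicted on each tick is sketched below. The status strings, `isNonTerminal`, and `scenariosToRefresh` are assumptions for illustration; the actual loop may track this differently.

```typescript
// Assumed non-terminal statuses.
function isNonTerminal(status: string | undefined): boolean {
  return status === "pending" || status === "running";
}

// Scenarios whose caches should be evicted and re-prefetched this tick:
// still running now, or running on the previous tick and just finished.
function scenariosToRefresh(previous: Map<string, string>, current: Map<string, string>): string[] {
  const ids: string[] = [];
  current.forEach((status, id) => {
    const wasRunning = isNonTerminal(previous.get(id));
    if (isNonTerminal(status) || wasRunning) ids.push(id);
  });
  return ids;
}
```

Including the just-finished case is what replaces the stale empty cache of a scenario that completed after the table loaded.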
The focus drawer and SingleScenarioViewer were traced through scenarioColumnValues.ts and SingleScenarioViewerPOC: both resolve their values via the old data path's independent atom families, so the T2+T3 cell swap does not regress them. Full ETL migration is deferred — useScenarioCellValue still backs the static invocation-metrics group kept in the table and the CSV export, so it cannot be deleted yet. Marks Phases 1-3 complete.
The focus drawer and SingleScenarioViewer resolve their values from the scenario-steps and evaluation-metric query atoms, which never polled — opening the drawer on a running scenario showed a frozen snapshot. Add a run-status-gated refetchInterval (5s while non-terminal, off once terminal) mirroring evaluationRunQueryAtomFamily. Since evaluation-metric also backs the table's static invocation-metric cells, those columns now refresh live during a run too.
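The status-gated interval might look like the sketch below. The set of terminal status names and the function name are assumptions, as is the choice not to poll while the status is still unknown; the return type follows the number-or-`false` convention of a `refetchInterval` option.

```typescript
// Assumed terminal statuses; the real run status enum may differ.
const TERMINAL_STATUSES: ReadonlySet<string> = new Set(["success", "failure", "errors", "cancelled"]);

const LIVE_REFETCH_MS = 5000; // 5s while the run is non-terminal

// Returns a polling interval in ms, or false to disable polling.
function liveRefetchInterval(runStatus: string | undefined): number | false {
  if (runStatus === undefined) return false; // assumption: no polling until status is known
  return TERMINAL_STATUSES.has(runStatus) ? false : LIVE_REFETCH_MS;
}
```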
be9ebb7
into
fe-experiment/etl-batch-add-traces
Summary
Implements filtering for evaluation run scenarios.
Testing
Verified locally
QA follow-up
testing filtering with:
Demo
loom