bench(swe-lite): file-localization snapshot, 6 backends #495
…s, deterministic oracle

Publishes the SWE-bench Lite file-localization view of the codedb-vs-peers shootout. Complements `benchmarks/search-shootout/` (hand-authored React tasks + Claude-as-judge) with a verifiable ground-truth view: gold = the file the merged upstream PR patched, oracle = deterministic file-path match.

Instances: pallets__flask-4045, psf__requests-2148, psf__requests-2674, mwaskom__seaborn-2848.
Backends: codedb (CLI), codedb_CONTEXT (MCP composer), leanctx, fts5_trigram.

Headline: all four backends recall the gold file (4/4). Top-1 splits on one task: fts5_trigram scores 4/4, the other three 3/4 (the seaborn axisgrid/_oldcore call-trace ordering). codedb_CONTEXT is the sole Pareto-optimal point on (quality, efficiency): 2.25 calls / 1.25s / 14.7k tokens vs 9.75-26.75 calls and 24.75-42s for the rest.

The accompanying RESULTS.md flags the deployment-shape caveat that caused an earlier CLI-only read to misrepresent codedb's efficiency: when a tool has multiple deployment surfaces, the bench has to compare primary-against-primary, not a side surface against peers' primaries.

Caveats are spelled out in RESULTS.md (n=4 is a sanity check, not a statistic; file-localization ≠ patch-correctness; replay-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two backends to the matrix:

- `codegraph`: primitive `codegraph query` surface, driven by a fixed 3-query plan per task (subprocess only, no LLM loop). 4/4 recall, 3/4 top-1, 3.00 avg calls, 0.17s wall.
- `codegraph_CONTEXT`: task-shaped `codegraph context` composer, single call per task. 2/4 recall, 2/4 top-1. It misses both `requests` tasks by surfacing urllib3 internals over the requests-layer wrapper where the gold patch actually lands.

Codegraph rows are explicitly annotated `measurement: tool_output_only` in `results.json`. `replay.py` marks them with `*` in the table and prints a footnote: subprocess time + stdout bytes/4, NOT a full LLM-driven agent loop, so the efficiency cells are not directly comparable to the other rows. Quality cells (recall, top-1) ARE directly comparable.

Reframes RESULTS.md as a hypothesis snapshot rather than a dominance claim: small sample, mixed measurement methodology, and the doc now ends in a stated hypothesis (codedb_CONTEXT is the cheapest backend in the 3/4-top-1 cluster; codegraph primitive would likely join that cluster under matched methodology) along with the falsification path (live runner, more tasks, patch oracle). README updated to match.

Verify still passes: `python3 replay.py`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
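The `tool_output_only` measurement described above (subprocess time plus stdout bytes/4) could be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the function name `measure_tool_output`, the returned keys, and the use of integer division are all assumptions, and the command run here is a stand-in Python one-liner rather than a real `codegraph` invocation.

```python
import subprocess
import sys
import time


def measure_tool_output(argv: list[str]) -> dict:
    """Time one subprocess call and estimate tokens as stdout bytes / 4.

    Hypothetical helper: the name and the returned keys are illustrative.
    """
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, check=False)
    wall_s = time.perf_counter() - start
    return {
        "wall_s": wall_s,
        # Rough token estimate; integer division is an assumption.
        "est_tokens": len(proc.stdout) // 4,
        "measurement": "tool_output_only",
    }


if __name__ == "__main__":
    # Stand-in command; a real run would invoke the tool under test instead.
    print(measure_tool_output([sys.executable, "-c", "print('x' * 39)"]))
```

Because this captures only tool latency and output size, it omits model turns and prompt tokens, which is why such rows are not comparable on efficiency with rows measured through a full agent loop.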
Upgraded codegraph 0.7.10 -> 0.9.3 (five minor versions of drift on the tool we're benchmarking; unfair to measure stale). Re-indexed all 4 corpora and re-ran both surfaces.

Result: file lists are byte-identical to v0.7.10 on all 4 tasks × both surfaces. Wall times within normal variance. The quality picture in RESULTS.md is robust to the version bump.

Adds `backend_versions` to results.json metadata and a one-line note near the top of RESULTS.md so future readers know which codegraph version produced the numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 05952b4dd2
    def print_table(snapshot: dict, recomputed: dict) -> None:
        backends = snapshot["backends"]
        measurement = {
            b: snapshot["summary"]["by_backend"][b].get("measurement")
Guard missing backend summaries before table rendering
The non-JSON path can crash with a KeyError instead of reporting a verification failure when results.json is hand-edited and a backend is listed in backends but missing from summary.by_backend (or missing cells). print_table indexes snapshot["summary"]["by_backend"][b] directly, and main calls print_table before acting on errors, so malformed snapshots produce a traceback rather than the intended "VERIFY FAILED" diagnostics.
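One possible shape for the guard is sketched below. It is an illustration under assumptions, not the patch that landed: the helper names `find_snapshot_errors` and `measurement_labels` are hypothetical, and only the keys visible in the reviewed snippet (`backends`, `summary.by_backend`, `measurement`) are taken from the source.

```python
def find_snapshot_errors(snapshot: dict) -> list[str]:
    """Collect structural problems as messages instead of raising KeyError."""
    errors = []
    by_backend = snapshot.get("summary", {}).get("by_backend", {})
    for backend in snapshot.get("backends", []):
        if backend not in by_backend:
            errors.append(
                f"backend {backend!r} is listed in 'backends' "
                "but missing from summary.by_backend"
            )
    return errors


def measurement_labels(snapshot: dict) -> dict:
    """Tolerant lookup: a missing backend summary yields None, not a traceback."""
    by_backend = snapshot.get("summary", {}).get("by_backend", {})
    return {
        backend: by_backend.get(backend, {}).get("measurement")
        for backend in snapshot.get("backends", [])
    }
```

With a check like this run before rendering, `main` could print the collected messages under its "VERIFY FAILED" diagnostics and skip or degrade the table, rather than letting `print_table` index into a malformed snapshot.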
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
Summary
Adds `benchmarks/swe-lite/`: a small file-localization snapshot of 6 retrieval backends on 4 SWE-bench Lite instances, graded by a deterministic oracle (file-path match against the merged upstream patch). Sibling to `benchmarks/search-shootout/` (React corpus, LLM-judged); together they form a two-panel view: judge-graded quality, plus a verifiable file-localization signal.

- Instances: `pallets__flask-4045`, `psf__requests-2148`, `psf__requests-2674`, `mwaskom__seaborn-2848` (base_commits pinned).
- Backends: codedb (CLI), codedb_CONTEXT (MCP composer), leanctx, fts5_trigram, codegraph (CLI primitive), codegraph_CONTEXT (CLI composer; v0.9.3).
- Measurement: codegraph rows reflect subprocess + fixed query plan (no LLM loop); other rows reflect a full agent loop. Marked `*` in the table; explained in RESULTS.md §Measurement caveat.
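The deterministic oracle described in this summary (file-path match against the file the upstream patch touched) might look roughly like the sketch below. The function names, the path normalization, and the example paths are assumptions made for illustration; the benchmark's actual grading code may differ.

```python
import posixpath


def normalize(path: str) -> str:
    """Normalize separators and strip a leading './' so equal paths compare equal."""
    return posixpath.normpath(path.replace("\\", "/"))


def grade(ranked_files: list[str], gold_file: str) -> dict:
    """Score one (task, backend) cell against the gold file.

    recall: the gold file appears anywhere in the ranked list.
    top1:   the gold file is the first entry.
    """
    normalized = [normalize(p) for p in ranked_files]
    gold = normalize(gold_file)
    return {
        "recall": gold in normalized,
        "top1": bool(normalized) and normalized[0] == gold,
    }


if __name__ == "__main__":
    # Hypothetical paths, not taken from the benchmark data.
    print(grade(["./pkg/helpers.py", "pkg/core.py"], "pkg/core.py"))
```

Because the comparison is a plain string match on normalized paths, the same inputs always produce the same grade, which is what makes the quality cells comparable across backends regardless of how each one was driven.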
Headline:
Files:
- `results.json`: frozen data (24 cells + summary + hypothesis)
- `replay.py`: recomputes per-backend averages from raw cells, asserts the summary matches, prints the matrix
- `RESULTS.md`: publishable read with caveats and falsifiable hypothesis
- `README.md`: folder guide; positions vs `search-shootout`

Test plan
- `python3 benchmarks/swe-lite/replay.py` exits 0 (summary matches recomputed cells)
- `python3 benchmarks/swe-lite/replay.py --json` emits valid JSON
- codegraph 0.9.3 file lists byte-identical to v0.7.10
- `RESULTS.md` framing, hypothesis statement, and measurement caveat
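The replay check in the test plan (recompute per-backend figures from raw cells, then assert the stored summary matches) can be sketched as below. The cell and summary schema used here (`cells`, `recall`, `top1`, `calls`, `avg_calls`) is an assumption for illustration and is not taken from `results.json`.

```python
import math
from collections import defaultdict


def recompute(cells: list[dict]) -> dict:
    """Aggregate raw cells into per-backend counts and averages."""
    grouped = defaultdict(list)
    for cell in cells:
        grouped[cell["backend"]].append(cell)
    return {
        backend: {
            "recall": sum(bool(r["recall"]) for r in rows),
            "top1": sum(bool(r["top1"]) for r in rows),
            "avg_calls": sum(r["calls"] for r in rows) / len(rows),
        }
        for backend, rows in grouped.items()
    }


def verify(snapshot: dict) -> list[str]:
    """Return mismatch messages; an empty list means the summary is consistent."""
    recomputed = recompute(snapshot["cells"])
    errors = []
    for backend, stated in snapshot["summary"]["by_backend"].items():
        got = recomputed.get(backend)
        if got is None:
            errors.append(f"{backend}: no raw cells found")
            continue
        for key, want in stated.items():
            if key not in got:
                continue  # skip annotations such as a measurement label
            if not math.isclose(got[key], want, abs_tol=1e-9):
                errors.append(f"{backend}.{key}: summary={want} recomputed={got[key]}")
    return errors
```

Returning a list of messages rather than asserting on the first mismatch lets a caller report every inconsistency at once and choose the exit code afterwards.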
🤖 Generated with Claude Code