bench(swe-lite): file-localization snapshot, 6 backends by justrach · Pull Request #495 · justrach/codedb

justrach · 2026-05-22T12:27:26Z

Summary

Adds benchmarks/swe-lite/ — a small file-localization snapshot of
6 retrieval backends on 4 SWE-bench Lite instances, graded by a
deterministic oracle (file-path match against the merged upstream
patch). Sibling to benchmarks/search-shootout/ (React corpus,
LLM-judged); together they form a two-panel view — judge-graded
quality, plus a verifiable file-localization signal.
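
The deterministic oracle described above can be sketched as follows. This is a minimal illustration, not the code shipped in this PR: the function names `normalize` and `grade`, the result keys, and the example paths are all hypothetical, and it assumes each backend run yields a ranked list of repo-relative file paths.

```python
from pathlib import PurePosixPath


def normalize(path: str) -> str:
    """Normalize a repo-relative path so './pkg/a.py' and 'pkg/a.py' compare equal."""
    return str(PurePosixPath(path.strip()))


def grade(ranked_files: list[str], gold_files: set[str]) -> dict:
    """Grade one (backend, task) cell by exact file-path match.

    recall: any gold file appears anywhere in the ranked list.
    top1:   the first ranked file is a gold file.
    """
    gold = {normalize(p) for p in gold_files}
    ranked = [normalize(p) for p in ranked_files]
    return {
        "recall": any(p in gold for p in ranked),
        "top1": bool(ranked) and ranked[0] in gold,
    }


# Hypothetical paths, for illustration only.
hit_but_not_first = grade(["pkg/helpers.py", "./pkg/core.py"], {"pkg/core.py"})
hit_first = grade(["pkg/core.py"], {"pkg/core.py"})
```

Because the grade depends only on string equality of paths, two runs over the same ranked lists always agree, which is the property that separates this panel from the LLM-judged one.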

- 4 tasks: pallets__flask-4045, psf__requests-2148,
  psf__requests-2674, mwaskom__seaborn-2848 (base_commits pinned).
- 6 backends: codedb (CLI), codedb_CONTEXT (MCP composer),
  leanctx, fts5_trigram, codegraph (CLI primitive), codegraph_CONTEXT
  (CLI composer; v0.9.3).
- Hypothesis-framed, not a dominance claim. n=4 is a sanity check.
- Mixed measurement methodology, explicitly annotated. Codegraph
  rows reflect subprocess + fixed query plan (no LLM loop); other
  rows reflect a full agent loop. Marked * in the table; explained
  in RESULTS.md §Measurement caveat.

Headline:

backend              recall  top-1  avg calls  avg wall (s)  avg tokens
-------------------  ------  -----  ---------  ------------  ----------
codedb               4/4     3/4    26.75      42.00         37,954
codedb_CONTEXT       4/4     3/4     2.25       1.25         14,716
leanctx              4/4     3/4     9.75      27.25         30,172
fts5_trigram         4/4     4/4    13.75      24.75         25,800
codegraph *          4/4     3/4     3.00       0.17          1,981
codegraph_CONTEXT *  2/4     2/4     1.00       0.11          4,146

Files:

- results.json: frozen data (24 cells + summary + hypothesis)
- replay.py: recomputes per-backend averages from raw cells,
  asserts the summary matches, prints the matrix
- RESULTS.md: publishable read with caveats and falsifiable
  hypothesis
- README.md: folder guide; positions vs search-shootout
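
The replay check listed above (recompute averages from raw cells, then compare against the frozen summary) might look roughly like this. It is a sketch under assumptions: the cell keys (`backend`, `recall`, `top1`, `calls`), the function names, and the `demo_backend` values are invented for illustration and are not taken from results.json or replay.py.

```python
from collections import defaultdict
from math import isclose


def recompute(cells: list[dict]) -> dict:
    """Group raw cells by backend and recompute counts and averages."""
    grouped = defaultdict(list)
    for cell in cells:
        grouped[cell["backend"]].append(cell)
    summary = {}
    for backend, rows in grouped.items():
        summary[backend] = {
            "recall": sum(1 for r in rows if r["recall"]),
            "top1": sum(1 for r in rows if r["top1"]),
            "avg_calls": sum(r["calls"] for r in rows) / len(rows),
        }
    return summary


def verify(stored: dict, recomputed: dict, tol: float = 1e-6) -> list[str]:
    """Return a list of mismatches; an empty list means the summary is consistent."""
    errors = []
    for backend, expected in recomputed.items():
        got = stored.get(backend)
        if got is None:
            errors.append(f"{backend}: missing from stored summary")
            continue
        for key, value in expected.items():
            if key not in got or not isclose(got[key], value, abs_tol=tol):
                errors.append(
                    f"{backend}.{key}: stored {got.get(key)!r}, recomputed {value!r}"
                )
    return errors


# Made-up cells for a hypothetical backend (not data from this PR).
cells = [
    {"backend": "demo_backend", "recall": True, "top1": True, "calls": 2},
    {"backend": "demo_backend", "recall": True, "top1": True, "calls": 2},
    {"backend": "demo_backend", "recall": True, "top1": True, "calls": 2},
    {"backend": "demo_backend", "recall": True, "top1": False, "calls": 3},
]
stored = {"demo_backend": {"recall": 4, "top1": 3, "avg_calls": 2.25}}
errors = verify(stored, recompute(cells))
```

Collecting mismatches into a list, rather than asserting on the first one, lets a single run report every inconsistent cell before exiting non-zero.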

Test plan

- python3 benchmarks/swe-lite/replay.py exits 0 (summary
  matches recomputed cells)
- python3 benchmarks/swe-lite/replay.py --json emits valid JSON
- Codegraph re-verified at v0.9.3 (released 2026-05-22); file
  lists byte-identical to v0.7.10
- Reviewer eyeball of RESULTS.md framing, hypothesis statement,
  and measurement caveat

🤖 Generated with Claude Code

…s, deterministic oracle

Publishes the SWE-bench Lite file-localization view of the codedb-vs-peers shootout. Complements `benchmarks/search-shootout/` (hand-authored React tasks + Claude-as-judge) with a verifiable ground-truth view: gold = the file the merged upstream PR patched, oracle = deterministic file-path match.

Instances: pallets__flask-4045, psf__requests-2148, psf__requests-2674, mwaskom__seaborn-2848.
Backends: codedb (CLI), codedb_CONTEXT (MCP composer), leanctx, fts5_trigram.

Headline: all four backends recall the gold file (4/4). Top-1 splits at one task: fts5_trigram 4/4, the other three at 3/4 (the seaborn axisgrid/_oldcore call-trace ordering). codedb_CONTEXT is the sole Pareto-optimal point on (quality, efficiency): 2.25 calls / 1.25s / 14.7k tokens vs 9.75-26.75 calls and 24.75-42s for the rest.

The accompanying RESULTS.md flags the deployment-shape caveat that caused an earlier CLI-only read to misrepresent codedb's efficiency: when a tool has multiple deployment surfaces, the bench has to compare primary-against-primary, not a side surface against peers' primaries.

Caveats are spelled out in RESULTS.md (n=4 is a sanity check, not a statistic; file-localization ≠ patch-correctness; replay-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two backends to the matrix:

- `codegraph`: primitive `codegraph query` surface, driven by a fixed 3-query plan per task (subprocess only, no LLM loop). 4/4 recall, 3/4 top-1, 3.00 avg calls, 0.17s wall.
- `codegraph_CONTEXT`: task-shaped `codegraph context` composer, single call per task. 2/4 recall, 2/4 top-1. Misses both `requests` tasks by surfacing urllib3 internals over the requests-layer wrapper where the gold patch actually lands.

Codegraph rows are explicitly annotated `measurement: tool_output_only` in `results.json`. `replay.py` marks them with `*` in the table and prints a footnote: subprocess time + stdout bytes/4, NOT a full LLM-driven agent loop, so the efficiency cells are not directly comparable to the other rows. Quality cells (recall, top-1) ARE directly comparable.

Reframes RESULTS.md as a hypothesis snapshot rather than a dominance claim: small sample, mixed measurement methodology, and the doc now ends in a stated hypothesis (codedb_CONTEXT is the cheapest backend in the 3/4-top-1 cluster; codegraph primitive would likely join that cluster under matched methodology) along with the falsification path (live runner, more tasks, patch oracle). README updated to match.

Verify still passes: `python3 replay.py`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
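
For readers wondering what a `tool_output_only` measurement amounts to, the commit message above describes it as subprocess time plus stdout bytes divided by 4. A minimal sketch of that idea follows; the helper name `measure`, the returned keys, and the choice of integer division are assumptions, and the command is a stand-in Python one-liner rather than a real codegraph invocation.

```python
import subprocess
import sys
import time


def measure(cmd: list[str]) -> dict:
    """Time one subprocess call and estimate tokens as stdout bytes // 4.

    No LLM is involved: this captures tool latency and output size only,
    so it is not comparable to the cost of a full agent loop.
    """
    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True, check=False)
    wall_s = time.perf_counter() - start
    return {
        "wall_s": wall_s,
        "tokens_est": len(proc.stdout) // 4,  # rough bytes-per-token heuristic
        "returncode": proc.returncode,
    }


# Stand-in command that prints 400 characters plus a newline.
cell = measure([sys.executable, "-c", "print('x' * 400)"])
```

The bytes/4 figure is a size proxy, not a tokenizer count, which is one more reason the starred efficiency cells should be read separately from the agent-loop rows.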

Upgraded codegraph 0.7.10 -> 0.9.3 (five minor versions of drift on the tool we're benchmarking; unfair to measure stale). Re-indexed all 4 corpora and re-ran both surfaces.

Result: file lists are byte-identical to v0.7.10 on all 4 tasks × both surfaces. Wall times within normal variance. The quality picture in RESULTS.md is robust to the version bump.

Adds `backend_versions` to results.json metadata and a one-line note near the top of RESULTS.md so future readers know which codegraph version produced the numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05952b4dd2

chatgpt-codex-connector · 2026-05-22T12:29:28Z

+def print_table(snapshot: dict, recomputed: dict) -> None:
+    backends = snapshot["backends"]
+    measurement = {
+        b: snapshot["summary"]["by_backend"][b].get("measurement")


Guard missing backend summaries before table rendering

The non-JSON path can crash with a KeyError instead of reporting a verification failure when results.json is hand-edited and a backend is listed in backends but missing from summary.by_backend (or missing cells). print_table indexes snapshot["summary"]["by_backend"][b] directly, and main calls print_table before acting on errors, so malformed snapshots produce a traceback rather than the intended "VERIFY FAILED" diagnostics.
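
One way to address this review comment is to collect missing-backend problems as verification errors instead of indexing directly. The sketch below assumes only the snapshot keys visible in the quoted diff (`backends`, `summary`, `by_backend`, `measurement`); the helper name `measurement_by_backend` and the error wording are hypothetical, and the actual fix in replay.py may differ.

```python
def measurement_by_backend(snapshot: dict) -> tuple[dict, list[str]]:
    """Look up each backend's measurement annotation without raising KeyError.

    Returns (measurement, errors); a backend listed in "backends" but absent
    from summary.by_backend is reported in errors and skipped.
    """
    errors: list[str] = []
    by_backend = snapshot.get("summary", {}).get("by_backend", {})
    measurement: dict = {}
    for backend in snapshot.get("backends", []):
        entry = by_backend.get(backend)
        if entry is None:
            errors.append(
                f"backend {backend!r} is listed in backends "
                "but missing from summary.by_backend"
            )
            continue
        measurement[backend] = entry.get("measurement")
    return measurement, errors


# Hand-edited snapshot: "leanctx" is listed but has no summary entry.
snapshot = {
    "backends": ["codegraph", "leanctx"],
    "summary": {"by_backend": {"codegraph": {"measurement": "tool_output_only"}}},
}
measurement, errors = measurement_by_backend(snapshot)
```

With the errors returned rather than raised, the caller can print its verification diagnostics first and only render the table for backends that passed the check.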

github-actions · 2026-05-22T12:29:58Z

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool               Base (ns)  Head (ns)  Delta    Abs Delta (ns)  Status
-----------------  ---------  ---------  -------  --------------  ------
`codedb_bundle`    526626     526664     +0.01%   +38             OK
`codedb_changes`   54541      59401      +8.91%   +4860           OK
`codedb_deps`      8828       10050      +13.84%  +1222           NOISE
`codedb_edit`      7090       7929       +11.83%  +839            NOISE
`codedb_find`      60736      62953      +3.65%   +2217           OK
`codedb_hot`       104705     99147      -5.31%   -5558           OK
`codedb_outline`   302949     313055     +3.34%   +10106          OK
`codedb_read`      107450     107421     -0.03%   -29             OK
`codedb_search`    150034     156574     +4.36%   +6540           OK
`codedb_snapshot`  292078     290140     -0.66%   -1938           OK
`codedb_status`    12806      15250      +19.08%  +2444           NOISE
`codedb_symbol`    61897      62587      +1.11%   +690            OK
`codedb_tree`      63830      59222      -7.22%   -4608           OK
`codedb_word`      85217      82017      -3.76%   -3200           OK
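
The Status column in the report above is consistent with a two-threshold rule, sketched here. This is an inference from the report text, not the CI script itself: the function name `classify`, the `FAIL` label, the treatment of speedups as OK, and the exact boundary comparisons are all assumptions.

```python
def classify(
    base_ns: int,
    head_ns: int,
    pct_threshold: float = 10.0,
    abs_threshold_ns: int = 50_000,
) -> str:
    """Classify a benchmark delta using a percentage and an absolute threshold.

    OK:    slowdown does not exceed the percentage threshold (or is a speedup).
    NOISE: percentage threshold exceeded, but the absolute delta is too small.
    FAIL:  both thresholds exceeded (hypothetical label for a CI failure).
    """
    delta_ns = head_ns - base_ns
    pct = delta_ns / base_ns * 100.0
    if pct <= pct_threshold:
        return "OK"
    if delta_ns < abs_threshold_ns:
        return "NOISE"
    return "FAIL"


# Rows taken from the report above.
deps_status = classify(8828, 10050)       # +13.84%, +1222 ns
changes_status = classify(54541, 59401)   # +8.91%, +4860 ns
hot_status = classify(104705, 99147)      # -5.31%, a speedup
```

Requiring both thresholds keeps sub-microsecond jitter on very fast tools from failing CI while still catching large relative slowdowns on slower ones.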

justrach and others added 3 commits May 22, 2026 20:01

justrach wants to merge 3 commits into main from bench/publish-swe-lite-results
