bench(swe-lite): file-localization snapshot, 6 backends #495

Open
justrach wants to merge 3 commits into main from bench/publish-swe-lite-results
Conversation

@justrach
Owner

Summary

Adds benchmarks/swe-lite/ — a small file-localization snapshot of
6 retrieval backends on 4 SWE-bench Lite instances, graded by a
deterministic oracle (file-path match against the merged upstream
patch). Sibling to benchmarks/search-shootout/ (React corpus,
LLM-judged); together they form a two-panel view — judge-graded
quality, plus a verifiable file-localization signal.
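
The deterministic oracle described above can be sketched roughly as follows. This is a minimal illustration, not the PR's grader: the function names, the path normalization, and the example paths are all assumptions, and it assumes a single gold file per task.

```python
from pathlib import PurePosixPath


def normalize(path: str) -> str:
    """Normalize a repo-relative path so equivalent spellings compare equal."""
    # PurePosixPath drops a leading "./" and collapses repeated slashes.
    return str(PurePosixPath(path.strip().replace("\\", "/")))


def grade(ranked_files: list[str], gold_file: str) -> dict:
    """Grade one (backend, task) cell by exact file-path match.

    recall: the gold file appears anywhere in the backend's ranked output.
    top1:   the gold file is the first file the backend returned.
    """
    gold = normalize(gold_file)
    ranked = [normalize(p) for p in ranked_files]
    return {
        "recall": gold in ranked,
        "top1": bool(ranked) and ranked[0] == gold,
    }


# Hypothetical paths, for illustration only.
cell = grade(["./pkg/internals.py", "pkg/wrapper.py"], "pkg/wrapper.py")
print(cell)
```

Because the check is a plain string comparison against the patched file, two runs over the same ranked list always produce the same grade, which is what makes this panel verifiable where the LLM-judged panel is not.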

  • 4 tasks: pallets__flask-4045, psf__requests-2148,
    psf__requests-2674, mwaskom__seaborn-2848 (base_commits pinned).
  • 6 backends — codedb (CLI), codedb_CONTEXT (MCP composer),
    leanctx, fts5_trigram, codegraph (CLI primitive), codegraph_CONTEXT
    (CLI composer; v0.9.3).
  • Hypothesis-framed, not a dominance claim. n=4 is a sanity check.
  • Mixed measurement methodology, explicitly annotated. Codegraph
    rows reflect subprocess + fixed query plan (no LLM loop); other
    rows reflect a full agent loop. Marked * in the table; explained
    in RESULTS.md §Measurement caveat.

Headline:

backend              recall  top-1  avg calls  avg wall (s)  avg tokens
-------------------  ------  -----  ---------  ------------  ----------
codedb               4/4     3/4    26.75      42.00         37,954
codedb_CONTEXT       4/4     3/4     2.25       1.25         14,716
leanctx              4/4     3/4     9.75      27.25         30,172
fts5_trigram         4/4     4/4    13.75      24.75         25,800
codegraph *          4/4     3/4     3.00       0.17          1,981
codegraph_CONTEXT *  2/4     2/4     1.00       0.11          4,146

Files:

  • results.json — frozen data (24 cells + summary + hypothesis)
  • replay.py — recomputes per-backend averages from raw cells,
    asserts the summary matches, prints the matrix
  • RESULTS.md — publishable read with caveats and falsifiable
    hypothesis
  • README.md — folder guide; positions vs search-shootout
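
The replay check listed above (recompute averages from raw cells, then compare against the frozen summary) could look something like the sketch below. The schema is an assumption: the field names `cells`, `backend`, `calls`, `wall_s`, `tokens`, and the `avg_*` summary keys are hypothetical, and the toy snapshot is illustrative data, not the contents of `results.json`.

```python
import math
from collections import defaultdict

METRICS = ("calls", "wall_s", "tokens")  # hypothetical per-cell field names


def recompute(cells: list[dict]) -> dict:
    """Average each metric per backend from the raw cells."""
    grouped = defaultdict(list)
    for cell in cells:
        grouped[cell["backend"]].append(cell)
    return {
        backend: {m: sum(c[m] for c in rows) / len(rows) for m in METRICS}
        for backend, rows in grouped.items()
    }


def verify(snapshot: dict, tol: float = 0.005) -> list[str]:
    """Return a list of mismatches between stored and recomputed averages."""
    recomputed = recompute(snapshot["cells"])
    errors = []
    for backend, stored in snapshot["summary"]["by_backend"].items():
        for m in METRICS:
            got = recomputed.get(backend, {}).get(m)
            want = stored.get(f"avg_{m}")
            if got is None or want is None or not math.isclose(got, want, abs_tol=tol):
                errors.append(f"{backend}.avg_{m}: stored={want} recomputed={got}")
    return errors


# Toy snapshot with made-up numbers, for illustration only.
toy = {
    "cells": [
        {"backend": "demo", "calls": 1, "wall_s": 0.5, "tokens": 100},
        {"backend": "demo", "calls": 3, "wall_s": 1.5, "tokens": 300},
    ],
    "summary": {"by_backend": {"demo": {"avg_calls": 2.0, "avg_wall_s": 1.0, "avg_tokens": 200.0}}},
}
print(verify(toy))
```

Returning a list of mismatches instead of raising on the first one lets a caller print every discrepancy before exiting non-zero.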

Test plan

  • python3 benchmarks/swe-lite/replay.py exits 0 (summary
    matches recomputed cells)
  • python3 benchmarks/swe-lite/replay.py --json emits valid JSON
  • Codegraph re-verified at v0.9.3 (released 2026-05-22); file
    lists byte-identical to v0.7.10
  • Reviewer eyeball of RESULTS.md framing, hypothesis statement,
    and measurement caveat

🤖 Generated with Claude Code

justrach and others added 3 commits May 22, 2026 20:01
…s, deterministic oracle

Publishes the SWE-bench Lite file-localization view of the
codedb-vs-peers shootout. Complements `benchmarks/search-shootout/`
(hand-authored React tasks + Claude-as-judge) with a verifiable
ground-truth view: gold = the file the merged upstream PR patched,
oracle = deterministic file-path match.

Instances: pallets__flask-4045, psf__requests-2148, psf__requests-2674,
mwaskom__seaborn-2848. Backends: codedb (CLI), codedb_CONTEXT (MCP
composer), leanctx, fts5_trigram.

Headline: all four backends recall the gold file (4/4). Top-1 splits
at one task: fts5_trigram scores 4/4 and the other three 3/4 (the
seaborn axisgrid/_oldcore call-trace ordering). codedb_CONTEXT is the
cheapest backend on every efficiency axis and dominates the other two
3/4 backends: 2.25 calls / 1.25s / 14.7k tokens vs 9.75-26.75 calls and
24.75-42s for the rest. Only fts5_trigram, with its higher top-1, is
not dominated by it.

The accompanying RESULTS.md flags the deployment-shape caveat that
caused an earlier CLI-only read to misrepresent codedb's efficiency:
when a tool has multiple deployment surfaces, the bench has to
compare primary-against-primary, not a side surface against peers'
primaries.

Caveats are spelled out in RESULTS.md (n=4 is a sanity check, not a
statistic; file-localization ≠ patch-correctness; replay-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two backends to the matrix:

- `codegraph` — primitive `codegraph query` surface, driven by a
  fixed 3-query plan per task (subprocess only, no LLM loop).
  4/4 recall, 3/4 top-1, 3.00 avg calls, 0.17s wall.
- `codegraph_CONTEXT` — task-shaped `codegraph context` composer,
  single call per task. 2/4 recall, 2/4 top-1 — misses both
  `requests` tasks by surfacing urllib3 internals over the
  requests-layer wrapper where the gold patch actually lands.

Codegraph rows are explicitly annotated `measurement: tool_output_only`
in `results.json`. `replay.py` marks them with `*` in the table and
prints a footnote: subprocess time + stdout bytes/4, NOT a full
LLM-driven agent loop, so the efficiency cells are not directly
comparable to the other rows. Quality cells (recall, top-1) ARE
directly comparable.
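
As a rough illustration of what a `tool_output_only` measurement involves, the sketch below times a single subprocess call and estimates tokens as stdout bytes divided by 4. The function name and the returned field names are assumptions, as is the choice to floor the division. The Python interpreter stands in for the tool under test, so no codegraph flags are invented here.

```python
import subprocess
import sys
import time


def measure_tool_output(argv: list[str]) -> dict:
    """Time one subprocess call and estimate tokens from its stdout size."""
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, check=False)
    wall_s = time.perf_counter() - start
    return {
        "wall_s": wall_s,
        "tokens_est": len(proc.stdout) // 4,  # bytes/4 heuristic; flooring is an assumption
        "measurement": "tool_output_only",
    }


# Stand-in command that writes exactly 400 bytes to stdout.
result = measure_tool_output(
    [sys.executable, "-c", "import sys; sys.stdout.write('x' * 400)"]
)
print(result["tokens_est"], result["measurement"])
```

A measurement like this captures only what the tool itself emits. An agent loop also pays for prompts, intermediate reasoning, and repeated calls, which is why these efficiency cells are not comparable to the other rows.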

Reframes RESULTS.md as a hypothesis snapshot rather than a
dominance claim: small sample, mixed measurement methodology, and
the doc now ends in a stated hypothesis (codedb_CONTEXT is the
cheapest backend in the 3/4-top-1 cluster; codegraph primitive
would likely join that cluster under matched methodology) along
with the falsification path (live runner, more tasks, patch
oracle).

README updated to match. Verify still passes: `python3 replay.py`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Upgraded codegraph 0.7.10 -> 0.9.3 (five minor versions of drift on
the tool we're benchmarking — unfair to measure stale). Re-indexed
all 4 corpora and re-ran both surfaces.

Result: file lists are byte-identical to v0.7.10 on all 4 tasks ×
both surfaces. Wall times within normal variance. The quality
picture in RESULTS.md is robust to the version bump.

Adds `backend_versions` to results.json metadata and a one-line note
near the top of RESULTS.md so future readers know which codegraph
version produced the numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05952b4dd2


def print_table(snapshot: dict, recomputed: dict) -> None:
    backends = snapshot["backends"]
    measurement = {
        b: snapshot["summary"]["by_backend"][b].get("measurement")

P2: Guard missing backend summaries before table rendering

The non-JSON path can crash with a KeyError instead of reporting a verification failure when results.json is hand-edited and a backend is listed in backends but missing from summary.by_backend (or missing cells). print_table indexes snapshot["summary"]["by_backend"][b] directly, and main calls print_table before acting on errors, so malformed snapshots produce a traceback rather than the intended "VERIFY FAILED" diagnostics.
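
One way to address this, sketched under assumptions about the snapshot layout, is to validate coverage before anything indexes into the summary. The helper name `check_summary_coverage`, the shape of `cells` (a list of dicts carrying a `backend` key), and this simplified `main` are all hypothetical, not the actual code in `replay.py`.

```python
def check_summary_coverage(snapshot: dict) -> list[str]:
    """Report listed backends that lack a summary entry or any raw cells."""
    by_backend = snapshot.get("summary", {}).get("by_backend", {})
    cell_backends = {c.get("backend") for c in snapshot.get("cells", [])}
    errors = []
    for b in snapshot.get("backends", []):
        if b not in by_backend:
            errors.append(f"backend {b!r} missing from summary.by_backend")
        if b not in cell_backends:
            errors.append(f"backend {b!r} has no cells")
    return errors


def main(snapshot: dict) -> int:
    """Fail with diagnostics before any table rendering is attempted."""
    errors = check_summary_coverage(snapshot)
    if errors:
        print("VERIFY FAILED")
        for err in errors:
            print(f"  - {err}")
        return 1
    # Only at this point is it safe to index by_backend[b] for rendering.
    return 0


# Hand-edited snapshot: "b" is listed but has no summary entry.
broken = {
    "backends": ["a", "b"],
    "summary": {"by_backend": {"a": {}}},
    "cells": [{"backend": "a"}, {"backend": "b"}],
}
exit_code = main(broken)
```

Running the structural check first means a malformed snapshot produces a diagnostic and a non-zero exit code, not a `KeyError` traceback.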


@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool             Base (ns)  Head (ns)  Delta    Abs Delta (ns)  Status
---------------  ---------  ---------  -------  --------------  ------
codedb_bundle    526626     526664     +0.01%   +38             OK
codedb_changes   54541      59401      +8.91%   +4860           OK
codedb_deps      8828       10050      +13.84%  +1222           NOISE
codedb_edit      7090       7929       +11.83%  +839            NOISE
codedb_find      60736      62953      +3.65%   +2217           OK
codedb_hot       104705     99147      -5.31%   -5558           OK
codedb_outline   302949     313055     +3.34%   +10106          OK
codedb_read      107450     107421     -0.03%   -29             OK
codedb_search    150034     156574     +4.36%   +6540           OK
codedb_snapshot  292078     290140     -0.66%   -1938           OK
codedb_status    12806      15250      +19.08%  +2444           NOISE
codedb_symbol    61897      62587      +1.11%   +690            OK
codedb_tree      63830      59222      -7.22%   -4608           OK
codedb_word      85217      82017      -3.76%   -3200           OK
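
The two-threshold rule behind the Status column can be sketched as below. This is an assumption-laden reading of the report, not the CI script itself: the `FAIL` label, the treatment of speedups as always `OK`, and the inclusive comparisons are guesses. Only the 10% and 50,000 ns thresholds and the `OK`/`NOISE` labels come from the report.

```python
PCT_THRESHOLD = 10.0       # percent, from the report header
ABS_THRESHOLD_NS = 50_000  # nanoseconds, from the report header


def classify(base_ns: int, head_ns: int) -> str:
    """Classify a benchmark delta as OK, NOISE, or FAIL (label assumed)."""
    delta_ns = head_ns - base_ns
    pct = delta_ns / base_ns * 100.0
    if pct <= PCT_THRESHOLD:
        return "OK"      # within the percentage budget, or faster than base
    if delta_ns <= ABS_THRESHOLD_NS:
        return "NOISE"   # large percentage, but too small in absolute terms
    return "FAIL"        # both thresholds exceeded


# Rows taken from the table above.
print(classify(8828, 10050))      # codedb_deps
print(classify(526626, 526664))   # codedb_bundle
print(classify(104705, 99147))    # codedb_hot
```

Requiring both thresholds keeps very fast tools such as codedb_status from failing CI over a shift of a few microseconds.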
