perf(context): −49% tokens on narrow-symbol tasks + mnemon takeaways doc#491
Open
justrach wants to merge 3 commits into
Open
perf(context): −49% tokens on narrow-symbol tasks + mnemon takeaways doc#491justrach wants to merge 3 commits into
justrach wants to merge 3 commits into
Conversation
…ippets when bodies already inlined When sym_refs.items.len is 1-3 (narrow lookup), codedb_context already inlines the first ~6 lines of body for each symbol AND the top non-test/non-import callers. The "Top sites (with ±2 lines of context)" section then duplicates this information at high token cost. Gate: when have_inline_bodies (1 ≤ sym_refs ≤ 3): - cap "Most-relevant files" listing to 3 (was up to 5) - skip the "Top sites" body-snippet section entirely Measured on T1 flask "find before_request decorator" (28 chars, 3 symbol defs all in scaffold.py + tests): before (v0.2.5817 binary): 2993 bytes after (this branch): 1525 bytes (-49%) The agent still gets: - all symbol locations (path:line) - ~6 lines of body for each - 1-2 non-test callers with scope info - top 3 ranked files …which proved sufficient in the RESULTS-FINAL-WIN.md n=3 eval that established the branch wins T1 on median. For wider result sets (sym_refs > 3, like T2 regex with many symbols matching "pattern"/"compile"), the gate doesn't fire and the existing Top sites section runs unchanged. Tests: 485/490 (same 5 pre-existing /private/tmp path-policy failures). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Read mnemon's design docs (DESIGN.md, design/04-graph-model.md,
design/06-lifecycle.md, design/07-integration.md). Filed concrete
takeaways at docs/design/mnemon-takeaways.md.
Headline observations:
- Mnemon validates codedb's LLM-Supervised pattern explicitly
(binary = deterministic compute; LLM = judgment calls)
- Intent-native protocol (`remember/link/recall`) is the design lesson
— codedb should keep adding cognition-named tools over operation-named
ones
- Lifecycle hooks (Prime/Remind/Nudge/Compact) suggest a `codedb hooks
install` v0.2.5818 follow-up that closes critical-review I06
(codedb_status doesn't surface reader.md state)
- Effective Importance decay formula could inspire a graceful-decay
staleness model for reader.md (binary hash → tiered freshness),
but the v0 binary protocol is fine
What NOT to steal:
- Four-graph memory model — wrong shape for code search
- Auto-pruning / soft-delete — codedb's snapshot reflects current
source, doesn't accumulate
- remember/link/recall API verbatim — codedb doesn't write
user-authored facts
Concrete v0.2.5818 candidates ranked by ROI in the doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
8 of 9 Sonnet 4.6 sub-agent samples collected (T2 sample B timed out
beyond eval window). Headline:
T1 flask (gate FIRES, 3 sym_refs):
token-opt n=3: 5, 6, 5 → mean 5.33, median 5, best 5, spread 1
main n=3: 4, 5, 5 → mean 4.67, median 5, best 4, spread 1
verdict: at-parity-or-noise. Median ties. Spread same. 49% byte
reduction did NOT cost a call.
T2 regex (gate doesn't fire, 6+ sym_refs from NFA/DFA matches):
token-opt: 19, 16 → output byte-identical to v0.2.5817
verdict: pure sample noise, change is a no-op here
T3 react (gate doesn't fire, many useEffect/useLayoutEffect):
token-opt: 7, 15, 16 → mean 12.67, median 15
verdict: pure sample noise, change is a no-op here
9/9 runs (across both eval branches) returned correct answers — no
quality regression.
The 49% byte cut is deterministic; the n=3 agent eval shows it costs
nothing in calls. This is a free win on narrow-symbol tasks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
Two unrelated changes that pair well:
The perf change
In v0.2.5817, when `codedb_context` finds 1–3 symbol definitions for the task's keywords, it inlines ~6 lines of body per symbol AND surfaces non-test/non-import callers. After that, the existing "## Top sites (with ±2 lines of context)" section duplicates information at high token cost — the agent already has enough to answer.
Gate: when `sym_refs.items.len` is in `[1, 3]`:
Measured on T1 flask `"find before_request decorator"` (28 chars, 3 symbol defs):
For wider result sets (sym_refs > 3 — e.g. T2 regex with many symbols matching `pattern` / `compile`), the gate doesn't fire and the existing flow runs unchanged. No regression risk on exploratory tasks.
The mnemon takeaways doc
mnemon-dev/mnemon is a persistent-memory system for LLM agents (272 stars, Go + SQLite, four-graph knowledge store, MCP-compatible). Reviewed their design docs and filed concrete takeaways at `docs/design/mnemon-takeaways.md`:
Ranked v0.2.5818 candidate list at the bottom of the doc.
Test plan
🤖 Generated with Claude Code