
fix(explore): skip-trigram files compete in Tier 1; share recall pipeline (#447, #451) #492

Open
justrach wants to merge 2 commits into main from issue-447-failing-test


Summary

  • Skip-trigram files (>64KB) with word-index hits now compete in Tier 1 ranking, so the canonical definition site for a common identifier no longer gets starved when small-file noise fills the quota first.
  • searchContentWithScope is rewritten as a thin annotator over searchContent (via the new searchContentLocked). This eliminates the divergent second pipeline, which is why the fix for #447 (explore: searchContent invisibility for canonical definition sites in files >64KB) never reached codedb_callers or scope=true search.
  • Tier 0 per-file divisor is capped at min(5, distinct-hit-files) so a single canonical file is no longer rationed to 1/5 of max_results.
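
To make the divisor arithmetic concrete, here is a minimal sketch in Zig. `tier0PerFileCap` is a hypothetical helper, not the code in src/explore.zig, and the floor of 1 on the divisor is an assumption added here so that zero hit-files does not divide by zero.

```zig
const std = @import("std");

/// Hypothetical helper: per-file result budget for Tier 0.
/// The divisor is min(5, distinct_hit_files), floored at 1 (assumption).
fn tier0PerFileCap(max_results: usize, distinct_hit_files: usize) usize {
    const capped: usize = @min(distinct_hit_files, 5);
    const divisor: usize = @max(capped, 1);
    return max_results / divisor;
}

pub fn main() void {
    // One canonical hit-file: the whole budget is available to it.
    std.debug.print("1 file:  {d}\n", .{tier0PerFileCap(27, 1)});
    // Many hit-files: the divisor stays at 5, as before the change.
    std.debug.print("8 files: {d}\n", .{tier0PerFileCap(27, 8)});
}
```

With `max_results = 27` and a single hit-file, a fixed divisor of 5 would allow only 5 results from that file; the capped divisor allows all 27.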

Issues

Closes #451 (scope=true / callers invisibility).
Reaffirms #447 fix and prevents regression of the same bug class.

Test plan

  • zig build test — both issue-447 and the new issue-451 test pass.
  • issue-451 test asserts the canonical large file is reachable via searchContentWithScope AND that the scope annotation (scope_name == "canonical") is correctly attached.
  • Manual: run codedb search Explorer --scope against this repo and confirm src/explore.zig appears.
  • Manual: codedb callers Explorer against this repo.

🤖 Generated with Claude Code

justrach and others added 2 commits May 7, 2026 12:26
…sible to search

watcher.zig:446 forces skip_trigram=true for files >64KB. Such files
are NOT in self.trigram_index — they live in self.skip_trigram_files
and are only reached via Tier 3 of searchContent, which runs AFTER
Tier 1 (trigram candidates) fills the result quota.

For a common identifier with a wide spread of incidental mentions in
small files plus a canonical definition site in a large source file,
Tier 1 saturates with small-file hits and Tier 3 never runs — so the
canonical large file is completely invisible from search results.
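
The starvation can be illustrated with a deliberately simplified model that only looks at Tier 1 and Tier 3 sharing one quota. `fillQuota` and `TierFill` are hypothetical names, and the Tier 1 hit count of 40 is an illustrative assumption; only the quota of 27 and the 85 large-file hits come from the repro.

```zig
const std = @import("std");

/// Hypothetical summary of how many results each tier contributed.
const TierFill = struct { tier1: usize, tier3: usize };

/// Simplified model: Tier 1 runs first and takes what it can;
/// Tier 3 only gets whatever quota is left over.
fn fillQuota(tier1_hits: usize, tier3_hits: usize, max_results: usize) TierFill {
    const from_tier1: usize = @min(tier1_hits, max_results);
    const remaining = max_results - from_tier1;
    const from_tier3: usize = @min(tier3_hits, remaining);
    return .{ .tier1 = from_tier1, .tier3 = from_tier3 };
}

pub fn main() void {
    // 40 small-file hits (assumed) vs. 85 hits in the large file, quota 27.
    const fill = fillQuota(40, 85, 27);
    std.debug.print("tier1={d} tier3={d}\n", .{ fill.tier1, fill.tier3 });
}
```

In this model Tier 1 consumes all 27 slots and Tier 3 contributes nothing, regardless of how many hits the large file has.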

Real-world repro on this very repo:
  $ codedb search Explorer --max-results 27
  ✓ 27 results — none from src/explore.zig
  (where `pub const Explorer = struct` lives at line 495 and there
   are 85 word-index hits for the term in that file)

Fails on main; will pass once large-file Tier 1 priority is fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…line (#447, #451)

`searchContent` Tier 0 used `max_results/5` as a per-file cap that starved
the canonical large file when fewer than 5 hit-files existed. Tier 1 only
walked trigram candidates, so files indexed with `skip_trigram=true`
(content >64KB, watcher.zig:446) could only surface via Tier 3 — which
never ran when small-file noise filled the quota first.

`searchContentWithScope` duplicated the entire pipeline, so #447's fix
landed in plain search while scope=true and `codedb_callers` stayed
stale.

Fix:
- Promote skip-trigram files with word-index hits into Tier 1's ranked
  candidate set, sorted by (hit-count desc, file-len asc).
- Cap Tier 0 per-file divisor at min(5, distinct-hit-files) so a single
  canonical file is no longer rationed to 1/5 of max_results.
- Refactor `searchContent` into `searchContentLocked`; rewrite
  `searchContentWithScope` to call it and annotate scope, eliminating
  the divergent second pipeline.
- Add issue-451 test covering scope=true recall of skip-trigram files.
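
The ordering named in the first bullet can be sketched as follows. This is an illustration under assumptions, not the code in src/explore.zig: `Candidate`, `candidateLessThan`, and `rankCandidates` are hypothetical names, and the small-file paths and all file lengths are made up for the example.

```zig
const std = @import("std");

/// Hypothetical candidate record for Tier 1 ranking.
const Candidate = struct {
    path: []const u8,
    hit_count: usize,
    file_len: usize,
};

/// Orders by hit count descending, then by file length ascending.
fn candidateLessThan(_: void, a: Candidate, b: Candidate) bool {
    if (a.hit_count != b.hit_count) return a.hit_count > b.hit_count;
    return a.file_len < b.file_len;
}

fn rankCandidates(items: []Candidate) void {
    std.mem.sort(Candidate, items, {}, candidateLessThan);
}

pub fn main() void {
    var items = [_]Candidate{
        .{ .path = "src/small_a.zig", .hit_count = 2, .file_len = 1_200 },
        .{ .path = "src/explore.zig", .hit_count = 85, .file_len = 200_000 },
        .{ .path = "src/small_b.zig", .hit_count = 2, .file_len = 800 },
    };
    rankCandidates(&items);
    for (items) |c| {
        std.debug.print("{s} hits={d} len={d}\n", .{ c.path, c.hit_count, c.file_len });
    }
}
```

Under this ordering the large file with 85 hits ranks first despite its size, and the two small files tie on hit count and fall back to the shorter one first.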

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06381d7914

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/explore.zig
Comment on lines +3898 to +3901
const path_copy = try allocator.dupe(u8, r.path);
errdefer allocator.free(path_copy);
const line_text = try allocator.dupe(u8, r.line_text);
errdefer allocator.free(line_text);

[P2] Avoid duplicating scoped search payloads after shared search

searchContentWithScope now calls searchContentLocked (which already allocates path and line_text for each hit) and then immediately duplicates both fields again before freeing the originals. For large max_results (up to 10,000 via MCP) this doubles per-hit allocation/copy work in a hot query path (codedb_callers/scoped search), which is a measurable regression risk for latency and memory churn compared with transferring ownership of the existing slices.
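
A minimal sketch of the ownership transfer suggested above, assuming the plain and scoped result types share `path` and `line_text` fields. `SearchResult`, `ScopedResult`, and `annotate` are hypothetical names chosen for illustration; the actual types in src/explore.zig may differ.

```zig
const std = @import("std");

/// Hypothetical plain result; both slices are heap-allocated and owned.
const SearchResult = struct { path: []const u8, line_text: []const u8 };

/// Hypothetical scoped result; takes over the slices from SearchResult.
const ScopedResult = struct {
    path: []const u8,
    line_text: []const u8,
    scope_name: ?[]const u8,
};

/// Moves the slices into the scoped result instead of duplicating them.
/// The caller must no longer free `r.path` or `r.line_text` afterwards.
fn annotate(r: SearchResult, scope_name: ?[]const u8) ScopedResult {
    return .{ .path = r.path, .line_text = r.line_text, .scope_name = scope_name };
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const path = try allocator.dupe(u8, "src/explore.zig");
    errdefer allocator.free(path);
    const line_text = try allocator.dupe(u8, "pub const Explorer = struct {");

    // No second dupe: the scoped result now owns both slices.
    const scoped = annotate(.{ .path = path, .line_text = line_text }, "canonical");
    defer allocator.free(scoped.path);
    defer allocator.free(scoped.line_text);

    std.debug.print("{s}: {s}\n", .{ scoped.path, scoped.line_text });
}
```

The trade-off is that the intermediate result list must then be released without freeing its elements, so the cleanup path has to distinguish moved entries from ones still owned by the list.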


Development

Successfully merging this pull request may close these issues.

explore: scope=true search has same skip-trigram invisibility as searchContent
