feat(memory): add TAU-2 trajectory-view treatment #2017
Open
huangruiteng wants to merge 15 commits into
added 5 commits
May 13, 2026 18:31
9cfe362 to
c2228a2
Compare
c2228a2 to
2b767f2
Compare
MaojiaSheng
approved these changes
May 15, 2026
Status
Follow-up to #2003. This PR is the minimal OpenViking-side step for the TAU-2 fine-grained trajectory-memory line: turn raw trajectory extraction into a reusable procedure-like trajectory view, then expose an explicit TAU-2 trajectory retrieval route for first-user and pre-write evaluation.
The headline read is that the trajectory-view route is directionally useful, especially when memory is injected at the pre-write decision node. The strongest current OV-native read is still split by domain: PR-B is strongest on retail, while category rerank in #2044 is strongest on airline.
This PR intentionally stays small. It does not include category rerank or the full Harness diagnostic stack; those are tracked separately in #2044.
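As a rough illustration of the two injection points described above (first-user and pre-write), the sketch below models a treatment as a name, a memory source, and a set of hooks. The treatment names, `search_memory_type`, and `before_write_tool_call` come from this PR description; the `Treatment` class, the `first_user` hook label, and the helper functions are hypothetical and do not reflect the actual schema of `benchmark/tau2/config/trajectory.yaml`.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption: each memory type maps to one directory under memories/.
MEMORY_ROOTS = {
    "experiences": "memories/experiences",
    "trajectories": "memories/trajectories",
}


@dataclass(frozen=True)
class Treatment:
    """Hypothetical in-memory view of one treatment entry."""

    name: str
    search_memory_type: str
    inject_at: tuple[str, ...]

    def search_root(self) -> str:
        # Reject unknown memory types instead of silently falling back.
        if self.search_memory_type not in MEMORY_ROOTS:
            raise ValueError(
                f"unknown search_memory_type: {self.search_memory_type!r}"
            )
        return MEMORY_ROOTS[self.search_memory_type]


TREATMENTS = {
    t.name: t
    for t in (
        # First-user retrieval only.
        Treatment("memory_v2_trajectory_view", "trajectories", ("first_user",)),
        # First-user retrieval plus retrieval before a write tool call.
        Treatment(
            "memory_v2_trajectory_prewrite",
            "trajectories",
            ("first_user", "before_write_tool_call"),
        ),
    )
}


def should_inject(treatment_name: str, hook: str) -> bool:
    """Return True if the named treatment injects memory at this hook."""
    return hook in TREATMENTS[treatment_name].inject_at
```

Under this model, switching a treatment between `experiences` and `trajectories` only changes the value of `search_memory_type`; the lookup code stays the same.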
What This Changes
- `benchmark/tau2/config/trajectory.yaml` with explicit trajectory treatments:
  - `memory_v2_trajectory_view`: first-user retrieval from `memories/trajectories`
  - `memory_v2_trajectory_prewrite`: first-user + pre-write retrieval from `memories/trajectories`
- `search_memory_type`, so TAU-2 configs can select `experiences` or `trajectories` without changing the adapter path.
Evidence Details
The latest read uses reasoning-high, fixed-first-user, retail + airline held-out test, and 8 repeats per domain.
`Domain avg` is the simple mean of retail and airline; task-weighted totals remain available in the run scoreboards.
Core Routes
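When reading the `Domain avg` figure, note that a simple mean and a task-weighted total can differ whenever the domains have different task counts. The sketch below shows the arithmetic only; the rates and task counts are placeholder values chosen for easy mental math, not results or split sizes from this PR, and the function names are hypothetical.

```python
def domain_avg(rates: dict[str, float]) -> float:
    """Simple mean across domains: every domain counts equally."""
    return sum(rates.values()) / len(rates)


def task_weighted(rates: dict[str, float], task_counts: dict[str, int]) -> float:
    """Mean weighted by the number of tasks in each domain."""
    total_tasks = sum(task_counts[d] for d in rates)
    return sum(rates[d] * task_counts[d] for d in rates) / total_tasks


# Placeholder inputs, NOT numbers from the run scoreboards.
rates = {"retail": 0.5, "airline": 0.25}
task_counts = {"retail": 30, "airline": 10}

print(domain_avg(rates))                  # 0.375
print(task_weighted(rates, task_counts))  # 0.4375
```

With these placeholder inputs the larger domain pulls the task-weighted figure toward its own rate, which is why both views are worth checking.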
Ablations
`memories/experiences` top2
`memories/experiences` top2
Interpretation:
Validation
- `git diff --check`
- `uvx ruff format --check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/tau2_common.py openviking/session/memory/agent_trajectory_context_provider.py`
- `uvx ruff check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/tau2_common.py openviking/session/memory/agent_trajectory_context_provider.py`
- `python3 -m py_compile benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/tau2_common.py openviking/session/memory/agent_trajectory_context_provider.py`
Earlier route smokes also confirmed:
- `memory_v2_trajectory_view`: non-empty trajectory corpus probe and first-user trace injected from `memories/trajectories`.
- `memory_v2_trajectory_prewrite`: non-empty trajectory corpus probe and both first-user / `before_write_tool_call` traces injected from `memories/trajectories`.
Thanks @yangxinxin-7 for the TAU-2 benchmark scaffold in #2003; this PR builds on that workflow rather than adding a separate eval path.