
Commit 00b5edc

christso and claude committed
feat: workspace lifecycle hooks + cross-repo-sync showcase
Replace setup/teardown with before_all/after_all/before_each/after_each lifecycle hooks (bun:test/Vitest naming). Shared workspace across tests in a suite with after_each reset. Remove workspaceFingerprint (YAGNI). Add cross-repo-sync showcase demonstrating the full workspace config surface with real ground truth diffs from agentevals commit history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 438bd5d commit 00b5edc

29 files changed

Lines changed: 3374 additions & 343 deletions

apps/web/src/content/docs/evaluation/eval-cases.mdx

Lines changed: 8 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -29,7 +29,7 @@ tests:
2929
| `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_output` |
3030
| `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) |
3131
| `workspace` | No | Per-case workspace config (overrides suite-level) |
32-
| `metadata` | No | Arbitrary key-value pairs passed to setup/teardown scripts |
32+
| `metadata` | No | Arbitrary key-value pairs passed to lifecycle scripts |
3333
| `rubrics` | No | Structured evaluation criteria |
3434
| `assert` | No | Per-test evaluators |
3535
| `sidecar` | No | Additional metadata passed to evaluators |
@@ -117,28 +117,28 @@ Override the suite-level workspace config for individual tests. Test-level field
117117

118118
```yaml
119119
workspace:
120-
setup:
120+
before_all:
121121
script: ["bun", "run", "default-setup.ts"]
122122
123123
tests:
124124
- id: case-1
125125
criteria: Should work
126126
input: Do something
127127
workspace:
128-
setup:
128+
before_all:
129129
script: ["bun", "run", "custom-setup.ts"]
130130
131131
- id: case-2
132132
criteria: Should also work
133133
input: Do something else
134-
# Inherits suite-level setup
134+
# Inherits suite-level before_all
135135
```
136136

137-
See [Workspace Setup/Teardown](/targets/configuration/#workspace-setupteardown) for the full workspace config reference.
137+
See [Workspace Lifecycle Hooks](/targets/configuration/#workspace-lifecycle-hooks) for the full workspace config reference.
138138

139139
## Per-Case Metadata
140140

141-
Pass arbitrary key-value pairs to setup/teardown scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:
141+
Pass arbitrary key-value pairs to lifecycle scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:
142142

143143
```yaml
144144
tests:
@@ -149,11 +149,11 @@ tests:
149149
repo: sympy/sympy
150150
base_commit: "abc123def"
151151
workspace:
152-
setup:
152+
before_all:
153153
script: ["python", "checkout_repo.py"]
154154
```
155155

156-
The `metadata` field is included in the stdin JSON passed to setup and teardown scripts as `case_metadata`.
156+
The `metadata` field is included in the stdin JSON passed to lifecycle scripts as `case_metadata`.
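
A lifecycle script could pick the per-case values out of that payload roughly as follows. This is a minimal sketch, not code from the commit: only the `case_metadata` key comes from the docs above, while `readCaseMetadata`, the `CaseMetadata` interface, and the assumption that `repo` and `base_commit` arrive as strings are illustrative.

```typescript
// Hypothetical helper for a lifecycle script (e.g. a Bun setup script).
// Only the `case_metadata` key is documented; all other names are assumptions.
interface CaseMetadata {
  repo: string;
  base_commit: string;
}

// Parse the stdin payload text and extract the per-case metadata fields.
function readCaseMetadata(stdinText: string): CaseMetadata {
  const payload: unknown = JSON.parse(stdinText);
  if (typeof payload !== "object" || payload === null) {
    throw new Error("expected a JSON object on stdin");
  }
  const meta = (payload as Record<string, unknown>)["case_metadata"];
  if (typeof meta !== "object" || meta === null) {
    throw new Error("stdin JSON has no case_metadata object");
  }
  const { repo, base_commit } = meta as Record<string, unknown>;
  if (typeof repo !== "string" || typeof base_commit !== "string") {
    throw new Error("case_metadata must contain string repo and base_commit");
  }
  return { repo, base_commit };
}

// Example payload mirroring the YAML above (other stdin fields omitted).
const examplePayload = JSON.stringify({
  case_metadata: { repo: "sympy/sympy", base_commit: "abc123def" },
});
const parsedMetadata = readCaseMetadata(examplePayload);
```

In a real script the text would come from stdin rather than a literal. Failing loudly on a missing field seems preferable here, since a failing `before_*` hook is reported as an error result rather than silently producing a wrong checkout.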
157157

158158
## Per-Test Assertions
159159

apps/web/src/content/docs/evaluation/eval-files.mdx

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@ tests:
3535
| `description` | Human-readable description of the evaluation |
3636
| `dataset` | Optional dataset identifier |
3737
| `execution` | Default execution config (for example `target`) |
38-
| `workspace` | Suite-level workspace config (setup/teardown scripts, template) |
38+
| `workspace` | Suite-level workspace config (lifecycle hooks, template) |
3939
| `tests` | Array of individual tests, or a string path to an external file |
4040
| `assert` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
4141

apps/web/src/content/docs/targets/configuration.mdx

Lines changed: 21 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -99,47 +99,54 @@ targets:
9999
```
100100

101101
When `workspace_template` is set:
102-
- The template directory is copied to `~/.agentv/workspaces/<eval-run-id>/<test-id>/`
102+
- The template directory is copied to `~/.agentv/workspaces/<eval-run-id>/shared/`
103103
- The `.git` directory is skipped during copy
104-
- Each test gets its own isolated copy
104+
- Tests share the workspace; use `after_each` to reset state between tests
105105

106-
### Workspace Setup/Teardown
106+
### Workspace Lifecycle Hooks
107107

108-
Run scripts before and after each test using the `workspace` block. This can be defined at the suite level (applies to all tests) or per test (overrides suite-level).
108+
Run scripts at different points in the evaluation lifecycle using the `workspace` block. This can be defined at the suite level (applies to all tests) or per test (overrides suite-level).
109109

110110
```yaml
111111
workspace:
112112
template: ./workspace-templates/my-project
113-
setup:
113+
before_all:
114114
script: ["bun", "run", "setup.ts"]
115115
timeout_ms: 120000
116116
cwd: ./scripts
117-
teardown:
118-
script: ["bun", "run", "teardown.ts"]
117+
after_each:
118+
script: ["bun", "run", "reset.ts"]
119+
timeout_ms: 5000
120+
after_all:
121+
script: ["bun", "run", "cleanup.ts"]
119122
timeout_ms: 30000
120123
```
121124

122125
| Field | Description |
123126
|-------|-------------|
124127
| `template` | Directory to copy as workspace (alternative to target-level `workspace_template`) |
125-
| `setup` | Script to run after workspace creation, before the agent runs |
126-
| `teardown` | Script to run after evaluation, before cleanup |
128+
| `before_all` | Runs once after workspace creation, before the first test |
129+
| `after_all` | Runs once after the last test, before cleanup |
130+
| `before_each` | Runs before each test |
131+
| `after_each` | Runs after each test (e.g., reset workspace state for reuse) |
127132

128133
Each script config accepts:
129134

130135
| Field | Description |
131136
|-------|-------------|
132137
| `script` | Command array (e.g., `["bun", "run", "setup.ts"]`) |
133-
| `timeout_ms` | Timeout in milliseconds (default: 60000 for setup, 30000 for teardown) |
138+
| `timeout_ms` | Timeout in milliseconds (default: 60000 for `before_all`, 30000 for others) |
134139
| `cwd` | Working directory (relative paths resolved against eval file directory) |
135140

136-
**Lifecycle order:** template copy → setup script → git baseline → agent runs → file changes captured → teardown script → cleanup
141+
**Lifecycle order:** template copy → `before_all` → git baseline → (`before_each` → agent runs → file changes captured → `after_each`) × N tests → `after_all` → cleanup
142+
143+
**Shared workspace:** The workspace is created once and shared across all tests in a suite. Use `after_each` to reset state between tests (e.g., `git checkout . && git clean -fd`).
137144

138145
**Error handling:**
139-
- Setup failure aborts the test with an error result
140-
- Teardown failure is non-fatal (warning only)
146+
- `before_all` / `before_each` failure aborts the test with an error result
147+
- `after_all` / `after_each` failure is non-fatal (warning only)
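
The lifecycle order and error-handling rules above can be sketched as a small in-memory runner. This is an illustration of the documented ordering only, not AgentV's implementation: `runSuite`, `Hooks`, and `SuiteResult` are hypothetical names, hooks are modeled as plain functions instead of spawned scripts, and three behaviors are assumptions the docs do not spell out (a `before_all` failure marks every test as errored, `after_each` is skipped when `before_each` fails, and `after_all` still runs after a `before_all` failure).

```typescript
// Sketch of the documented hook ordering; not AgentV's actual runner.
type Hook = () => void;

interface Hooks {
  before_all?: Hook;
  before_each?: Hook;
  after_each?: Hook;
  after_all?: Hook;
}

interface SuiteResult {
  log: string[];      // steps in the order they completed
  errors: string[];   // fatal: before_all / before_each failures
  warnings: string[]; // non-fatal: after_each / after_all failures
}

function runSuite(
  testIds: string[],
  hooks: Hooks,
  runAgent: (id: string) => void,
): SuiteResult {
  const result: SuiteResult = { log: [], errors: [], warnings: [] };

  // Run a hook if configured; return a failure message, or null on success.
  const call = (hookName: keyof Hooks, label: string): string | null => {
    const hook = hooks[hookName];
    if (hook === undefined) return null;
    try {
      hook();
      result.log.push(label);
      return null;
    } catch (err) {
      return `${label} failed: ${String(err)}`;
    }
  };

  result.log.push("template copy");
  const beforeAllError = call("before_all", "before_all");
  if (beforeAllError !== null) {
    // Assumption: every test is reported as errored when before_all fails.
    for (const id of testIds) result.errors.push(`${id}: ${beforeAllError}`);
  } else {
    result.log.push("git baseline");
    for (const id of testIds) {
      const beforeEachError = call("before_each", `before_each(${id})`);
      if (beforeEachError !== null) {
        result.errors.push(`${id}: ${beforeEachError}`);
        continue; // abort this test only; later tests still run
      }
      runAgent(id);
      result.log.push(`agent(${id})`, `capture file changes(${id})`);
      const afterEachWarning = call("after_each", `after_each(${id})`);
      if (afterEachWarning !== null) result.warnings.push(afterEachWarning);
    }
  }
  const afterAllWarning = call("after_all", "after_all");
  if (afterAllWarning !== null) result.warnings.push(afterAllWarning);
  result.log.push("cleanup");
  return result;
}

// Usage: two tests with every hook configured as a no-op.
const noopHook = (): void => {};
const happyPath = runSuite(
  ["a", "b"],
  { before_all: noopHook, before_each: noopHook, after_each: noopHook, after_all: noopHook },
  () => {},
);
```

With two tests, `happyPath.log` runs from `template copy` through `before_all` and `git baseline`, then the per-test group twice, then `after_all` and `cleanup`.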
141148

142-
**Script context:** Both scripts receive a JSON object on stdin with case context:
149+
**Script context:** All scripts receive a JSON object on stdin with case context:
143150

144151
```json
145152
{
@@ -153,10 +160,6 @@ Each script config accepts:
153160

154161
**Suite vs per-test:** When both are defined, test-level fields replace suite-level fields. See [Per-Test Workspace Config](/evaluation/eval-cases/#per-case-workspace-config) for examples.
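
One way to read "test-level fields replace suite-level fields" is as a shallow merge over the top-level `workspace` keys. The sketch below assumes exactly that: `resolveWorkspace` and the two interfaces are hypothetical, and whether AgentV merges nested keys such as `timeout_ms` inside a hook is not stated in the docs, so here a test-level hook replaces the suite-level hook object as a whole.

```typescript
// Hypothetical types mirroring the documented workspace fields.
interface ScriptConfig {
  script: string[];
  timeout_ms?: number;
  cwd?: string;
}

interface WorkspaceConfig {
  template?: string;
  before_all?: ScriptConfig;
  before_each?: ScriptConfig;
  after_each?: ScriptConfig;
  after_all?: ScriptConfig;
}

// Assumption: shallow merge, so any top-level key set on the test wins.
function resolveWorkspace(
  suite: WorkspaceConfig = {},
  test: WorkspaceConfig = {},
): WorkspaceConfig {
  return { ...suite, ...test };
}

// Mirrors the eval-cases example: case-1 overrides before_all, case-2 inherits.
const suiteWorkspace: WorkspaceConfig = {
  before_all: { script: ["bun", "run", "default-setup.ts"] },
};
const case1Workspace = resolveWorkspace(suiteWorkspace, {
  before_all: { script: ["bun", "run", "custom-setup.ts"] },
});
const case2Workspace = resolveWorkspace(suiteWorkspace);
```

Under this reading, a suite-level `after_each` would still apply to `case-1`, because only the keys the test defines are replaced.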
155162

156-
### Workspace Fingerprinting
157-
158-
After setup and git baseline initialization, AgentV computes a SHA-256 fingerprint of the workspace file tree. This fingerprint is included in the evaluation result as `workspaceFingerprint` and can be used to verify that workspaces are reproducible across runs.
159-
160163
### Cleanup Behavior
161164

162165
By default:
Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
targets:
2+
- name: mock_agent
3+
provider: cli
4+
command_template: bash mock-agent.sh {PROMPT} {OUTPUT_FILE}
5+
timeout_seconds: 30
Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
# Cross-Repo Sync Showcase
2+
3+
Evaluates whether a coding agent can keep two public repos in sync after one changes.
4+
5+
## Scenario
6+
7+
When **agentv** (EntityProcess/agentv) ships a feature, the **agentevals** (agentevals/agentevals) spec docs must be updated to reflect the change. This eval measures how well an agent handles that cross-repo synchronization.
8+
9+
## Workspace Features Demonstrated
10+
11+
| Feature | Usage |
12+
|---------|-------|
13+
| `workspace.template` | AGENTS.md + skills dir copied to workspace |
14+
| `workspace.before_each` | Clones agentevals at "before" state per test |
15+
| `workspace.after_each` | Resets git state between tests |
16+
| `metadata` | Commit SHAs passed to setup via stdin JSON |
17+
| `fileChanges` | Unified diff of agent's edits |
18+
19+
## Test Cases
20+
21+
1. **eval-spec-v2-sync** — Add 4 deterministic assert types + required gates
22+
2. **cases-to-tests-sync** — Rename `cases` → `tests` across spec docs
23+
3. **schema-field-rename-sync** — Rename `eval_cases` → `cases`, `expected_outcome` → `criteria`/`outcome`
24+
25+
## Running
26+
27+
```bash
28+
bun install
29+
bun agentv eval ./evals/dataset.eval.yaml
30+
```
31+
32+
## Structure
33+
34+
```
35+
├── evals/
36+
│ ├── dataset.eval.yaml # 3 test cases
37+
│ └── ground-truth/ # Real diffs from commit history
38+
├── workspace-template/
39+
│ ├── AGENTS.md # Multi-repo context
40+
│ └── skills/
41+
│ └── cross-repo-sync.md # Sync skill
42+
├── scripts/
43+
│ ├── setup.ts # before_each: clone repo
44+
│ ├── reset.ts # after_each: git reset
45+
│ └── validate-sync.ts # Code judge
46+
├── .agentv/
47+
│ └── targets.yaml # Mock CLI agent
48+
└── package.json
49+
```

examples/showcase/cross-repo-sync/bun.lock

Lines changed: 80 additions & 0 deletions
Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
name: cross-repo-sync
2+
description: Evaluate agent ability to sync agentv implementation with agentevals spec
3+
version: "1.0"
4+
tags: [showcase, workspace, cross-repo]
5+
6+
workspace:
7+
template: ../workspace-template
8+
before_each:
9+
script: ["bun", "run", "../scripts/setup.ts"]
10+
timeout_ms: 900000
11+
cwd: .
12+
after_each:
13+
script: ["bun", "run", "../scripts/reset.ts"]
14+
timeout_ms: 5000
15+
cwd: .
16+
17+
execution:
18+
target: mock_agent
19+
20+
tests:
21+
- id: eval-spec-v2-sync
22+
metadata:
23+
agentevals_before: "9f8aa3a"
24+
ground_truth: ../evals/ground-truth/eval-spec-v2.diff
25+
criteria: >-
26+
Update agentevals spec to reflect eval spec v2: add contains/regex/is_json/equals
27+
assert types, required gates for all evaluators, tests-as-string-path.
28+
input:
29+
- role: user
30+
content: |
31+
agentv just merged eval spec v2 (PR #262). Update the agentevals
32+
spec docs to reflect: 4 new deterministic assert types, required
33+
gates, assert field at test/suite level, tests-as-string-path.
34+
assert:
35+
- name: sync-check
36+
type: code_judge
37+
script: ["bun", "run", "../scripts/validate-sync.ts"]
38+
expected_files_modified:
39+
- agentevals/docs/src/content/docs/specification/evaluators.mdx
40+
- agentevals/docs/src/content/docs/specification/eval-format.mdx
41+
expected_keywords: [contains, regex, is_json, equals, required, assert]
42+
43+
- id: cases-to-tests-sync
44+
metadata:
45+
agentevals_before: "1aaa26f"
46+
ground_truth: ../evals/ground-truth/cases-to-tests.diff
47+
criteria: >-
48+
Rename 'cases' to 'tests' throughout the agentevals spec docs.
49+
input:
50+
- role: user
51+
content: |
52+
agentv renamed cases→tests in the eval schema (PR #240).
53+
Update all agentevals spec docs to match.
54+
assert:
55+
- name: sync-check
56+
type: code_judge
57+
script: ["bun", "run", "../scripts/validate-sync.ts"]
58+
expected_files_modified:
59+
- agentevals/docs/src/content/docs/specification/eval-format.mdx
60+
- agentevals/docs/src/content/docs/specification/evalcase-schema.mdx
61+
expected_keywords: [tests]
62+
63+
- id: schema-field-rename-sync
64+
metadata:
65+
agentevals_before: "81f4b44"
66+
ground_truth: ../evals/ground-truth/schema-field-rename.diff
67+
criteria: >-
68+
Rename eval_cases→cases and expected_outcome→criteria/outcome in agentevals spec.
69+
input:
70+
- role: user
71+
content: |
72+
agentv renamed schema fields: eval_cases→cases, expected_outcome→criteria
73+
at case level, expected_outcome→outcome at rubric level (PR #202).
74+
Update agentevals spec docs accordingly.
75+
assert:
76+
- name: sync-check
77+
type: code_judge
78+
script: ["bun", "run", "../scripts/validate-sync.ts"]
79+
expected_files_modified:
80+
- agentevals/docs/src/content/docs/specification/eval-format.mdx
81+
- agentevals/docs/src/content/docs/specification/evalcase-schema.mdx
82+
expected_keywords: [cases, criteria, outcome]
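
The contents of `validate-sync.ts` are not shown in this view, so the following is only a guess at the kind of check the `expected_files_modified` and `expected_keywords` fields suggest. Everything in it is hypothetical, including the function names and the equal-weight scoring rule. It also assumes the captured file changes are a git-style unified diff whose `+++ b/<path>` headers use the same relative paths as the YAML above.

```typescript
// Hypothetical scoring helper; not the validate-sync.ts from this commit.
interface SyncExpectations {
  expected_files_modified: string[];
  expected_keywords: string[];
}

// Collect paths from "+++ b/<path>" headers of a git-style unified diff.
function modifiedFiles(diff: string): Set<string> {
  const files = new Set<string>();
  for (const line of diff.split("\n")) {
    if (line.startsWith("+++ b/")) files.add(line.slice("+++ b/".length).trim());
  }
  return files;
}

// Text of added lines only ("+" lines, excluding the "+++" file headers).
function addedText(diff: string): string {
  return diff
    .split("\n")
    .filter((line) => line.startsWith("+") && !line.startsWith("+++"))
    .map((line) => line.slice(1))
    .join("\n");
}

// Score = fraction of expected files and keywords found (assumed rule).
function scoreSync(
  diff: string,
  expected: SyncExpectations,
): { score: number; misses: string[] } {
  const files = modifiedFiles(diff);
  const added = addedText(diff);
  const misses: string[] = [];
  for (const file of expected.expected_files_modified) {
    if (!files.has(file)) misses.push(`file not modified: ${file}`);
  }
  for (const keyword of expected.expected_keywords) {
    if (!added.includes(keyword)) misses.push(`keyword not added: ${keyword}`);
  }
  const total =
    expected.expected_files_modified.length + expected.expected_keywords.length;
  const score = total === 0 ? 1 : (total - misses.length) / total;
  return { score, misses };
}

// Usage with a made-up diff that touches one of the two expected files.
const sampleDiff = [
  "--- a/agentevals/docs/src/content/docs/specification/eval-format.mdx",
  "+++ b/agentevals/docs/src/content/docs/specification/eval-format.mdx",
  "@@ -1,1 +1,1 @@",
  "-cases:",
  "+tests:",
].join("\n");
const verdict = scoreSync(sampleDiff, {
  expected_files_modified: [
    "agentevals/docs/src/content/docs/specification/eval-format.mdx",
    "agentevals/docs/src/content/docs/specification/evalcase-schema.mdx",
  ],
  expected_keywords: ["tests"],
});
```

Matching keywords only against added lines avoids crediting a keyword that appears solely in removed or context text. A substring match is still coarse, which may be one reason the showcase also ships ground-truth diffs.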
