Replace setup/teardown with before_all/after_all/before_each/after_each
lifecycle hooks (bun:test/Vitest naming). Shared workspace across tests
in a suite with after_each reset. Remove workspaceFingerprint (YAGNI).
Add cross-repo-sync showcase demonstrating the full workspace config
surface with real ground truth diffs from agentevals commit history.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
-| `metadata` | No | Arbitrary key-value pairs passed to setup/teardown scripts |
+| `metadata` | No | Arbitrary key-value pairs passed to lifecycle scripts |
 | `rubrics` | No | Structured evaluation criteria |
 | `assert` | No | Per-test evaluators |
 | `sidecar` | No | Additional metadata passed to evaluators |
@@ -117,28 +117,28 @@ Override the suite-level workspace config for individual tests. Test-level field
 
 ```yaml
 workspace:
-  setup:
+  before_all:
     script: ["bun", "run", "default-setup.ts"]
 
 tests:
   - id: case-1
     criteria: Should work
     input: Do something
     workspace:
-      setup:
+      before_all:
         script: ["bun", "run", "custom-setup.ts"]
 
   - id: case-2
     criteria: Should also work
     input: Do something else
-    # Inherits suite-level setup
+    # Inherits suite-level before_all
 ```
 
-See [Workspace Setup/Teardown](/targets/configuration/#workspace-setupteardown) for the full workspace config reference.
+See [Workspace Lifecycle Hooks](/targets/configuration/#workspace-lifecycle-hooks) for the full workspace config reference.
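The override in the YAML example can be read as a field-by-field replacement: `case-1` supplies its own `before_all`, which replaces the suite-level one, while `case-2` has no `workspace` block and inherits it. The Python sketch below illustrates that reading. `resolve_workspace` is a hypothetical helper, not AgentV's implementation, and the shallow-merge granularity (a whole hook is replaced rather than merged key by key) is an assumption.

```python
from typing import Any, Optional

WorkspaceConfig = dict[str, Any]


def resolve_workspace(
    suite: Optional[WorkspaceConfig],
    test: Optional[WorkspaceConfig],
) -> WorkspaceConfig:
    """Hypothetical helper: test-level fields replace suite-level fields."""
    merged: WorkspaceConfig = dict(suite or {})
    # Shallow merge (assumption): a test-level hook replaces the whole
    # suite-level hook instead of being combined with it.
    merged.update(test or {})
    return merged


suite = {"before_all": {"script": ["bun", "run", "default-setup.ts"]}}
case_1 = {"before_all": {"script": ["bun", "run", "custom-setup.ts"]}}

print(resolve_workspace(suite, case_1)["before_all"]["script"][-1])  # custom-setup.ts
print(resolve_workspace(suite, None)["before_all"]["script"][-1])    # default-setup.ts
```

Copying the suite config before updating it keeps the suite-level dictionary unchanged, so one test's override cannot leak into the next.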
 
 ## Per-Case Metadata
 
-Pass arbitrary key-value pairs to setup/teardown scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:
+Pass arbitrary key-value pairs to lifecycle scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:
 
 ```yaml
 tests:
@@ -149,11 +149,11 @@ tests:
       repo: sympy/sympy
       base_commit: "abc123def"
     workspace:
-      setup:
+      before_all:
         script: ["python", "checkout_repo.py"]
 ```
 
-The `metadata` field is included in the stdin JSON passed to setup and teardown scripts as `case_metadata`.
+The `metadata` field is included in the stdin JSON passed to lifecycle scripts as `case_metadata`.
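As a rough illustration of the consuming side, a script such as `checkout_repo.py` might read the stdin payload and pull `repo` and `base_commit` out of `case_metadata`. Only the `case_metadata` key comes from the docs. The function names, the GitHub URL scheme, and the `repo` target directory are assumptions made for this sketch.

```python
import json
import subprocess
from typing import Any, Callable


def read_case_metadata(raw: str) -> dict[str, Any]:
    """Extract `case_metadata` from the JSON payload received on stdin."""
    payload = json.loads(raw)
    # `case_metadata` is the only payload key this sketch relies on.
    return payload.get("case_metadata") or {}


def checkout_commands(metadata: dict[str, Any]) -> list[list[str]]:
    """Build the git commands for the repo and commit named in the metadata."""
    repo = metadata["repo"]           # e.g. "sympy/sympy"
    commit = metadata["base_commit"]  # e.g. "abc123def"
    url = f"https://github.com/{repo}.git"  # assumption: repo is a GitHub slug
    return [
        ["git", "clone", url, "repo"],  # "repo" is a hypothetical target dir
        ["git", "-C", "repo", "checkout", commit],
    ]


def main(stdin_text: str, run: Callable[..., object] = subprocess.run) -> None:
    """Run each command; check=True raises if git exits non-zero."""
    for cmd in checkout_commands(read_case_metadata(stdin_text)):
        run(cmd, check=True)

# In the real script the entry point would be: main(sys.stdin.read())
```

Passing `run` as a parameter keeps the command-building logic testable without touching the network or the file system.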
apps/web/src/content/docs/targets/configuration.mdx
21 additions & 18 deletions
@@ -99,47 +99,54 @@ targets:
 ```
 
 When `workspace_template` is set:
-- The template directory is copied to `~/.agentv/workspaces/<eval-run-id>/<test-id>/`
+- The template directory is copied to `~/.agentv/workspaces/<eval-run-id>/shared/`
 - The `.git` directory is skipped during copy
-- Each test gets its own isolated copy
+- Tests share the workspace; use `after_each` to reset state between tests
 
-### Workspace Setup/Teardown
+### Workspace Lifecycle Hooks
 
-Run scripts before and after each test using the `workspace` block. This can be defined at the suite level (applies to all tests) or per test (overrides suite-level).
+Run scripts at different points in the evaluation lifecycle using the `workspace` block. This can be defined at the suite level (applies to all tests) or per test (overrides suite-level).
 
 ```yaml
 workspace:
   template: ./workspace-templates/my-project
-  setup:
+  before_all:
     script: ["bun", "run", "setup.ts"]
     timeout_ms: 120000
     cwd: ./scripts
-  teardown:
-    script: ["bun", "run", "teardown.ts"]
+  after_each:
+    script: ["bun", "run", "reset.ts"]
+    timeout_ms: 5000
+  after_all:
+    script: ["bun", "run", "cleanup.ts"]
     timeout_ms: 30000
 ```
 
 | Field | Description |
 |-------|-------------|
 | `template` | Directory to copy as workspace (alternative to target-level `workspace_template`) |
-| `setup` | Script to run after workspace creation, before the agent runs |
-| `teardown` | Script to run after evaluation, before cleanup |
+| `before_all` | Runs once after workspace creation, before the first test |
+| `after_all` | Runs once after the last test, before cleanup |
+| `before_each` | Runs before each test |
+| `after_each` | Runs after each test (e.g., reset workspace state for reuse) |
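Read together, the four table rows imply a fixed ordering for one suite run. The sketch below spells that ordering out as a list of events. It is an illustration derived from the table descriptions, not AgentV's scheduler, and `hook_order` is a hypothetical name.

```python
def hook_order(test_ids: list[str]) -> list[str]:
    """Return the order in which hooks and tests would run for one suite."""
    events = ["before_all"]           # once, after workspace creation
    for test_id in test_ids:
        events.append("before_each")  # before each test
        events.append(f"test:{test_id}")
        events.append("after_each")   # after each test, e.g. reset state
    events.append("after_all")        # once, after the last test
    return events


print(hook_order(["case-1", "case-2"]))
# ['before_all', 'before_each', 'test:case-1', 'after_each',
#  'before_each', 'test:case-2', 'after_each', 'after_all']
```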
**Shared workspace:** The workspace is created once and shared across all tests in a suite. Use `after_each` to reset state between tests (e.g., `git checkout . && git clean -fd`).
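A reset hook built around those two git commands could look like the following minimal sketch. The function name, the way the workspace path reaches the script, and the choice of Python instead of the `reset.ts` used in the config example are all assumptions.

```python
import subprocess
from typing import Callable

# The two commands from the docs' example: discard changes to tracked
# files, then delete untracked files and directories.
RESET_COMMANDS: list[list[str]] = [
    ["git", "checkout", "."],
    ["git", "clean", "-fd"],
]


def reset_workspace(
    workspace_dir: str,
    run: Callable[..., object] = subprocess.run,
) -> None:
    """Restore a shared git workspace to its last committed state."""
    for cmd in RESET_COMMANDS:
        # check=True raises CalledProcessError if git exits non-zero.
        run(cmd, cwd=workspace_dir, check=True)

# Possible entry point (assumption: the hook starts inside the workspace):
#   reset_workspace(os.getcwd())
```

Note that `git clean -fd` leaves ignored files in place; adding `-x` would remove those as well, which may or may not be wanted for build caches.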
 
 **Error handling:**
-- Setup failure aborts the test with an error result
-- Teardown failure is non-fatal (warning only)
+- `before_all`/`before_each` failure aborts the test with an error result
+- `after_all`/`after_each` failure is non-fatal (warning only)
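The new error-handling rule can be summarised as a small lookup from hook name to outcome. This is only a restatement of the two bullets in code form; `Outcome` and `on_hook_failure` are hypothetical names, and how AgentV detects a failed script is not specified here.

```python
from enum import Enum


class Outcome(Enum):
    ABORT = "abort"  # the test is reported with an error result
    WARN = "warn"    # the failure is non-fatal (warning only)


def on_hook_failure(hook: str) -> Outcome:
    """Map a failed lifecycle hook to the documented behaviour."""
    if hook in ("before_all", "before_each"):
        return Outcome.ABORT
    if hook in ("after_all", "after_each"):
        return Outcome.WARN
    raise ValueError(f"unknown hook: {hook}")


print(on_hook_failure("before_each").value)  # abort
print(on_hook_failure("after_each").value)   # warn
```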
 
-**Script context:** Both scripts receive a JSON object on stdin with case context:
+**Script context:** All scripts receive a JSON object on stdin with case context:
 
 ```json
 {
@@ -153,10 +160,6 @@ Each script config accepts:
 
 **Suite vs per-test:** When both are defined, test-level fields replace suite-level fields. See [Per-Test Workspace Config](/evaluation/eval-cases/#per-case-workspace-config) for examples.
 
-### Workspace Fingerprinting
-
-After setup and git baseline initialization, AgentV computes a SHA-256 fingerprint of the workspace file tree. This fingerprint is included in the evaluation result as `workspaceFingerprint` and can be used to verify that workspaces are reproducible across runs.
+Evaluates whether a coding agent can keep two public repos in sync after one changes.
+
+## Scenario
+
+When **agentv** (EntityProcess/agentv) ships a feature, the **agentevals** (agentevals/agentevals) spec docs must be updated to reflect the change. This eval measures how well an agent handles that cross-repo synchronization.
+
+## Workspace Features Demonstrated
+
+| Feature | Usage |
+|---------|-------|
+| `workspace.template` | AGENTS.md + skills dir copied to workspace |
+| `workspace.before_each` | Clones agentevals at "before" state per test |
+| `workspace.after_each` | Resets git state between tests |
+| `metadata` | Commit SHAs passed to setup via stdin JSON |