diff --git a/.agents/skills/audit-markers/SKILL.md b/.agents/skills/audit-markers/SKILL.md
index 558e8c9a9..e71ff1607 100644
--- a/.agents/skills/audit-markers/SKILL.md
+++ b/.agents/skills/audit-markers/SKILL.md
@@ -69,7 +69,8 @@ test actually loads, and report whether the gate is correctly set or too loose.
 
 Read these before auditing — they are the authoritative source for marker conventions:
 
-- **Marker guide:** `test/MARKERS_GUIDE.md`
+- **Test strategy:** `test/README.md` — classification decision rules, per-tier definitions, philosophy, authoring guide, CI pipeline
+- **Marker guide:** `test/README.md` — marker tables, common patterns, backend reference (sections: Backend markers, Resource predicates, Common patterns)
 - **Marker registration:** `test/conftest.py` (`pytest_configure`) and `pyproject.toml` (`[tool.pytest.ini_options]`)
 - **Resource predicates:** `test/predicates.py` (predicate functions for resource gating)
 - **Example marker format:** `docs/examples/conftest.py` (`_extract_markers_from_file`)
@@ -362,7 +363,7 @@ When asking, present the test code and your reasoning so the user can make an in
 
 # Part 2: Project-Specific Rules
 
-Read `test/MARKERS_GUIDE.md` for the full marker reference (marker tables,
+Read `test/README.md` for the full marker reference (marker tables,
 resource gates, auto-skip logic, common patterns). This section covers only
 the **code analysis heuristics** the skill needs to classify tests — things
 that require reading the test source code rather than looking up a table.
@@ -848,18 +849,18 @@ in the `pytestmark` list or decorator.
 Check that every backend used in test files has a registered marker.
 The project's backend registry is `BACKEND_MARKERS` in `test/conftest.py`
 (single source of truth). Markers must also appear in `pyproject.toml`
-`[tool.pytest.ini_options].markers` and in `test/MARKERS_GUIDE.md`.
+`[tool.pytest.ini_options].markers` and in `test/README.md`.
 
 For each backend constructor or `start_session(backend_name=...)` call
 found during classification, verify:
 
 1. A marker exists in `BACKEND_MARKERS` for that backend.
 2. The marker appears in `pyproject.toml`.
-3. The marker appears in the MARKERS_GUIDE.md backend table.
+3. The marker appears in the `test/README.md` backend table.
 
 If a backend is used in tests but has no registered marker, flag it as
 a **missing backend marker** issue and add it to the registry, pyproject.toml,
-and MARKERS_GUIDE.md (same apply/confirm rules as other fixes in Step 4).
+and `test/README.md` (same apply/confirm rules as other fixes in Step 4).
 
 ## Step 6 — Flag infrastructure notes
 
@@ -896,7 +897,7 @@ flag as a blocker, don't silently re-add:
 - **Backend marker registry:** `BACKEND_MARKERS` dict in `test/conftest.py` is
   the single source of truth for backend markers. `pytest_configure` iterates
   over it. New backends are added by inserting one entry into the dict.
-  `pyproject.toml` and `test/MARKERS_GUIDE.md` must stay in sync manually.
+  `pyproject.toml` and `test/README.md` must stay in sync manually.
 - **Resource predicates:** `test/predicates.py` provides `require_gpu`,
   `require_ram`, `require_api_key`, `require_package`,
   `require_python`.
diff --git a/.agents/skills/skill-author/SKILL.md b/.agents/skills/skill-author/SKILL.md
index ca1e5ca8b..b305bbdae 100644
--- a/.agents/skills/skill-author/SKILL.md
+++ b/.agents/skills/skill-author/SKILL.md
@@ -134,7 +134,7 @@ After frontmatter, write clear markdown instructions the agent follows:
   "Check if markers are correct" is worse than "Compare the test's assertions
   to the qualitative decision rule in section 3."
 - **Reference project files.** Point to docs, configs, and examples by relative
-  path so the agent can read them. E.g. "See `test/MARKERS_GUIDE.md` for the
+  path so the agent can read them. E.g. "See `test/README.md` for the
   full marker taxonomy."
 - **Declare scope boundaries.** State what the skill does NOT do. E.g. "This
   skill does not modify conftest.py — flag infrastructure issues as notes."
diff --git a/AGENTS.md b/AGENTS.md
index ff27fe273..493dd472b 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -49,7 +49,7 @@ uv run mypy .                         # Type check
 ## 3. Test Markers
 Tests use a four-tier granularity system (`unit`, `integration`, `e2e`, `qualitative`) plus backend and resource markers. The `unit` marker is auto-applied by conftest — never write it explicitly. The `llm` marker is deprecated; use `e2e` instead.
 
-See **[test/MARKERS_GUIDE.md](test/MARKERS_GUIDE.md)** for the full marker reference (tier definitions, backend markers, resource gates, auto-skip logic, common patterns).
+See **[test/README.md](test/README.md)** for classification rules, authoring guide, marker reference, CI tier map, and local workflow.
 
 **Examples in `docs/examples/`** are opt-in — unlike `test/` files (auto-collected, default `unit`), examples require an explicit `# pytest:` comment to be collected. Files without this comment are silently ignored (they won't appear in skip summaries either). This is because examples have variable dependencies and limited setup:
 ```python
@@ -115,6 +115,8 @@ Pre-commit runs: ruff, mypy, uv-lock, codespell
 
 ## 10. Writing Tests
 
+See **[test/README.md — Authoring guide](test/README.md#authoring-guide)** for the full authoring guide (naming, fixture discipline, mock discipline, assertion style).
+
 - Place tests in `test/` mirroring source structure
 - Name files `test_*.py` (required for pydocstyle)
 - Use `gh_run` fixture for CI-aware tests (see `test/conftest.py`)
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 7deee4c6c..c726c660e 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -320,6 +320,9 @@ as it can corrupt state.
 
 ### Quick Reference
 
+See [test/README.md](test/README.md) for classification rules, authoring guide,
+CI tier map, coverage, and the full local workflow reference. Essential commands:
+
 ```bash
 # Install all dependencies (required for tests)
 uv sync --all-extras --all-groups
@@ -327,25 +330,12 @@ uv sync --all-extras --all-groups
 # Start Ollama (required for most tests)
 ollama serve
 
-# Default: qualitative tests, skip slow tests
+# Default: includes qualitative tests, skips slow tests
 uv run pytest
 
 # Fast tests only (no qualitative, ~2 min)
 uv run pytest -m "not qualitative"
 
-# Unit tests only (self-contained, no services)
-uv run pytest -m unit
-
-# Run only slow tests (>1 min)
-uv run pytest -m slow
-
-# Run specific backend tests
-uv run pytest -m "ollama"
-uv run pytest -m "openai"
-
-# CI/CD mode (skips qualitative tests)
-CICD=1 uv run pytest
-
 # Lint and format
 uv run ruff format .
 uv run ruff check .
@@ -395,23 +385,26 @@ for m in granite4:micro granite4:micro-h deepseek-r1:8b \
 
 ### Test Markers
 
-Tests use a four-tier granularity system (`unit`, `integration`, `e2e`, `qualitative`) plus backend and resource markers. See [test/MARKERS_GUIDE.md](test/MARKERS_GUIDE.md) for the full marker reference, including tier definitions, backend markers, resource gates, and auto-skip logic.
+Tests use a four-tier granularity system (`unit`, `integration`, `e2e`, `qualitative`) plus backend and resource markers. See [test/README.md](test/README.md) for the full guide: classification rules, marker reference, authoring guide, CI tiers, and auto-skip logic.
 
 ### CI/CD Tests
 
 CI runs the following checks on every pull request:
-1. **Pre-commit hooks** (`pre-commit run --all-files`) - Ruff, mypy, uv-lock, codespell
-2. **Test suite** (`CICD=1 uv run pytest`) - Skips qualitative tests for speed
+1. **Pre-commit hooks** (`pre-commit run --all-files`) — ruff, mypy, uv-lock, codespell
+2. **Test suite** — `CICD=1 uv run pytest test` on Python 3.11/3.12/3.13 with Ollama running; skips qualitative tests
 
 To replicate CI locally:
 ```bash
-# Run pre-commit checks (same as CI)
+# Pre-commit checks (same as CI)
 pre-commit run --all-files
 
-# Run tests with CICD flag (same as CI, skips qualitative tests)
-CICD=1 uv run pytest
+# Tests with CICD flag (skips qualitative, matches CI scope)
+CICD=1 uv run pytest test
 ```
 
+See [test/README.md — CI pipeline](test/README.md#ci-pipeline) for the full CI
+breakdown and planned nightly/pre-release tiers.
+
 ### Timing Expectations
 
 - Fast tests (`-m "not qualitative"`): ~2 minutes
@@ -448,7 +441,7 @@ print(m.last_prompt())
 
 ### Getting Help
 
-- Check this guide and [test/MARKERS_GUIDE.md](test/MARKERS_GUIDE.md)
+- Check this guide and [test/README.md](test/README.md)
 - Search [existing issues](https://github.com/generative-computing/mellea/issues)
 - Check out [Github Discussions](https://github.com/generative-computing/mellea/discussions)
 - Open a new issue with the appropriate label
@@ -459,7 +452,7 @@ print(m.last_prompt())
 
 - **[Docs writing guide](docs/docs/guide/CONTRIBUTING.md)** - Conventions, PR checklist, and review process for documentation contributions
 - **[API Documentation](https://docs.mellea.ai)** - Published documentation site
-- **[Test Markers Guide](test/MARKERS_GUIDE.md)** - Detailed pytest marker documentation
+- **[Test Guide](test/README.md)** - Test strategy, classification, markers, and authoring guide
 - **[AGENTS.md](AGENTS.md)** - Guidelines for AI assistants working on Mellea internals
 - **[AGENTS_TEMPLATE.md](docs/AGENTS_TEMPLATE.md)** - Template for projects using Mellea
 
diff --git a/docs/docs/community/contributing-guide.md b/docs/docs/community/contributing-guide.md
index bb583292c..3330db249 100644
--- a/docs/docs/community/contributing-guide.md
+++ b/docs/docs/community/contributing-guide.md
@@ -196,49 +196,13 @@ git commit -n -m "wip: intermediate work"
 
 ## Testing
 
-### Test markers
+The full test guide lives in
+[`test/README.md`](https://github.com/generative-computing/mellea/blob/main/test/README.md)
+in the repository — classification rules, marker reference (tier definitions,
+backend matrix, resource predicates), authoring guide, CI pipeline, and
+GPU/Ollama operational notes.
 
-Tests use a four-tier granularity system. Every test belongs to exactly one tier:
-
-| Tier | When to use | How to apply |
-| ---- | ----------- | ------------ |
-| `unit` | Self-contained, no services, no I/O | Auto-applied — never write `@pytest.mark.unit` |
-| `integration` | Real SDK/library boundary or multi-component wiring | `@pytest.mark.integration` |
-| `e2e` | Real backends (Ollama, APIs, GPU models), deterministic assertions | `@pytest.mark.e2e` + backend marker(s) |
-| `qualitative` | Subset of e2e with non-deterministic output assertions | `@pytest.mark.qualitative` per-function, `e2e` + backend at module level |
-
-**Backend markers** (only for e2e/qualitative tests):
-
-| Marker | Backend | Resources |
-| ------ | ------- | --------- |
-| `ollama` | Ollama (port 11434) | Local, light (~2–4 GB RAM) |
-| `openai` | OpenAI API or compatible | API calls (may use Ollama `/v1`) |
-| `watsonx` | Watsonx API | API calls, requires credentials |
-| `huggingface` | HuggingFace transformers | Local, GPU required |
-| `litellm` | LiteLLM (wraps other backends) | Depends on underlying backend |
-| `bedrock` | AWS Bedrock | API calls, requires credentials |
-
-**Resource predicates** (from `test/predicates.py`, for e2e/qualitative tests):
-
-| Predicate | Use when test needs |
-| --------- | ------------------- |
-| `require_gpu()` | Any GPU (CUDA or MPS) |
-| `require_gpu(min_vram_gb=N)` | GPU with at least N GB VRAM |
-| `require_ram(min_gb=N)` | N GB+ system RAM |
-| `require_api_key("ENV_VAR")` | Specific API credentials |
-| `require_package("pkg")` | Optional dependency |
-| `require_python((3, 11))` | Minimum Python version |
-
-**Other markers:**
-
-| Marker | Purpose |
-| ------ | ------- |
-| `slow` | Tests taking >1 minute (excluded by default) |
-| `qualitative` | Non-deterministic output (skipped when `CICD=1`) |
-
-For more information, see our [Markers Guide](https://github.com/generative-computing/mellea/blob/main/test/MARKERS_GUIDE.md).
-
-### Running tests
+Essential commands:
 
 ```bash
 # Install all dependencies (required for tests)
@@ -247,43 +211,15 @@ uv sync --all-extras --all-groups
 # Start Ollama (required for most tests)
 ollama serve
 
-# Default: runs qualitative tests, skips slow tests
-uv run pytest
-
 # Fast tests only (no qualitative, ~2 min)
 uv run pytest -m "not qualitative"
 
-# Run only slow tests (>5 min)
-uv run pytest -m slow
-
-# Run specific backend tests
-uv run pytest -m "ollama"
-uv run pytest -m "openai"
-
-# Run unit tests only (no backends needed)
-uv run pytest -m unit
-
-# CI/CD mode (skips qualitative tests)
-CICD=1 uv run pytest
-```
-
-### Timing expectations
-
-| Run | Duration |
-| --- | -------- |
-| Fast tests (`-m "not qualitative"`) | ~2 minutes |
-| Default (qualitative, no slow) | Several minutes |
-| Slow tests (`-m slow`) | More than 1 minute |
-| Pre-commit hooks | 1–5 minutes |
-
-### Replicate CI locally
+# Default: includes qualitative, skips slow
+uv run pytest
 
-```bash
-# Run pre-commit checks (same as CI)
+# Replicate CI locally
 pre-commit run --all-files
-
-# Run tests with CICD flag (same as CI, skips qualitative tests)
-CICD=1 uv run pytest
+CICD=1 uv run pytest test
 ```
 
 ## Pull request process
diff --git a/docs/docs/how-to/unit-test-generative-code.md b/docs/docs/how-to/unit-test-generative-code.md
index 0c62cd271..18e25dd6c 100644
--- a/docs/docs/how-to/unit-test-generative-code.md
+++ b/docs/docs/how-to/unit-test-generative-code.md
@@ -7,7 +7,8 @@ description: "Write reliable tests for @generative functions using pytest marker
 **Prerequisites:** [Quick Start](../getting-started/quickstart) complete,
 `pip install mellea`, Ollama running locally, `pytest` installed.
 
-> **Contributing to Mellea itself?** See the [Contributing Guide](../community/contributing-guide#testing)
+> **Contributing to Mellea itself?** See
+> [`test/README.md`](https://github.com/generative-computing/mellea/blob/main/test/README.md)
 > for Mellea's own test markers, fixtures, and CI setup.
 
 Testing generative code requires you to separate concerns: some assertions are
diff --git a/test/MARKERS_GUIDE.md b/test/MARKERS_GUIDE.md
deleted file mode 100644
index 8f0ad3b50..000000000
--- a/test/MARKERS_GUIDE.md
+++ /dev/null
@@ -1,322 +0,0 @@
-# Pytest Markers Guide
-
-## Quick Reference
-
-```bash
-# By granularity tier
-pytest -m unit                          # Self-contained, no services (fast)
-pytest -m integration                   # Real SDK/library boundary or multi-component wiring
-pytest -m e2e                           # Real backends (ollama, APIs, GPU models)
-pytest -m "e2e and not qualitative"     # Deterministic real-backend tests only
-
-# By backend
-pytest -m ollama                        # Ollama tests
-pytest -m huggingface                   # HuggingFace tests
-pytest -m "openai or watsonx"           # Cloud API tests
-
-# By characteristics
-pytest -m "not qualitative"             # Fast, deterministic tests (~2 min)
-pytest -m qualitative                   # Non-deterministic output quality tests
-pytest -m slow                          # Long-running tests (>1 min)
-
-# Default (configured in pyproject.toml): skips slow, includes qualitative
-pytest
-```
-
-## Granularity Tiers
-
-Every test belongs to exactly one tier. The tier determines what infrastructure
-the test needs and how fast/heavy it is to run.
-
-### Unit (auto-applied)
-
-**Entirely self-contained** — no services, no I/O, no fixtures that connect
-to anything external. Pure logic testing.
-
-- Auto-applied by conftest hook when no other granularity marker is present
-- **Never write `@pytest.mark.unit` on files** — it is implicit
-- Runs in milliseconds to low seconds, minimal memory
-- Would pass on any machine with just Python and project deps
-
-```python
-# No markers needed — auto-applied as unit
-def test_cblock_repr():
-    assert str(CBlock(value="hi")) == "hi"
-```
-
-### Integration (explicit)
-
-**Verifies that your code correctly communicates across a real boundary.**
-The boundary may be a third-party SDK/library whose API contract you are
-asserting against, multiple internal components wired together, or a
-fixture-managed local service. What distinguishes integration from unit is
-that at least one real external component — not a mock or stub — is on the
-other side of the boundary being tested.
-
-- Add `@pytest.mark.integration` explicitly
-- No backend markers needed — integration tests do not use real LLM backends
-- Slower than unit (fixture setup, real SDK objects), but faster than e2e
-
-**Positive indicators:**
-
-- Uses a real third-party SDK object to *capture and assert* on output —
-  e.g. `InMemoryMetricReader`, `InMemorySpanExporter`, `LoggingHandler` —
-  rather than patching the SDK away
-- Asserts on the format or content of data as received by an external
-  component (semantic conventions, attribute names, accumulated values)
-- Wires multiple real project components together and mocks only at the
-  outermost boundary
-- Breaking the interface between your code and the external component
-  (e.g. a changed attribute name, a missing SDK call) would cause the test
-  to fail
-
-**Negative indicators (likely unit instead):**
-
-- All external boundaries replaced with `MagicMock`, `patch`, or `AsyncMock`
-- Third-party library imported only as a type or helper, not as a real
-  collaborator being asserted against
-- Toggles env vars and checks booleans or config state with no real SDK
-  objects instantiated
-
-**Tie-breaker:** If you changed the contract between your code and the
-external component, would this test catch it? If yes → integration. If no
-→ unit.
-
-```python
-@pytest.mark.integration
-def test_token_metrics_format(clean_metrics_env):
-    # Real InMemoryMetricReader — asserting against the OTel SDK contract
-    reader = InMemoryMetricReader()
-    provider = MeterProvider(metric_readers=[reader])
-    record_token_usage_metrics(input_tokens=10, output_tokens=5, ...)
-    metrics_data = reader.get_metrics_data()
-    assert metrics_data.resource_metrics[0]...name == "mellea.llm.tokens.input"
-
-@pytest.mark.integration
-def test_session_chains_components(mock_backend):
-    # Multiple real project components wired together; only LLM call mocked
-    session = start_session(backend=mock_backend)
-    result = session.instruct("hello")
-    assert mock_backend.generate.called
-```
-
-### E2E (explicit)
-
-**Tests against real backends** — cloud APIs, local servers (ollama), or
-GPU-loaded models (huggingface, vllm). No mocks on the critical path.
-
-- Add `@pytest.mark.e2e` explicitly, always combined with backend marker(s)
-- Resource predicates (`require_gpu()`, `require_ram()`, etc.) only apply to
-  e2e and qualitative tests — see "Resource Gating" section below
-- Assertions are **deterministic** — structural, type-based, or functional
-
-```python
-pytestmark = [pytest.mark.e2e, pytest.mark.ollama]
-
-def test_structured_output(session):
-    result = session.format(Person, "Make up a person")
-    assert isinstance(json.loads(result.value), dict)
-```
-
-### Qualitative (explicit, per-function)
-
-**Subset of e2e.** Same infrastructure requirements, but assertions check
-**non-deterministic output content** that may vary across model versions or runs.
-
-- Add `@pytest.mark.qualitative` per-function (not module-level)
-- Module must also carry `e2e` + backend markers at module level
-- Skipped in CI when `CICD=1`
-- Included by default in local runs
-
-```python
-pytestmark = [pytest.mark.e2e, pytest.mark.ollama]
-
-@pytest.mark.qualitative
-def test_greeting_content(session):
-    result = session.instruct("Write a greeting")
-    assert "hello" in result.value.lower()
-```
-
-**Decision rule:** If swapping the model version could break the assertion
-despite the system working correctly, it is `qualitative`. If the assertion
-checks structure, types, or functional correctness, it is `e2e`.
-
-### The `llm` marker (deprecated)
-
-`llm` is a legacy marker equivalent to `e2e`. It remains registered for
-backward compatibility but should not be used in new tests. Use `e2e` instead.
-
-The conftest auto-apply hook treats `llm` the same as `e2e` — tests marked
-`llm` will not receive the `unit` marker.
-
-## Backend Markers
-
-Backend markers identify which backend a test needs. They enable selective
-test runs (`pytest -m ollama`) and drive auto-skip logic.
-
-**Backend markers only go on e2e and qualitative tests.** Unit and integration
-tests don't need real backends.
-
-| Marker         | Backend                       | Resources                             |
-| -------------- | ----------------------------- | ------------------------------------- |
-| `ollama`       | Ollama (port 11434)           | Local, light (~2-4GB RAM)             |
-| `openai`       | OpenAI API or compatible      | API calls (may use Ollama `/v1`)      |
-| `watsonx`      | Watsonx API                   | API calls, requires credentials       |
-| `huggingface`  | HuggingFace transformers      | Local, GPU required                   |
-| `vllm`         | vLLM                          | Local, GPU required                   |
-| `litellm`      | LiteLLM (wraps other backends)| Depends on underlying backend         |
-| `bedrock`      | AWS Bedrock                   | API calls, requires credentials       |
-
-### OpenAI-via-Ollama pattern
-
-Some tests use the OpenAI client pointed at Ollama's `/v1` endpoint. Mark
-these with **both** `openai` and `ollama`, but do **not** add `require_api_key`:
-
-```python
-pytestmark = [pytest.mark.e2e, pytest.mark.openai, pytest.mark.ollama]
-```
-
-## Resource Gating (Predicates)
-
-E2E and qualitative tests need gating so they skip cleanly when required
-infrastructure is absent. Use **predicate decorators** from `test/predicates.py`
-— they give test authors precise control over skip conditions.
-
-```python
-from test.predicates import require_gpu, require_api_key
-```
-
-| Predicate | Use when test needs |
-| --------- | ------------------- |
-| `require_gpu()` | Any GPU (CUDA or MPS) |
-| `require_gpu(min_vram_gb=N)` | GPU with at least N GB VRAM |
-| `require_ram(min_gb=N)` | N GB+ system RAM (genuinely RAM-bound tests only) |
-| `require_api_key("ENV_VAR")` | Specific API credentials |
-| `require_package("pkg")` | Optional dependency |
-| `require_python((3, 11))` | Minimum Python version |
-
-### Typical combinations
-
-- `huggingface` → `require_gpu(min_vram_gb=N)` (compute N from model params)
-- `vllm` → `require_gpu(min_vram_gb=N)` (compute N from model params)
-- `watsonx` → `require_api_key("WATSONX_API_KEY", "WATSONX_URL", "WATSONX_PROJECT_ID")`
-- `openai` → `require_api_key("OPENAI_API_KEY")` only for real OpenAI (not Ollama-compat)
-
-### Other gating markers
-
-These are not resource predicates but still control test selection:
-
-| Marker         | Gate                             | Auto-skip when                                   |
-| -------------- | -------------------------------- | ------------------------------------------------ |
-| `slow`         | Tests taking >1 minute           | Excluded by default via `pyproject.toml` addopts |
-| `qualitative`  | Non-deterministic output         | Skipped when `CICD=1`                            |
-
-### Removed markers
-
-`requires_gpu`, `requires_heavy_ram`, and `requires_gpu_isolation` have been
-removed. Use `require_gpu(min_vram_gb=N)` from `test.predicates` instead.
-`requires_api_key` is still active — see below.
-
-## Auto-Detection
-
-The test suite automatically detects system capabilities and skips tests
-whose requirements are not met. No configuration needed.
-
-| Capability | How detected                  |
-| ---------- | ----------------------------- |
-| Ollama     | Port 11434 check              |
-| GPU/VRAM   | `torch` + `sysctl hw.memsize` |
-| API keys   | Environment variable check    |
-
-Use `-rs` with pytest to see skip reasons:
-
-```bash
-pytest -rs
-```
-
-## Common Marker Patterns
-
-```python
-# Unit — no markers needed (auto-applied by conftest)
-def test_cblock_repr():
-    assert str(CBlock(value="hi")) == "hi"
-
-# Integration — mocked backend
-@pytest.mark.integration
-def test_session_with_mock(mock_backend):
-    session = start_session(backend=mock_backend)
-    result = session.instruct("hello")
-    assert mock_backend.generate.called
-
-# E2E — real Ollama backend, deterministic
-pytestmark = [pytest.mark.e2e, pytest.mark.ollama]
-def test_structured_output(session):
-    result = session.format(Person, "Make up a person")
-    assert isinstance(json.loads(result.value), dict)
-
-# E2E + qualitative — real backend, non-deterministic
-pytestmark = [pytest.mark.e2e, pytest.mark.ollama]
-@pytest.mark.qualitative
-def test_greeting_content(session):
-    result = session.instruct("Write a greeting")
-    assert "hello" in result.value.lower()
-
-# Heavy GPU e2e (predicates for resource gating)
-from test.predicates import require_gpu
-
-pytestmark = [pytest.mark.e2e, pytest.mark.huggingface,
-              require_gpu(min_vram_gb=20)]
-```
-
-## Example Files (`docs/examples/`)
-
-Unlike `test/` files (which are auto-collected and default to `unit`), examples
-require an explicit `# pytest:` comment to be collected. This opt-in approach
-reflects that examples often have variable dependencies and limited setup, so
-only files that declare themselves runnable should be executed.
-
-```python
-# pytest: e2e, ollama, qualitative
-"""Example description..."""
-```
-
-Same classification rules apply. The comment must appear in the first few
-lines before non-comment code. Files without this comment are silently
-ignored — they won't appear in skip summaries or collection output.
-Parser: `docs/examples/conftest.py` (`_extract_markers_from_file`).
-
-## Adding Markers to New Tests
-
-1. **Classify the test** — unit, integration, e2e, or qualitative?
-2. **Add granularity marker** — integration and e2e are explicit; unit is auto-applied
-3. **Add backend marker(s)** — only for e2e/qualitative
-4. **Add resource predicates** — only for e2e/qualitative, use `test/predicates.py`
-5. **Verify** — `pytest --collect-only -m "your_marker"` to check
-
-Use the `/audit-markers` skill to validate markers on existing or new test files.
-
-## CI/CD Integration
-
-```yaml
-jobs:
-  unit-tests:
-    run: pytest -m unit              # Fast, no services needed
-
-  ollama-tests:
-    run: pytest -m "e2e and ollama and not qualitative"
-
-  quality-tests:
-    if: github.event_name == 'schedule'
-    run: pytest -m "qualitative and ollama"
-```
-
-- `CICD=1` skips qualitative tests
-- `slow` tests excluded by default (add `-m slow` to include)
-
-## Related Files
-
-- `test/conftest.py` — marker registration, auto-detection, skip logic, unit auto-apply hook
-- `test/predicates.py` — resource gating predicates (`require_gpu`, `require_ram`, etc.)
-- `docs/examples/conftest.py` — example marker parser (`_extract_markers_from_file`)
-- `pyproject.toml` — marker definitions and pytest configuration
-- `.agents/skills/audit-markers/SKILL.md` — skill for auditing and fixing markers
diff --git a/test/README.md b/test/README.md
index 4c4c7a148..359957b12 100644
--- a/test/README.md
+++ b/test/README.md
@@ -1,147 +1,529 @@
 # Mellea Test Suite
 
-Test files must be named as `test_*.py` so that pydocstyle ignores them.
+This file covers everything you need to contribute tests: strategy, classification,
+marker reference, authoring guide, CI pipeline, and operational notes.
 
-## Running Tests
+## Quick start
+
+**First-time setup.** Follow the
+[Contributing Guide](../docs/docs/community/contributing-guide.md#development-setup)
+to install `uv`, sync deps, and install pre-commit hooks. For anything beyond
+unit tests you also need Ollama running locally:
 
 ```bash
-# Fast tests only (~2 min) - skips qualitative and slow tests
-uv run pytest -m "not qualitative"
+ollama serve &
+ollama pull granite4:micro && ollama pull granite4:micro-h
+```
 
-# Default - includes qualitative tests, skips slow tests
-uv run pytest
+**Running tests during development.**
 
-# All tests including slow tests (>5 min)
-uv run pytest -m slow
+```bash
+uv run pytest -m "not qualitative"            # ~2 min fast loop
+uv run pytest test/path/to/test_thing.py       # focus on one file
+uv run pytest -rs                              # show why anything skipped
 ```
 
-## Environment Variables
+**Adding a new test.**
 
-- `CICD=1` - Enable CI mode (skips qualitative tests)
-- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` - Helps with GPU memory fragmentation
+1. **Classify** it (see [Test tiers](#test-tiers)). If it runs in-process with no real
+   backend, third-party SDK object, or multi-component wiring, it's `unit` (no marker).
+2. **Place** the file at `test/<mirror of source path>/test_<module>.py`.
+3. **Add the granularity marker** — `integration`, `e2e`, or `qualitative` are
+   explicit; `unit` is auto-applied by conftest, never write it.
+4. **Add backend marker(s)** — only for `e2e`/`qualitative` tests
+   (`ollama`, `huggingface`, etc.). See [Backend markers](#backend-markers).
+5. **Add resource predicates** — only for `e2e`/`qualitative`, use
+   `test/predicates.py`. See [Resource predicates](#resource-predicates).
+6. **Verify collection** — `uv run pytest --collect-only -m "your_marker"`.
+7. **Run it** — `uv run pytest path::test_name -v`.
 
-## Ollama Model Eviction
+**Before opening a PR.**
 
-When pytest orchestrates many Ollama-backed tests in sequence, the default 5-minute
-keep-alive means models from earlier tests stay resident and accumulate, eventually
-starving later tests of memory.
+```bash
+uv run pre-commit run --all-files              # what CI runs first
+CICD=1 uv run pytest test                      # what CI runs second
+```
 
-Two mechanisms in `test/conftest.py` handle this:
+## Philosophy
 
-- **Per-module eviction** (`pytest_runtest_teardown`) — when crossing a file
-  boundary between Ollama-marked tests, queries `/api/ps` for all loaded models
-  and evicts them with `keep_alive=0`. Covers both `test/` and `docs/examples/`.
-  Always active, no flags required.
-- **Group warm-up/eviction** (`pytest_runtest_setup`) — warms up a fixed set of CI
-  models (`keep_alive=-1`) when entering the Ollama backend group and evicts them
-  when leaving. Requires `--group-by-backend`.
+Mellea tests assert **observable contracts**, not implementation details.
 
-**Trade-off:** if two consecutive test files use the same model, it will be unloaded
-and reloaded (~5-15 s overhead). Predictable memory behaviour is more important
-than saving a reload, especially on constrained CI runners. Tests within a single
-file share the loaded model with no overhead.
+- Test the public API surface, not private helpers.
+- Test cross-backend behaviour where it needs to be consistent.
+- A test that passes while the system is broken has negative value; prefer
+  fewer, more honest tests over coverage padding.
+- When a test fails, fix the **code**. Adjusting an assertion to silence a
+  failure is almost always wrong; fixing the test is acceptable only if the
+  test was never correctly written.
+- All tiers run locally by default. Qualitative tests (assertions on LLM output
+  *content*) are the one tier that CI skips, so non-deterministic checks never
+  block CI.
 
-**Caveat:** eviction targets *all* loaded Ollama models, not just those loaded by
-the test. If you are using Ollama interactively while the suite runs, your model
-will be evicted between test modules.
+## Test tiers
+
+Every test belongs to exactly one granularity tier. Apply the decision rules
+below in order:
 
-## GPU Testing on CUDA Systems
+| Question | Answer → tier |
+|----------|---------------|
+| Does it call a real LLM backend or external API? | Yes → **e2e** (or **qualitative**, see below) |
+| Does it assert against a real third-party SDK object (OTel reader, metrics collector)? | Yes → **integration** |
+| Does it wire multiple real project components together without external I/O? | Yes → **integration** |
+| Does everything happen in-process with no real external collaborators? | Yes → **unit** (auto-applied) |
 
-### The Problem: CUDA EXCLUSIVE_PROCESS Mode
+### Unit
 
-When running GPU tests on systems with `EXCLUSIVE_PROCESS` mode (common on HPC clusters), you may encounter "CUDA device busy" errors. This happens because:
+**Entirely self-contained** — no services, no I/O, no network. Pure logic:
+formatters, parsers, schema validation, config loading, pure helper functions.
+Runs in milliseconds on any machine.
 
-1. **Parent Process Context**: The pytest parent process creates a CUDA context when running regular tests
-2. **Subprocess Blocking**: Example tests run in subprocesses (via `docs/examples/conftest.py`)
-3. **Exclusive Access**: In `EXCLUSIVE_PROCESS` mode, only one process can hold a CUDA context per GPU
-4. **Result**: Subprocesses fail with "CUDA device busy" when the parent still holds the context
+The `unit` marker is **auto-applied by conftest** to every test that has no
+other granularity marker. Never write `@pytest.mark.unit` yourself.
 
-### Solution 1: NVIDIA MPS (Recommended)
+```python
+# No markers needed — auto-applied as unit
+def test_cblock_repr():
+    assert str(CBlock(value="hi")) == "hi"
+```
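+
+The auto-apply step could look roughly like the sketch below. It is an
+illustration only, not the project's actual hook: `GRANULARITY_MARKERS` and
+`needs_unit_marker` are hypothetical names, and the real logic lives in
+`test/conftest.py`.
+
+```python
+# Hypothetical sketch of a conftest hook that auto-applies `unit`.
+GRANULARITY_MARKERS = {"unit", "integration", "e2e", "qualitative", "llm"}
+
+
+def needs_unit_marker(marker_names):
+    """Return True when a test carries no granularity marker at all."""
+    return not (set(marker_names) & GRANULARITY_MARKERS)
+
+
+def pytest_collection_modifyitems(config, items):
+    """Standard pytest hook; runs once after collection."""
+    for item in items:
+        names = {marker.name for marker in item.iter_markers()}
+        if needs_unit_marker(names):
+            item.add_marker("unit")  # add_marker accepts a marker name
+```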
 
-**NVIDIA Multi-Process Service (MPS)** allows multiple processes to share a GPU in `EXCLUSIVE_PROCESS` mode:
+### Integration
 
-```bash
-# Enable MPS in your job scheduler configuration
-# Consult your HPC documentation for specific syntax
+**Verifies that your code correctly communicates across a real boundary.**
+The boundary may be a third-party SDK/library whose API contract you are
+asserting against, multiple internal components wired together, or a
+fixture-managed local service. What distinguishes integration from unit is
+that at least one real component (not a mock or stub) sits on the other side
+of the boundary being tested.
+
+Add `@pytest.mark.integration` explicitly; no backend marker is needed.
+
+**Positive indicators:**
+
+- Uses a real third-party SDK object to *capture and assert* on output —
+  e.g. `InMemoryMetricReader`, `InMemorySpanExporter`, `LoggingHandler` —
+  rather than patching the SDK away
+- Asserts on the format or content of data as received by an external
+  component (semantic conventions, attribute names, accumulated values)
+- Wires multiple real project components together and mocks only at the
+  outermost boundary
+- Breaking the interface between your code and the external component
+  would cause the test to fail
+
+**Negative indicators (likely unit instead):**
+
+- All external boundaries replaced with `MagicMock`, `patch`, or `AsyncMock`
+- Third-party library imported only as a type or helper, not as a real
+  collaborator being asserted against
+
+**Tie-breaker:** if you changed the contract between your code and the external
+component, would this test catch it? If yes → integration. If no → unit.
+
+```python
+@pytest.mark.integration
+def test_token_metrics_format(clean_metrics_env):
+    # Real InMemoryMetricReader — asserting against the OTel SDK contract
+    reader = InMemoryMetricReader()
+    provider = MeterProvider(metric_readers=[reader])
+    record_token_usage_metrics(input_tokens=10, output_tokens=5, ...)
+    metrics_data = reader.get_metrics_data()
+    assert metrics_data.resource_metrics[0]...name == "mellea.llm.tokens.input"
+
+@pytest.mark.integration
+def test_session_chains_components(mock_backend):
+    # Multiple real project components wired together; only LLM call mocked
+    session = start_session(backend=mock_backend)
+    result = session.instruct("hello")
+    assert mock_backend.generate.called
 ```
 
-### Solution 2: Run Smaller Test Subsets
+### E2E
 
-If MPS is unavailable, break down test execution into smaller subsets to avoid GPU sharing conflicts:
+**Tests against real backends** — cloud APIs, local servers (Ollama), or
+GPU-loaded models (HuggingFace, vLLM). No mocks on the critical path.
 
-```bash
-# Run tests and examples separately
-pytest -m huggingface test/
-pytest -m huggingface docs/examples/
+Add `@pytest.mark.e2e` explicitly, always combined with backend marker(s).
+Assertions must be **deterministic** — structural, type-based, or functional.
+Assertions on generated text content belong in qualitative tests, not e2e.
+
+```python
+pytestmark = [pytest.mark.e2e, pytest.mark.ollama]
 
-# Or run specific test directories
-pytest test/backends/test_huggingface.py
-pytest docs/examples/safety/
+def test_structured_output_returns_valid_json(session):
+    result = session.format(Person, "Make up a person")
+    assert isinstance(json.loads(result.value), dict)
 ```
 
-**Note**: If conflicts persist, continue breaking down into smaller subsets until tests pass. The key is reducing the number of concurrent GPU-using processes.
+### Qualitative
 
-### Why This Matters
+**A sub-tier of e2e**: same infrastructure requirements, but assertions check
+**non-deterministic output content** that may vary across model versions or runs.
 
-The test infrastructure runs examples in subprocesses (see `docs/examples/conftest.py`) to:
-- Isolate example execution environments
-- Capture stdout/stderr cleanly
-- Prevent cross-contamination between examples
+Add `@pytest.mark.qualitative` per-function (not at module level). The module
+still needs `e2e` and the backend marker. Qualitative tests are included in the
+default local run but skipped in CI (`CICD=1`).
 
-However, this creates the "Parent Trap": the parent pytest process holds a CUDA context from running regular tests, blocking subprocesses from accessing the GPU.
+```python
+pytestmark = [pytest.mark.e2e, pytest.mark.ollama]
 
-### Technical Details
+@pytest.mark.qualitative
+def test_greeting_contains_salutation(session):
+    result = session.instruct("Write a greeting")
+    assert "hello" in result.value.lower()    # content check — qualitative
+```
 
-**CUDA Context Lifecycle**:
-- Created on first CUDA operation (e.g., `torch.cuda.is_available()`)
-- Persists until process exit or explicit `cudaDeviceReset()`
-- In `EXCLUSIVE_PROCESS` mode, blocks other processes from GPU access
+**Decision rule:** if swapping the model version could break the assertion
+despite the system working correctly, it is `qualitative`. If the assertion
+checks structure, types, or functional correctness, it is `e2e`.
 
-**MPS Architecture**:
-- Runs as a proxy service between applications and GPU driver
-- Multiplexes CUDA contexts from multiple processes onto single GPU
-- Transparent to applications - no code changes needed
-- Requires explicit enablement via job scheduler flags
+### Deprecated: the `llm` marker
 
-**Alternative Approaches Tried** (documented in `GPU_PARENT_TRAP_SOLUTION.md`):
-- ❌ `torch.cuda.empty_cache()` - Only affects PyTorch allocator, not driver context
-- ❌ `cudaDeviceReset()` in subprocesses - Parent still holds context
-- ❌ Inter-example delays - Doesn't release parent context
-- ❌ pynvml polling - Can't force parent to release context
-- ✅ MPS - Allows GPU sharing without code changes
+`llm` is a legacy alias for `e2e`. It remains registered for backwards
+compatibility but must not be used in new tests. The conftest auto-apply hook
+treats `llm` the same as `e2e`.
 
-## Test Markers
+## Backend markers
 
-See [`MARKERS_GUIDE.md`](MARKERS_GUIDE.md) for complete marker documentation.
+Backend markers identify which backend a test needs. They enable selective test
+runs (`pytest -m ollama`) and drive auto-skip logic. **Only apply to `e2e` and
+`qualitative` tests.**
 
-Key markers for GPU testing:
-- `@pytest.mark.vllm` - Requires vLLM backend (local, GPU required)
-- `@pytest.mark.huggingface` - Requires HuggingFace backend (local, GPU-heavy)
-- `@pytest.mark.slow` - Tests taking >5 minutes
+| Marker | Backend | Resources |
+|--------|---------|-----------|
+| `ollama` | Ollama (port 11434) | Local, light (~2–4 GB RAM) |
+| `openai` | OpenAI API or any OpenAI-compatible endpoint | API calls (may use Ollama `/v1`) |
+| `watsonx` | IBM Watsonx API | API calls, requires credentials |
+| `huggingface` | HuggingFace transformers | Local, GPU required (VRAM varies) |
+| `vllm` | vLLM | Local, GPU required |
+| `litellm` | LiteLLM (wraps other backends) | Depends on underlying backend |
+| `bedrock` | AWS Bedrock | API calls, requires credentials |
 
-### Resource gating (predicates)
+### OpenAI-via-Ollama pattern
 
-Use predicate functions from `test/predicates.py` for resource gating:
+Some tests use the OpenAI client pointed at Ollama's `/v1` endpoint. Mark these
+with **both** `openai` and `ollama`, but do **not** add `require_api_key`:
 
 ```python
-from test.predicates import require_gpu, require_ram
+pytestmark = [pytest.mark.e2e, pytest.mark.openai, pytest.mark.ollama]
+```
+
+## Resource predicates
+
+Fine-grained resource gating uses predicate decorators from `test/predicates.py`.
+They compose with `pytestmark` and produce self-documenting skip reasons:
 
-pytestmark = [pytest.mark.e2e, pytest.mark.huggingface, require_gpu(), require_ram(min_gb=48)]
+```python
+from test.predicates import require_gpu, require_api_key
 ```
 
 | Predicate | Use when test needs |
-| --------- | ------------------- |
+|-----------|---------------------|
 | `require_gpu()` | Any GPU (CUDA or MPS) |
 | `require_gpu(min_vram_gb=N)` | GPU with at least N GB VRAM |
-| `require_ram(min_gb=N)` | N GB+ system RAM |
+| `require_ram(min_gb=N)` | N GB+ system RAM (genuinely RAM-bound tests only) |
 | `require_api_key("ENV_VAR")` | Specific API credentials |
+| `require_package("pkg")` | Optional dependency |
+| `require_python((3, 11))` | Minimum Python version |
+
+**Typical combinations:**
+
+- `huggingface` → `require_gpu(min_vram_gb=N)` (compute N from model parameters)
+- `vllm` → `require_gpu(min_vram_gb=N)` (compute N from model parameters)
+- `watsonx` → `require_api_key("WATSONX_API_KEY", "WATSONX_URL", "WATSONX_PROJECT_ID")`
+- `openai` → `require_api_key("OPENAI_API_KEY")` only for real OpenAI, not Ollama-compat
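+
+One rough way to "compute N from model parameters" is sketched below. The
+helper name, the 2 bytes per parameter (16-bit weights), and the 20% headroom
+factor are all assumptions for illustration, not project rules; adjust them to
+the precision and context length the test actually uses.
+
+```python
+import math
+
+
+def estimate_min_vram_gb(num_params, bytes_per_param=2, overhead=1.2):
+    """Rough VRAM estimate: weight bytes times a headroom factor, rounded up."""
+    weight_gb = num_params * bytes_per_param / 1e9
+    return math.ceil(weight_gb * overhead)
+
+
+# An 8-billion-parameter model in 16-bit precision: 16 GB of weights plus headroom.
+print(estimate_min_vram_gb(8e9))  # prints 20
+```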
+
+**Other gating markers** (not resource predicates, but still control selection):
+
+| Marker | Gate | Auto-skip when |
+|--------|------|----------------|
+| `slow` | Tests taking >1 minute | Excluded by default via `pyproject.toml` `addopts` |
+| `qualitative` | Non-deterministic output | Skipped when `CICD=1` |
+
+**Removed markers:** `requires_gpu`, `requires_heavy_ram`, and
+`requires_gpu_isolation` no longer exist. Use `require_gpu(min_vram_gb=N)`
+from `test.predicates` instead.
+
+## Auto-detection
+
+The test suite automatically detects system capabilities and skips tests whose
+requirements are not met. No manual configuration needed.
+
+| Capability | How detected |
+|------------|--------------|
+| Ollama | Port 11434 check at collection time |
+| GPU / VRAM | `torch` + `sysctl hw.memsize` |
+| API keys | Environment variable check |
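+
+A port check of the kind listed for Ollama can be sketched with the standard
+library alone. The function name, host, and timeout below are assumptions; the
+suite's real detection code may differ.
+
+```python
+import socket
+
+
+def port_is_open(host="127.0.0.1", port=11434, timeout=0.5):
+    """Return True if a TCP connection to host:port succeeds within timeout."""
+    try:
+        with socket.create_connection((host, port), timeout=timeout):
+            return True
+    except OSError:  # refused, unreachable, or timed out
+        return False
+```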
+
+Run `pytest -rs` to see skip reasons for each skipped test.
+
+## Common patterns
+
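+```python
+# Each snippet below stands for a separate test module: `pytestmark` applies
+# to the whole module it appears in.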
+# Unit — no markers needed (auto-applied by conftest)
+def test_cblock_repr():
+    assert str(CBlock(value="hi")) == "hi"
+
+# Integration — mocked backend, real project components
+@pytest.mark.integration
+def test_session_with_mock(mock_backend):
+    session = start_session(backend=mock_backend)
+    result = session.instruct("hello")
+    assert mock_backend.generate.called
+
+# E2E — real Ollama backend, deterministic assertion
+pytestmark = [pytest.mark.e2e, pytest.mark.ollama]
+
+def test_structured_output(session):
+    result = session.format(Person, "Make up a person")
+    assert isinstance(json.loads(result.value), dict)
+
+# Qualitative — real backend, non-deterministic content check
+pytestmark = [pytest.mark.e2e, pytest.mark.ollama]
+
+@pytest.mark.qualitative
+def test_greeting_content(session):
+    result = session.instruct("Write a greeting")
+    assert "hello" in result.value.lower()
+
+# Heavy GPU e2e — resource predicate for gating
+from test.predicates import require_gpu
+
+pytestmark = [pytest.mark.e2e, pytest.mark.huggingface, require_gpu(min_vram_gb=20)]
+```
+
+## Authoring guide
+
+### Naming and structure
+
+- File: `test_<module>.py` in a directory mirroring the source (e.g.
+  `test/backends/test_ollama.py` for `mellea/backends/ollama.py`).
+- Files must be named `test_*.py` so that pydocstyle ignores them.
+- Function: `test_<subject>_<scenario>_<expected>`, written so the name reads
+  as a sentence.
+- One behavioural claim per test. If a test has `and` in the name, split it.
+
+### Fixture discipline
+
+The global `test/conftest.py` provides:
+
+| Fixture | Scope | Use |
+|---------|-------|-----|
+| `gh_run` | session | Returns `1` when `CICD=1` is set; use for CI-conditional behaviour |
+| `system_capabilities` | session | Detected hardware/service capabilities (GPU, Ollama, API keys) |
+
+Backend-specific fixtures (e.g. a pre-configured `session` against
+`granite4:micro`, or a `mock_backend` for unit/integration tests) are defined
+per test module or per-directory conftest — check the test files closest to
+what you're adding before creating new fixtures.
+
+Rules:
+
+- Reuse existing fixtures before creating new ones.
+- Do not create session-scoped fixtures that depend on real backends — they
+  prevent test isolation and make skip logic unreliable.
+- For Ollama-backed tests, the conftest evicts models between test modules
+  automatically. Do not add `keep_alive` management in individual tests.
+
+### Mock discipline
+
+- Do not mock what you can replace with a real test double.
+- Do not mock internal project components unless the test is explicitly testing
+  the boundary *around* that component.
+- When you must mock a backend for a unit or integration test, mock at the
+  backend's public method boundary (`generate_from_chat_context`,
+  `generate_from_raw`), not by patching internal Mellea classes.
+
+### Assertions
+
+- Assert one observable outcome per test.
+- Prefer specific assertions (`isinstance(result.value, str)`) over broad ones
+  (`result is not None`).
+- Do not assert on `repr()` strings — they break on whitespace changes.
+
+### Slow tests
 
-> **Deprecated:** The markers `requires_gpu`, `requires_heavy_ram`, `requires_api_key`,
-> and `requires_gpu_isolation` are deprecated. Existing tests using them still work
-> (conftest auto-skip handles them) but new tests must use predicates. Migrate legacy
-> markers to predicates when touching those files. `require_gpu_isolation()` has been
-> removed — use `--group-by-backend` for backend grouping instead.
+Mark any test taking more than one minute with `@pytest.mark.slow`. Slow tests
+are excluded from the default `pytest` invocation and from CI. Run them
+explicitly with `pytest -m slow`.
+
+## Running tests
+
+### Environment variables
+
+| Variable | Effect |
+|----------|--------|
+| `CICD=1` | Skips qualitative tests (mirrors CI behaviour) |
+| `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` | Helps with GPU memory fragmentation |
+
+### Common command lines
+
+```bash
+# Fast loop — unit + integration + e2e, no qualitative (~2 min)
+uv run pytest -m "not qualitative"
+
+# Default — all tiers including qualitative, skips slow
+uv run pytest
+
+# Slow tests only
+uv run pytest -m slow
+
+# Single backend
+uv run pytest -m ollama
+uv run pytest -m "e2e and ollama and not qualitative"
+
+# Specific file or test
+uv run pytest test/backends/test_ollama.py
+uv run pytest test/backends/test_ollama.py::test_structured_output_returns_valid_json
+
+# See why tests were skipped
+uv run pytest -rs
+
+# CI mode locally (mirrors what PR CI does)
+CICD=1 uv run pytest test
+
+# Nightly-style local run on a GPU host
+./test/scripts/run_tests_with_ollama_and_vllm.sh --group-by-backend -v -s
+```
+
+### Scoping a test run
+
+A pytest run can be scoped along four independent axes; combine them as needed.
+
+| Axis | Flag / form | Examples |
+|------|-------------|---------|
+| **By tier** | `-m <marker>` | `-m unit`, `-m integration`, `-m e2e`, `-m qualitative`, `-m slow` |
+| **By backend** | `-m <backend>` | `-m ollama`, `-m huggingface`, `-m "openai or watsonx"` |
+| **By compound expression** | `-m "<expr>"` | `-m "e2e and ollama and not qualitative"` |
+| **By path / node id** | positional | `pytest test/backends/test_ollama.py`, `pytest test/foo.py::test_bar` |
+
+The `addopts` in `pyproject.toml` adds `-m "not slow"` to every invocation, so
+slow tests are always excluded unless you pass `-m slow` yourself. Qualitative
+tests run by default locally and are skipped only when `CICD=1` is set.
+
+### Auto-skip behaviour
+
+Tests skip automatically when requirements are not met:
+
+- **Ollama tests** skip at *collection time* if port 11434 is not reachable,
+  so fixtures never attempt setup (and error out) before the skip is decided.
+- **GPU/HuggingFace/vLLM tests** skip if no GPU is detected or VRAM is below
+  the test's requirement.
+- **Cloud API tests** skip if required environment variables are unset.
+
+## CI pipeline
+
+| Tier | Trigger | Where | What runs |
+|------|---------|-------|-----------|
+| **Pre-commit** | Every commit (local) | Local hook | ruff, mypy, uv-lock, codespell, markdownlint |
+| **PR CI** | Every push / merge group | GitHub Actions, Ubuntu | `pytest test/` on Python 3.11/3.12/3.13 with Ollama. `CICD=1` (qualitative skipped). `slow` excluded. |
+| **Nightly** | Scheduled | IBM internal LSF cluster (GPU) | Full `pytest test/ --group-by-backend`, Ollama + vLLM, qualitative enabled. Failures file an auto-issue. |
+| **On-demand nightly** | Not yet available | IBM internal LSF cluster | Comment-triggered nightly against a PR branch. Tracked in [#734](https://github.com/generative-computing/mellea/issues/734); ask a maintainer if you need pre-merge GPU validation today. |
+
+**PR CI** (`ci.yml` → `quality.yml`): runs the pre-commit checks, installs
+Ollama, pulls `granite4:micro` and `granite4:micro-h`, then runs
+`uv run -m pytest -v --junit-xml=... test`. `docs/examples/` is not collected
+in PR CI.
+
+**Nightly** (`test/scripts/run_tests_with_ollama_and_vllm.sh`): starts local
+Ollama and (when GPU present) a local vLLM server, then runs
+`pytest test/ --group-by-backend`. The `--group-by-backend` flag reorders tests
+to run each backend as a contiguous group, reducing GPU memory fragmentation.
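+
+The reordering idea can be illustrated with a stable sort. This is a sketch
+under assumptions: the `BACKEND_ORDER` list, the helper names, and the
+`(test_id, marker_names)` input shape are hypothetical, and the real flag is
+implemented in `test/conftest.py`.
+
+```python
+# Assumed backend priority; tests without a backend marker sort last.
+BACKEND_ORDER = ["ollama", "openai", "watsonx", "huggingface", "vllm"]
+
+
+def backend_rank(marker_names):
+    """Rank a test by the first backend in BACKEND_ORDER that it is marked with."""
+    for rank, backend in enumerate(BACKEND_ORDER):
+        if backend in marker_names:
+            return rank
+    return len(BACKEND_ORDER)
+
+
+def group_by_backend(tests):
+    """Sort (test_id, marker_names) pairs so each backend runs contiguously.
+
+    sorted() is stable, so tests keep their original order within a group.
+    """
+    return sorted(tests, key=lambda test: backend_rank(test[1]))
+```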
 
 ## Coverage
 
-Coverage reports are generated in `htmlcov/` and `coverage.json`.
+Branch coverage is enabled and runs automatically with every test invocation.
+Reports are written to `htmlcov/` and `coverage.json`.
+
+```bash
+uv run pytest
+open htmlcov/index.html        # macOS
+xdg-open htmlcov/index.html    # Linux
+```
+
+Coverage is measured over `mellea/` and `cli/`. Test files and `docs/` are
+excluded. There is no enforced minimum threshold; use coverage locally to
+identify untested paths. Artifact upload and trend reporting are an open gap
+([#737](https://github.com/generative-computing/mellea/issues/737)).
+
+## Examples as tests
+
+Files in `docs/examples/` are not auto-collected. A file is only executed by
+pytest if it has an opt-in comment near the top:
+
+```python
+# pytest: e2e, ollama, qualitative
+"""Greeting example — demonstrates session.instruct()."""
+```
+
+The comment lists comma-separated marker names (not `-m` expression syntax —
+no `and`/`or`/`not`). Files without this comment are silently ignored and do
+not appear in skip summaries or collection output.
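+
+Parsing such a comment might look like the sketch below. It is not the
+project's `_extract_markers_from_file`; the function name, the regular
+expression, and the 10-line window standing in for "near the top" are
+assumptions.
+
+```python
+import re
+
+_PYTEST_COMMENT = re.compile(r"^#\s*pytest:\s*(.+)$")
+
+
+def parse_pytest_comment(source, max_lines=10):
+    """Return the marker names from a `# pytest:` comment, or None if absent."""
+    for line in source.splitlines()[:max_lines]:
+        match = _PYTEST_COMMENT.match(line.strip())
+        if match:
+            names = match.group(1).split(",")
+            return [name.strip() for name in names if name.strip()]
+    return None  # no opt-in comment: the file would not be collected
+```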
+
+The same classification rules and marker conventions apply as for `test/`
+files. Only add the `# pytest:` comment when the example has the necessary
+dependencies documented and should be part of the regression suite.
+
+Parser: `docs/examples/conftest.py` (`_extract_markers_from_file`).
+
+## Ollama model eviction
+
+When pytest orchestrates many Ollama-backed tests in sequence, the default 5-minute
+keep-alive means models from earlier tests stay resident and accumulate, eventually
+starving later tests of memory.
+
+Two mechanisms in `test/conftest.py` handle this:
+
+- **Per-module eviction** (`pytest_runtest_teardown`) — when crossing a file
+  boundary between Ollama-marked tests, queries `/api/ps` for all loaded models
+  and evicts them with `keep_alive=0`. Covers both `test/` and `docs/examples/`.
+  Always active, no flags required.
+- **Group warm-up/eviction** (`pytest_runtest_setup`) — warms up a fixed set of CI
+  models (`keep_alive=-1`) when entering the Ollama backend group and evicts them
+  when leaving. Requires `--group-by-backend`.
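+
+The per-module eviction step can be sketched with the standard library. The
+function names and the default URL are assumptions, and the conftest's own
+implementation may differ; the sketch relies on Ollama's documented behaviour
+that a request with `keep_alive` set to `0` unloads the model.
+
+```python
+import json
+import urllib.request
+
+OLLAMA_URL = "http://localhost:11434"  # assumed default local address
+
+
+def eviction_payloads(ps_response):
+    """Build one unload request body per model listed by `/api/ps`."""
+    return [
+        {"model": entry["name"], "keep_alive": 0}
+        for entry in ps_response.get("models", [])
+    ]
+
+
+def evict_all_models(base_url=OLLAMA_URL, timeout=5):
+    """Ask Ollama which models are loaded, then unload each one."""
+    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=timeout) as resp:
+        loaded = json.load(resp)
+    for payload in eviction_payloads(loaded):
+        request = urllib.request.Request(  # data is set, so this is a POST
+            f"{base_url}/api/generate",
+            data=json.dumps(payload).encode(),
+            headers={"Content-Type": "application/json"},
+        )
+        urllib.request.urlopen(request, timeout=timeout).close()
+```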
+
+**Trade-off:** if two consecutive test files use the same model, it will be unloaded
+and reloaded (~5–15 s overhead). Predictable memory behaviour is more important
+than saving a reload, especially on constrained CI runners. Tests within a single
+file share the loaded model with no overhead.
+
+**Caveat:** eviction targets *all* loaded Ollama models, not just those loaded by
+the test. If you are using Ollama interactively while the suite runs, your model
+will be evicted between test modules.
+
+## GPU testing on CUDA systems
+
+### The problem: CUDA EXCLUSIVE_PROCESS mode
+
+When running GPU tests on systems with `EXCLUSIVE_PROCESS` mode (common on HPC
+clusters), you may encounter "CUDA device busy" errors. This happens because:
+
+1. The pytest parent process creates a CUDA context when running regular tests.
+2. Example tests run in subprocesses (via `docs/examples/conftest.py`).
+3. In `EXCLUSIVE_PROCESS` mode, only one process can hold a CUDA context per GPU.
+4. Subprocesses fail with "CUDA device busy" when the parent still holds the context.
+
+### Solution 1: NVIDIA MPS (recommended)
+
+**NVIDIA Multi-Process Service (MPS)** allows multiple processes to share a GPU
+in `EXCLUSIVE_PROCESS` mode. Enable it via your job scheduler configuration;
+consult your HPC documentation for specific syntax.
+
+### Solution 2: run smaller test subsets
+
+If MPS is unavailable, run `test/` and `docs/examples/` in separate invocations:
+
+```bash
+pytest -m huggingface test/
+pytest -m huggingface docs/examples/
+```
+
+If conflicts persist, keep splitting the run into smaller subsets.
+
+### Why this matters
+
+The test infrastructure runs examples in subprocesses to isolate execution and
+capture stdout/stderr cleanly, but this creates the "parent trap": the parent
+pytest process holds a CUDA context from running regular tests, blocking
+subprocesses from accessing the GPU.
+
+**Approaches that do not work:** `torch.cuda.empty_cache()` (only affects the
+PyTorch allocator, not the driver context), `cudaDeviceReset()` in subprocesses
+(the parent still holds the context), and inter-example delays (they do not
+release the parent's context).