diff --git a/.agents/skills/audit-markers/SKILL.md b/.agents/skills/audit-markers/SKILL.md index 558e8c9a9..e71ff1607 100644 --- a/.agents/skills/audit-markers/SKILL.md +++ b/.agents/skills/audit-markers/SKILL.md @@ -69,7 +69,8 @@ test actually loads, and report whether the gate is correctly set or too loose. Read these before auditing — they are the authoritative source for marker conventions: -- **Marker guide:** `test/MARKERS_GUIDE.md` +- **Test strategy:** `test/README.md` — classification decision rules, per-tier definitions, philosophy, authoring guide, CI pipeline +- **Marker guide:** `test/README.md` — marker tables, common patterns, backend reference (sections: Backend markers, Resource predicates, Common patterns) - **Marker registration:** `test/conftest.py` (`pytest_configure`) and `pyproject.toml` (`[tool.pytest.ini_options]`) - **Resource predicates:** `test/predicates.py` (predicate functions for resource gating) - **Example marker format:** `docs/examples/conftest.py` (`_extract_markers_from_file`) @@ -362,7 +363,7 @@ When asking, present the test code and your reasoning so the user can make an in # Part 2: Project-Specific Rules -Read `test/MARKERS_GUIDE.md` for the full marker reference (marker tables, +Read `test/README.md` for the full marker reference (marker tables, resource gates, auto-skip logic, common patterns). This section covers only the **code analysis heuristics** the skill needs to classify tests — things that require reading the test source code rather than looking up a table. @@ -848,18 +849,18 @@ in the `pytestmark` list or decorator. Check that every backend used in test files has a registered marker. The project's backend registry is `BACKEND_MARKERS` in `test/conftest.py` (single source of truth). Markers must also appear in `pyproject.toml` -`[tool.pytest.ini_options].markers` and in `test/MARKERS_GUIDE.md`. +`[tool.pytest.ini_options].markers` and in `test/README.md`. 
For each backend constructor or `start_session(backend_name=...)` call found during classification, verify: 1. A marker exists in `BACKEND_MARKERS` for that backend. 2. The marker appears in `pyproject.toml`. -3. The marker appears in the MARKERS_GUIDE.md backend table. +3. The marker appears in the test/README.md backend table. If a backend is used in tests but has no registered marker, flag it as a **missing backend marker** issue and add it to the registry, pyproject.toml, -and MARKERS_GUIDE.md (same apply/confirm rules as other fixes in Step 4). +and test/README.md (same apply/confirm rules as other fixes in Step 4). ## Step 6 — Flag infrastructure notes @@ -896,7 +897,7 @@ flag as a blocker, don't silently re-add: - **Backend marker registry:** `BACKEND_MARKERS` dict in `test/conftest.py` is the single source of truth for backend markers. `pytest_configure` iterates over it. New backends are added by inserting one entry into the dict. - `pyproject.toml` and `test/MARKERS_GUIDE.md` must stay in sync manually. + `pyproject.toml` and `test/README.md` must stay in sync manually. - **Resource predicates:** `test/predicates.py` provides `require_gpu`, `require_ram`, `require_api_key`, `require_package`, `require_python`. diff --git a/.agents/skills/skill-author/SKILL.md b/.agents/skills/skill-author/SKILL.md index ca1e5ca8b..b305bbdae 100644 --- a/.agents/skills/skill-author/SKILL.md +++ b/.agents/skills/skill-author/SKILL.md @@ -134,7 +134,7 @@ After frontmatter, write clear markdown instructions the agent follows: "Check if markers are correct" is worse than "Compare the test's assertions to the qualitative decision rule in section 3." - **Reference project files.** Point to docs, configs, and examples by relative - path so the agent can read them. E.g. "See `test/MARKERS_GUIDE.md` for the + path so the agent can read them. E.g. "See `test/README.md` for the full marker taxonomy." - **Declare scope boundaries.** State what the skill does NOT do. E.g. 
"This skill does not modify conftest.py — flag infrastructure issues as notes." diff --git a/AGENTS.md b/AGENTS.md index ff27fe273..493dd472b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -49,7 +49,7 @@ uv run mypy . # Type check ## 3. Test Markers Tests use a four-tier granularity system (`unit`, `integration`, `e2e`, `qualitative`) plus backend and resource markers. The `unit` marker is auto-applied by conftest — never write it explicitly. The `llm` marker is deprecated; use `e2e` instead. -See **[test/MARKERS_GUIDE.md](test/MARKERS_GUIDE.md)** for the full marker reference (tier definitions, backend markers, resource gates, auto-skip logic, common patterns). +See **[test/README.md](test/README.md)** for classification rules, authoring guide, marker reference, CI tier map, and local workflow. **Examples in `docs/examples/`** are opt-in — unlike `test/` files (auto-collected, default `unit`), examples require an explicit `# pytest:` comment to be collected. Files without this comment are silently ignored (they won't appear in skip summaries either). This is because examples have variable dependencies and limited setup: ```python @@ -115,6 +115,8 @@ Pre-commit runs: ruff, mypy, uv-lock, codespell ## 10. Writing Tests +See **[test/README.md — Authoring guide](test/README.md#authoring-guide)** for the full authoring guide (naming, fixture discipline, mock discipline, assertion style). + - Place tests in `test/` mirroring source structure - Name files `test_*.py` (required for pydocstyle) - Use `gh_run` fixture for CI-aware tests (see `test/conftest.py`) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 7deee4c6c..c726c660e 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -320,6 +320,9 @@ as it can corrupt state. ### Quick Reference +See [test/README.md](test/README.md) for classification rules, authoring guide, +CI tier map, coverage, and the full local workflow reference. 
Essential commands: + ```bash # Install all dependencies (required for tests) uv sync --all-extras --all-groups @@ -327,25 +330,12 @@ uv sync --all-extras --all-groups # Start Ollama (required for most tests) ollama serve -# Default: qualitative tests, skip slow tests +# Default: includes qualitative tests, skips slow tests uv run pytest # Fast tests only (no qualitative, ~2 min) uv run pytest -m "not qualitative" -# Unit tests only (self-contained, no services) -uv run pytest -m unit - -# Run only slow tests (>1 min) -uv run pytest -m slow - -# Run specific backend tests -uv run pytest -m "ollama" -uv run pytest -m "openai" - -# CI/CD mode (skips qualitative tests) -CICD=1 uv run pytest - # Lint and format uv run ruff format . uv run ruff check . @@ -395,23 +385,26 @@ for m in granite4:micro granite4:micro-h deepseek-r1:8b \ ### Test Markers -Tests use a four-tier granularity system (`unit`, `integration`, `e2e`, `qualitative`) plus backend and resource markers. See [test/MARKERS_GUIDE.md](test/MARKERS_GUIDE.md) for the full marker reference, including tier definitions, backend markers, resource gates, and auto-skip logic. +Tests use a four-tier granularity system (`unit`, `integration`, `e2e`, `qualitative`) plus backend and resource markers. See [test/README.md](test/README.md) for the full guide: classification rules, marker reference, authoring guide, CI tiers, and auto-skip logic. ### CI/CD Tests CI runs the following checks on every pull request: -1. **Pre-commit hooks** (`pre-commit run --all-files`) - Ruff, mypy, uv-lock, codespell -2. **Test suite** (`CICD=1 uv run pytest`) - Skips qualitative tests for speed +1. **Pre-commit hooks** (`pre-commit run --all-files`) — ruff, mypy, uv-lock, codespell +2. 
**Test suite** — `CICD=1 uv run pytest test` on Python 3.11/3.12/3.13 with Ollama running; skips qualitative tests To replicate CI locally: ```bash -# Run pre-commit checks (same as CI) +# Pre-commit checks (same as CI) pre-commit run --all-files -# Run tests with CICD flag (same as CI, skips qualitative tests) -CICD=1 uv run pytest +# Tests with CICD flag (skips qualitative, matches CI scope) +CICD=1 uv run pytest test ``` +See [test/README.md — CI pipeline](test/README.md#ci-pipeline) for the full CI +breakdown and planned nightly/pre-release tiers. + ### Timing Expectations - Fast tests (`-m "not qualitative"`): ~2 minutes @@ -448,7 +441,7 @@ print(m.last_prompt()) ### Getting Help -- Check this guide and [test/MARKERS_GUIDE.md](test/MARKERS_GUIDE.md) +- Check this guide and [test/README.md](test/README.md) - Search [existing issues](https://github.com/generative-computing/mellea/issues) - Check out [Github Discussions](https://github.com/generative-computing/mellea/discussions) - Open a new issue with the appropriate label @@ -459,7 +452,7 @@ print(m.last_prompt()) - **[Docs writing guide](docs/docs/guide/CONTRIBUTING.md)** - Conventions, PR checklist, and review process for documentation contributions - **[API Documentation](https://docs.mellea.ai)** - Published documentation site -- **[Test Markers Guide](test/MARKERS_GUIDE.md)** - Detailed pytest marker documentation +- **[Test Guide](test/README.md)** - Test strategy, classification, markers, and authoring guide - **[AGENTS.md](AGENTS.md)** - Guidelines for AI assistants working on Mellea internals - **[AGENTS_TEMPLATE.md](docs/AGENTS_TEMPLATE.md)** - Template for projects using Mellea diff --git a/docs/docs/community/contributing-guide.md b/docs/docs/community/contributing-guide.md index bb583292c..3330db249 100644 --- a/docs/docs/community/contributing-guide.md +++ b/docs/docs/community/contributing-guide.md @@ -196,49 +196,13 @@ git commit -n -m "wip: intermediate work" ## Testing -### Test markers +The 
full test guide lives in +[`test/README.md`](https://github.com/generative-computing/mellea/blob/main/test/README.md) +in the repository — classification rules, marker reference (tier definitions, +backend matrix, resource predicates), authoring guide, CI pipeline, and +GPU/Ollama operational notes. -Tests use a four-tier granularity system. Every test belongs to exactly one tier: - -| Tier | When to use | How to apply | -| ---- | ----------- | ------------ | -| `unit` | Self-contained, no services, no I/O | Auto-applied — never write `@pytest.mark.unit` | -| `integration` | Real SDK/library boundary or multi-component wiring | `@pytest.mark.integration` | -| `e2e` | Real backends (Ollama, APIs, GPU models), deterministic assertions | `@pytest.mark.e2e` + backend marker(s) | -| `qualitative` | Subset of e2e with non-deterministic output assertions | `@pytest.mark.qualitative` per-function, `e2e` + backend at module level | - -**Backend markers** (only for e2e/qualitative tests): - -| Marker | Backend | Resources | -| ------ | ------- | --------- | -| `ollama` | Ollama (port 11434) | Local, light (~2–4 GB RAM) | -| `openai` | OpenAI API or compatible | API calls (may use Ollama `/v1`) | -| `watsonx` | Watsonx API | API calls, requires credentials | -| `huggingface` | HuggingFace transformers | Local, GPU required | -| `litellm` | LiteLLM (wraps other backends) | Depends on underlying backend | -| `bedrock` | AWS Bedrock | API calls, requires credentials | - -**Resource predicates** (from `test/predicates.py`, for e2e/qualitative tests): - -| Predicate | Use when test needs | -| --------- | ------------------- | -| `require_gpu()` | Any GPU (CUDA or MPS) | -| `require_gpu(min_vram_gb=N)` | GPU with at least N GB VRAM | -| `require_ram(min_gb=N)` | N GB+ system RAM | -| `require_api_key("ENV_VAR")` | Specific API credentials | -| `require_package("pkg")` | Optional dependency | -| `require_python((3, 11))` | Minimum Python version | - -**Other markers:** - -| Marker | 
Purpose | -| ------ | ------- | -| `slow` | Tests taking >1 minute (excluded by default) | -| `qualitative` | Non-deterministic output (skipped when `CICD=1`) | - -For more information, see our [Markers Guide](https://github.com/generative-computing/mellea/blob/main/test/MARKERS_GUIDE.md). - -### Running tests +Essential commands: ```bash # Install all dependencies (required for tests) @@ -247,43 +211,15 @@ uv sync --all-extras --all-groups # Start Ollama (required for most tests) ollama serve -# Default: runs qualitative tests, skips slow tests -uv run pytest - # Fast tests only (no qualitative, ~2 min) uv run pytest -m "not qualitative" -# Run only slow tests (>5 min) -uv run pytest -m slow - -# Run specific backend tests -uv run pytest -m "ollama" -uv run pytest -m "openai" - -# Run unit tests only (no backends needed) -uv run pytest -m unit - -# CI/CD mode (skips qualitative tests) -CICD=1 uv run pytest -``` - -### Timing expectations - -| Run | Duration | -| --- | -------- | -| Fast tests (`-m "not qualitative"`) | ~2 minutes | -| Default (qualitative, no slow) | Several minutes | -| Slow tests (`-m slow`) | More than 1 minute | -| Pre-commit hooks | 1–5 minutes | - -### Replicate CI locally +# Default: includes qualitative, skips slow +uv run pytest -```bash -# Run pre-commit checks (same as CI) +# Replicate CI locally pre-commit run --all-files - -# Run tests with CICD flag (same as CI, skips qualitative tests) -CICD=1 uv run pytest +CICD=1 uv run pytest test ``` ## Pull request process diff --git a/docs/docs/how-to/unit-test-generative-code.md b/docs/docs/how-to/unit-test-generative-code.md index 0c62cd271..18e25dd6c 100644 --- a/docs/docs/how-to/unit-test-generative-code.md +++ b/docs/docs/how-to/unit-test-generative-code.md @@ -7,7 +7,8 @@ description: "Write reliable tests for @generative functions using pytest marker **Prerequisites:** [Quick Start](../getting-started/quickstart) complete, `pip install mellea`, Ollama running locally, `pytest` 
installed. -> **Contributing to Mellea itself?** See the [Contributing Guide](../community/contributing-guide#testing) +> **Contributing to Mellea itself?** See +> [`test/README.md`](https://github.com/generative-computing/mellea/blob/main/test/README.md) > for Mellea's own test markers, fixtures, and CI setup. Testing generative code requires you to separate concerns: some assertions are diff --git a/test/MARKERS_GUIDE.md b/test/MARKERS_GUIDE.md deleted file mode 100644 index 8f0ad3b50..000000000 --- a/test/MARKERS_GUIDE.md +++ /dev/null @@ -1,322 +0,0 @@ -# Pytest Markers Guide - -## Quick Reference - -```bash -# By granularity tier -pytest -m unit # Self-contained, no services (fast) -pytest -m integration # Real SDK/library boundary or multi-component wiring -pytest -m e2e # Real backends (ollama, APIs, GPU models) -pytest -m "e2e and not qualitative" # Deterministic real-backend tests only - -# By backend -pytest -m ollama # Ollama tests -pytest -m huggingface # HuggingFace tests -pytest -m "openai or watsonx" # Cloud API tests - -# By characteristics -pytest -m "not qualitative" # Fast, deterministic tests (~2 min) -pytest -m qualitative # Non-deterministic output quality tests -pytest -m slow # Long-running tests (>1 min) - -# Default (configured in pyproject.toml): skips slow, includes qualitative -pytest -``` - -## Granularity Tiers - -Every test belongs to exactly one tier. The tier determines what infrastructure -the test needs and how fast/heavy it is to run. - -### Unit (auto-applied) - -**Entirely self-contained** — no services, no I/O, no fixtures that connect -to anything external. Pure logic testing. 
- -- Auto-applied by conftest hook when no other granularity marker is present -- **Never write `@pytest.mark.unit` on files** — it is implicit -- Runs in milliseconds to low seconds, minimal memory -- Would pass on any machine with just Python and project deps - -```python -# No markers needed — auto-applied as unit -def test_cblock_repr(): - assert str(CBlock(value="hi")) == "hi" -``` - -### Integration (explicit) - -**Verifies that your code correctly communicates across a real boundary.** -The boundary may be a third-party SDK/library whose API contract you are -asserting against, multiple internal components wired together, or a -fixture-managed local service. What distinguishes integration from unit is -that at least one real external component — not a mock or stub — is on the -other side of the boundary being tested. - -- Add `@pytest.mark.integration` explicitly -- No backend markers needed — integration tests do not use real LLM backends -- Slower than unit (fixture setup, real SDK objects), but faster than e2e - -**Positive indicators:** - -- Uses a real third-party SDK object to *capture and assert* on output — - e.g. `InMemoryMetricReader`, `InMemorySpanExporter`, `LoggingHandler` — - rather than patching the SDK away -- Asserts on the format or content of data as received by an external - component (semantic conventions, attribute names, accumulated values) -- Wires multiple real project components together and mocks only at the - outermost boundary -- Breaking the interface between your code and the external component - (e.g. 
a changed attribute name, a missing SDK call) would cause the test - to fail - -**Negative indicators (likely unit instead):** - -- All external boundaries replaced with `MagicMock`, `patch`, or `AsyncMock` -- Third-party library imported only as a type or helper, not as a real - collaborator being asserted against -- Toggles env vars and checks booleans or config state with no real SDK - objects instantiated - -**Tie-breaker:** If you changed the contract between your code and the -external component, would this test catch it? If yes → integration. If no -→ unit. - -```python -@pytest.mark.integration -def test_token_metrics_format(clean_metrics_env): - # Real InMemoryMetricReader — asserting against the OTel SDK contract - reader = InMemoryMetricReader() - provider = MeterProvider(metric_readers=[reader]) - record_token_usage_metrics(input_tokens=10, output_tokens=5, ...) - metrics_data = reader.get_metrics_data() - assert metrics_data.resource_metrics[0]...name == "mellea.llm.tokens.input" - -@pytest.mark.integration -def test_session_chains_components(mock_backend): - # Multiple real project components wired together; only LLM call mocked - session = start_session(backend=mock_backend) - result = session.instruct("hello") - assert mock_backend.generate.called -``` - -### E2E (explicit) - -**Tests against real backends** — cloud APIs, local servers (ollama), or -GPU-loaded models (huggingface, vllm). No mocks on the critical path. - -- Add `@pytest.mark.e2e` explicitly, always combined with backend marker(s) -- Resource predicates (`require_gpu()`, `require_ram()`, etc.) 
only apply to - e2e and qualitative tests — see "Resource Gating" section below -- Assertions are **deterministic** — structural, type-based, or functional - -```python -pytestmark = [pytest.mark.e2e, pytest.mark.ollama] - -def test_structured_output(session): - result = session.format(Person, "Make up a person") - assert isinstance(json.loads(result.value), dict) -``` - -### Qualitative (explicit, per-function) - -**Subset of e2e.** Same infrastructure requirements, but assertions check -**non-deterministic output content** that may vary across model versions or runs. - -- Add `@pytest.mark.qualitative` per-function (not module-level) -- Module must also carry `e2e` + backend markers at module level -- Skipped in CI when `CICD=1` -- Included by default in local runs - -```python -pytestmark = [pytest.mark.e2e, pytest.mark.ollama] - -@pytest.mark.qualitative -def test_greeting_content(session): - result = session.instruct("Write a greeting") - assert "hello" in result.value.lower() -``` - -**Decision rule:** If swapping the model version could break the assertion -despite the system working correctly, it is `qualitative`. If the assertion -checks structure, types, or functional correctness, it is `e2e`. - -### The `llm` marker (deprecated) - -`llm` is a legacy marker equivalent to `e2e`. It remains registered for -backward compatibility but should not be used in new tests. Use `e2e` instead. - -The conftest auto-apply hook treats `llm` the same as `e2e` — tests marked -`llm` will not receive the `unit` marker. - -## Backend Markers - -Backend markers identify which backend a test needs. They enable selective -test runs (`pytest -m ollama`) and drive auto-skip logic. - -**Backend markers only go on e2e and qualitative tests.** Unit and integration -tests don't need real backends. 
- -| Marker | Backend | Resources | -| -------------- | ----------------------------- | ------------------------------------- | -| `ollama` | Ollama (port 11434) | Local, light (~2-4GB RAM) | -| `openai` | OpenAI API or compatible | API calls (may use Ollama `/v1`) | -| `watsonx` | Watsonx API | API calls, requires credentials | -| `huggingface` | HuggingFace transformers | Local, GPU required | -| `vllm` | vLLM | Local, GPU required | -| `litellm` | LiteLLM (wraps other backends)| Depends on underlying backend | -| `bedrock` | AWS Bedrock | API calls, requires credentials | - -### OpenAI-via-Ollama pattern - -Some tests use the OpenAI client pointed at Ollama's `/v1` endpoint. Mark -these with **both** `openai` and `ollama`, but do **not** add `require_api_key`: - -```python -pytestmark = [pytest.mark.e2e, pytest.mark.openai, pytest.mark.ollama] -``` - -## Resource Gating (Predicates) - -E2E and qualitative tests need gating so they skip cleanly when required -infrastructure is absent. Use **predicate decorators** from `test/predicates.py` -— they give test authors precise control over skip conditions. 
- -```python -from test.predicates import require_gpu, require_api_key -``` - -| Predicate | Use when test needs | -| --------- | ------------------- | -| `require_gpu()` | Any GPU (CUDA or MPS) | -| `require_gpu(min_vram_gb=N)` | GPU with at least N GB VRAM | -| `require_ram(min_gb=N)` | N GB+ system RAM (genuinely RAM-bound tests only) | -| `require_api_key("ENV_VAR")` | Specific API credentials | -| `require_package("pkg")` | Optional dependency | -| `require_python((3, 11))` | Minimum Python version | - -### Typical combinations - -- `huggingface` → `require_gpu(min_vram_gb=N)` (compute N from model params) -- `vllm` → `require_gpu(min_vram_gb=N)` (compute N from model params) -- `watsonx` → `require_api_key("WATSONX_API_KEY", "WATSONX_URL", "WATSONX_PROJECT_ID")` -- `openai` → `require_api_key("OPENAI_API_KEY")` only for real OpenAI (not Ollama-compat) - -### Other gating markers - -These are not resource predicates but still control test selection: - -| Marker | Gate | Auto-skip when | -| -------------- | -------------------------------- | ------------------------------------------------ | -| `slow` | Tests taking >1 minute | Excluded by default via `pyproject.toml` addopts | -| `qualitative` | Non-deterministic output | Skipped when `CICD=1` | - -### Removed markers - -`requires_gpu`, `requires_heavy_ram`, and `requires_gpu_isolation` have been -removed. Use `require_gpu(min_vram_gb=N)` from `test.predicates` instead. -`requires_api_key` is still active — see below. - -## Auto-Detection - -The test suite automatically detects system capabilities and skips tests -whose requirements are not met. No configuration needed. 
- -| Capability | How detected | -| ---------- | ----------------------------- | -| Ollama | Port 11434 check | -| GPU/VRAM | `torch` + `sysctl hw.memsize` | -| API keys | Environment variable check | - -Use `-rs` with pytest to see skip reasons: - -```bash -pytest -rs -``` - -## Common Marker Patterns - -```python -# Unit — no markers needed (auto-applied by conftest) -def test_cblock_repr(): - assert str(CBlock(value="hi")) == "hi" - -# Integration — mocked backend -@pytest.mark.integration -def test_session_with_mock(mock_backend): - session = start_session(backend=mock_backend) - result = session.instruct("hello") - assert mock_backend.generate.called - -# E2E — real Ollama backend, deterministic -pytestmark = [pytest.mark.e2e, pytest.mark.ollama] -def test_structured_output(session): - result = session.format(Person, "Make up a person") - assert isinstance(json.loads(result.value), dict) - -# E2E + qualitative — real backend, non-deterministic -pytestmark = [pytest.mark.e2e, pytest.mark.ollama] -@pytest.mark.qualitative -def test_greeting_content(session): - result = session.instruct("Write a greeting") - assert "hello" in result.value.lower() - -# Heavy GPU e2e (predicates for resource gating) -from test.predicates import require_gpu - -pytestmark = [pytest.mark.e2e, pytest.mark.huggingface, - require_gpu(min_vram_gb=20)] -``` - -## Example Files (`docs/examples/`) - -Unlike `test/` files (which are auto-collected and default to `unit`), examples -require an explicit `# pytest:` comment to be collected. This opt-in approach -reflects that examples often have variable dependencies and limited setup, so -only files that declare themselves runnable should be executed. - -```python -# pytest: e2e, ollama, qualitative -"""Example description...""" -``` - -Same classification rules apply. The comment must appear in the first few -lines before non-comment code. 
Files without this comment are silently -ignored — they won't appear in skip summaries or collection output. -Parser: `docs/examples/conftest.py` (`_extract_markers_from_file`). - -## Adding Markers to New Tests - -1. **Classify the test** — unit, integration, e2e, or qualitative? -2. **Add granularity marker** — integration and e2e are explicit; unit is auto-applied -3. **Add backend marker(s)** — only for e2e/qualitative -4. **Add resource predicates** — only for e2e/qualitative, use `test/predicates.py` -5. **Verify** — `pytest --collect-only -m "your_marker"` to check - -Use the `/audit-markers` skill to validate markers on existing or new test files. - -## CI/CD Integration - -```yaml -jobs: - unit-tests: - run: pytest -m unit # Fast, no services needed - - ollama-tests: - run: pytest -m "e2e and ollama and not qualitative" - - quality-tests: - if: github.event_name == 'schedule' - run: pytest -m "qualitative and ollama" -``` - -- `CICD=1` skips qualitative tests -- `slow` tests excluded by default (add `-m slow` to include) - -## Related Files - -- `test/conftest.py` — marker registration, auto-detection, skip logic, unit auto-apply hook -- `test/predicates.py` — resource gating predicates (`require_gpu`, `require_ram`, etc.) -- `docs/examples/conftest.py` — example marker parser (`_extract_markers_from_file`) -- `pyproject.toml` — marker definitions and pytest configuration -- `.agents/skills/audit-markers/SKILL.md` — skill for auditing and fixing markers diff --git a/test/README.md b/test/README.md index 4c4c7a148..359957b12 100644 --- a/test/README.md +++ b/test/README.md @@ -1,147 +1,529 @@ # Mellea Test Suite -Test files must be named as `test_*.py` so that pydocstyle ignores them. +This file covers everything you need to contribute tests: strategy, classification, +marker reference, authoring guide, CI pipeline, and operational notes. 
-## Running Tests +## Quick start + +**First-time setup.** Follow the +[Contributing Guide](../docs/docs/community/contributing-guide.md#development-setup) +to install `uv`, sync deps, and install pre-commit hooks. For anything beyond +unit tests you also need Ollama running locally: ```bash -# Fast tests only (~2 min) - skips qualitative and slow tests -uv run pytest -m "not qualitative" +ollama serve & +ollama pull granite4:micro granite4:micro-h +``` -# Default - includes qualitative tests, skips slow tests -uv run pytest +**Running tests during development.** -# All tests including slow tests (>5 min) -uv run pytest -m slow +```bash +uv run pytest -m "not qualitative" # ~2 min fast loop +uv run pytest test/path/to/test_thing.py # focus on one file +uv run pytest -rs # show why anything skipped ``` -## Environment Variables +**Adding a new test.** -- `CICD=1` - Enable CI mode (skips qualitative tests) -- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` - Helps with GPU memory fragmentation +1. **Classify** it (see [Test tiers](#test-tiers)). If it doesn't call a real + backend or external SDK, it's `unit` and needs no marker. +2. **Place** the file at `test//test_.py`. +3. **Add the granularity marker** — `integration`, `e2e`, or `qualitative` are + explicit; `unit` is auto-applied by conftest, never write it. +4. **Add backend marker(s)** — only for `e2e`/`qualitative` tests + (`ollama`, `huggingface`, etc.). See [Backend markers](#backend-markers). +5. **Add resource predicates** — only for `e2e`/`qualitative`, use + `test/predicates.py`. See [Resource predicates](#resource-predicates). +6. **Verify collection** — `uv run pytest --collect-only -m "your_marker"`. +7. **Run it** — `uv run pytest path::test_name -v`. 
-## Ollama Model Eviction +**Before opening a PR.** -When pytest orchestrates many Ollama-backed tests in sequence, the default 5-minute -keep-alive means models from earlier tests stay resident and accumulate, eventually -starving later tests of memory. +```bash +uv run pre-commit run --all-files # what CI runs first +CICD=1 uv run pytest test # what CI runs second +``` -Two mechanisms in `test/conftest.py` handle this: +## Philosophy -- **Per-module eviction** (`pytest_runtest_teardown`) — when crossing a file - boundary between Ollama-marked tests, queries `/api/ps` for all loaded models - and evicts them with `keep_alive=0`. Covers both `test/` and `docs/examples/`. - Always active, no flags required. -- **Group warm-up/eviction** (`pytest_runtest_setup`) — warms up a fixed set of CI - models (`keep_alive=-1`) when entering the Ollama backend group and evicts them - when leaving. Requires `--group-by-backend`. +Mellea tests assert **observable contracts**, not implementation details. -**Trade-off:** if two consecutive test files use the same model, it will be unloaded -and reloaded (~5-15 s overhead). Predictable memory behaviour is more important -than saving a reload, especially on constrained CI runners. Tests within a single -file share the loaded model with no overhead. +- Test the public API surface, not private helpers. +- Test cross-backend behaviour where it needs to be consistent. +- A test that passes while the system is broken has negative value; prefer + fewer, more honest tests over coverage padding. +- When a test fails, fix the **code**. Adjusting an assertion to silence a + failure is almost always wrong; fixing the test is acceptable only if the + test was never correctly written. +- All tiers run locally by default. Qualitative tests (assertions on LLM output + *content*) are the one tier that CI skips, so non-deterministic checks never + block CI. -**Caveat:** eviction targets *all* loaded Ollama models, not just those loaded by -the test. 
If you are using Ollama interactively while the suite runs, your model -will be evicted between test modules. +## Test tiers + +Every test belongs to exactly one granularity tier. Apply the decision rules +below in order: -## GPU Testing on CUDA Systems +| Question | Answer → tier | +|----------|---------------| +| Does it call a real LLM backend or external API? | Yes → **e2e** (or **qualitative**, see below) | +| Does it assert against a real third-party SDK object (OTel reader, metrics collector)? | Yes → **integration** | +| Does it wire multiple real project components together without external I/O? | Yes → **integration** | +| Does everything happen in-process with no real external collaborators? | Yes → **unit** (auto-applied) | -### The Problem: CUDA EXCLUSIVE_PROCESS Mode +### Unit -When running GPU tests on systems with `EXCLUSIVE_PROCESS` mode (common on HPC clusters), you may encounter "CUDA device busy" errors. This happens because: +**Entirely self-contained** — no services, no I/O, no network. Pure logic: +formatters, parsers, schema validation, config loading, pure helper functions. +Runs in milliseconds on any machine. -1. **Parent Process Context**: The pytest parent process creates a CUDA context when running regular tests -2. **Subprocess Blocking**: Example tests run in subprocesses (via `docs/examples/conftest.py`) -3. **Exclusive Access**: In `EXCLUSIVE_PROCESS` mode, only one process can hold a CUDA context per GPU -4. **Result**: Subprocesses fail with "CUDA device busy" when the parent still holds the context +The `unit` marker is **auto-applied by conftest** to every test that has no +other granularity marker. Never write `@pytest.mark.unit` yourself. 
-### Solution 1: NVIDIA MPS (Recommended) +```python +# No markers needed — auto-applied as unit +def test_cblock_repr(): + assert str(CBlock(value="hi")) == "hi" +``` -**NVIDIA Multi-Process Service (MPS)** allows multiple processes to share a GPU in `EXCLUSIVE_PROCESS` mode: +### Integration -```bash -# Enable MPS in your job scheduler configuration -# Consult your HPC documentation for specific syntax +**Verifies that your code correctly communicates across a real boundary.** +The boundary may be a third-party SDK/library whose API contract you are +asserting against, multiple internal components wired together, or a +fixture-managed local service. What distinguishes integration from unit is +that at least one real external component — not a mock or stub — is on the +other side of the boundary being tested. + +Add `@pytest.mark.integration` explicitly; no backend marker is needed. + +**Positive indicators:** + +- Uses a real third-party SDK object to *capture and assert* on output — + e.g. `InMemoryMetricReader`, `InMemorySpanExporter`, `LoggingHandler` — + rather than patching the SDK away +- Asserts on the format or content of data as received by an external + component (semantic conventions, attribute names, accumulated values) +- Wires multiple real project components together and mocks only at the + outermost boundary +- Breaking the interface between your code and the external component + would cause the test to fail + +**Negative indicators (likely unit instead):** + +- All external boundaries replaced with `MagicMock`, `patch`, or `AsyncMock` +- Third-party library imported only as a type or helper, not as a real + collaborator being asserted against + +**Tie-breaker:** if you changed the contract between your code and the external +component, would this test catch it? If yes → integration. If no → unit. 
+ +```python +@pytest.mark.integration +def test_token_metrics_format(clean_metrics_env): + # Real InMemoryMetricReader — asserting against the OTel SDK contract + reader = InMemoryMetricReader() + provider = MeterProvider(metric_readers=[reader]) + record_token_usage_metrics(input_tokens=10, output_tokens=5, ...) + metrics_data = reader.get_metrics_data() + assert metrics_data.resource_metrics[0]...name == "mellea.llm.tokens.input" + +@pytest.mark.integration +def test_session_chains_components(mock_backend): + # Multiple real project components wired together; only LLM call mocked + session = start_session(backend=mock_backend) + result = session.instruct("hello") + assert mock_backend.generate.called ``` -### Solution 2: Run Smaller Test Subsets +### E2E -If MPS is unavailable, break down test execution into smaller subsets to avoid GPU sharing conflicts: +**Tests against real backends** — cloud APIs, local servers (Ollama), or +GPU-loaded models (HuggingFace, vLLM). No mocks on the critical path. -```bash -# Run tests and examples separately -pytest -m huggingface test/ -pytest -m huggingface docs/examples/ +Add `@pytest.mark.e2e` explicitly, always combined with backend marker(s). +Assertions must be **deterministic** — structural, type-based, or functional. +Assertions on generated text content belong in qualitative tests, not e2e. + +```python +pytestmark = [pytest.mark.e2e, pytest.mark.ollama] -# Or run specific test directories -pytest test/backends/test_huggingface.py -pytest docs/examples/safety/ +def test_structured_output_returns_valid_json(session): + result = session.format(Person, "Make up a person") + assert isinstance(json.loads(result.value), dict) ``` -**Note**: If conflicts persist, continue breaking down into smaller subsets until tests pass. The key is reducing the number of concurrent GPU-using processes. 
+### Qualitative -### Why This Matters +**A sub-tier of e2e**: same infrastructure requirements, but assertions check +**non-deterministic output content** that may vary across model versions or runs. -The test infrastructure runs examples in subprocesses (see `docs/examples/conftest.py`) to: -- Isolate example execution environments -- Capture stdout/stderr cleanly -- Prevent cross-contamination between examples +Add `@pytest.mark.qualitative` per-function (not at module level). The module +still needs `e2e` and the backend marker. Qualitative tests are included in the +default local run but skipped in CI (`CICD=1`). -However, this creates the "Parent Trap": the parent pytest process holds a CUDA context from running regular tests, blocking subprocesses from accessing the GPU. +```python +pytestmark = [pytest.mark.e2e, pytest.mark.ollama] -### Technical Details +@pytest.mark.qualitative +def test_greeting_contains_salutation(session): + result = session.instruct("Write a greeting") + assert "hello" in result.value.lower() # content check — qualitative +``` -**CUDA Context Lifecycle**: -- Created on first CUDA operation (e.g., `torch.cuda.is_available()`) -- Persists until process exit or explicit `cudaDeviceReset()` -- In `EXCLUSIVE_PROCESS` mode, blocks other processes from GPU access +**Decision rule:** if swapping the model version could break the assertion +despite the system working correctly, it is `qualitative`. If the assertion +checks structure, types, or functional correctness, it is `e2e`. 
-**MPS Architecture**: -- Runs as a proxy service between applications and GPU driver -- Multiplexes CUDA contexts from multiple processes onto single GPU -- Transparent to applications - no code changes needed -- Requires explicit enablement via job scheduler flags +### Deprecated: the `llm` marker -**Alternative Approaches Tried** (documented in `GPU_PARENT_TRAP_SOLUTION.md`): -- ❌ `torch.cuda.empty_cache()` - Only affects PyTorch allocator, not driver context -- ❌ `cudaDeviceReset()` in subprocesses - Parent still holds context -- ❌ Inter-example delays - Doesn't release parent context -- ❌ pynvml polling - Can't force parent to release context -- ✅ MPS - Allows GPU sharing without code changes +`llm` is a legacy alias for `e2e`. It remains registered for backwards +compatibility but must not be used in new tests. The conftest auto-apply hook +treats `llm` the same as `e2e`. -## Test Markers +## Backend markers -See [`MARKERS_GUIDE.md`](MARKERS_GUIDE.md) for complete marker documentation. +Backend markers identify which backend a test needs. They enable selective test +runs (`pytest -m ollama`) and drive auto-skip logic. 
**Only apply to `e2e` and +`qualitative` tests.** -Key markers for GPU testing: -- `@pytest.mark.vllm` - Requires vLLM backend (local, GPU required) -- `@pytest.mark.huggingface` - Requires HuggingFace backend (local, GPU-heavy) -- `@pytest.mark.slow` - Tests taking >5 minutes +| Marker | Backend | Resources | +|--------|---------|-----------| +| `ollama` | Ollama (port 11434) | Local, light (~2–4 GB RAM) | +| `openai` | OpenAI API or any OpenAI-compatible endpoint | API calls (may use Ollama `/v1`) | +| `watsonx` | IBM Watsonx API | API calls, requires credentials | +| `huggingface` | HuggingFace transformers | Local, GPU required (VRAM varies) | +| `vllm` | vLLM | Local, GPU required | +| `litellm` | LiteLLM (wraps other backends) | Depends on underlying backend | +| `bedrock` | AWS Bedrock | API calls, requires credentials | -### Resource gating (predicates) +### OpenAI-via-Ollama pattern -Use predicate functions from `test/predicates.py` for resource gating: +Some tests use the OpenAI client pointed at Ollama's `/v1` endpoint. Mark these +with **both** `openai` and `ollama`, but do **not** add `require_api_key`: ```python -from test.predicates import require_gpu, require_ram +pytestmark = [pytest.mark.e2e, pytest.mark.openai, pytest.mark.ollama] +``` + +## Resource predicates + +Fine-grained resource gating uses predicate decorators from `test/predicates.py`. 
+They compose with `pytestmark` and produce self-documenting skip reasons: -pytestmark = [pytest.mark.e2e, pytest.mark.huggingface, require_gpu(), require_ram(min_gb=48)] +```python +from test.predicates import require_gpu, require_api_key ``` | Predicate | Use when test needs | -| --------- | ------------------- | +|-----------|---------------------| | `require_gpu()` | Any GPU (CUDA or MPS) | | `require_gpu(min_vram_gb=N)` | GPU with at least N GB VRAM | -| `require_ram(min_gb=N)` | N GB+ system RAM | +| `require_ram(min_gb=N)` | N GB+ system RAM (genuinely RAM-bound tests only) | | `require_api_key("ENV_VAR")` | Specific API credentials | +| `require_package("pkg")` | Optional dependency | +| `require_python((3, 11))` | Minimum Python version | + +**Typical combinations:** + +- `huggingface` → `require_gpu(min_vram_gb=N)` (compute N from model parameters) +- `vllm` → `require_gpu(min_vram_gb=N)` (compute N from model parameters) +- `watsonx` → `require_api_key("WATSONX_API_KEY", "WATSONX_URL", "WATSONX_PROJECT_ID")` +- `openai` → `require_api_key("OPENAI_API_KEY")` only for real OpenAI, not Ollama-compat + +**Other gating markers** (not resource predicates, but still control selection): + +| Marker | Gate | Auto-skip when | +|--------|------|----------------| +| `slow` | Tests taking >1 minute | Excluded by default via `pyproject.toml` `addopts` | +| `qualitative` | Non-deterministic output | Skipped when `CICD=1` | + +**Removed markers:** `requires_gpu`, `requires_heavy_ram`, and +`requires_gpu_isolation` have been removed. Use `require_gpu(min_vram_gb=N)` +from `test.predicates` instead. + +## Auto-detection + +The test suite automatically detects system capabilities and skips tests whose +requirements are not met. No manual configuration needed. 
+ +| Capability | How detected | +|------------|--------------| +| Ollama | Port 11434 check at collection time | +| GPU / VRAM | `torch` + `sysctl hw.memsize` | +| API keys | Environment variable check | + +Run `pytest -rs` to see skip reasons for each skipped test. + +## Common patterns + +```python +# Unit — no markers needed (auto-applied by conftest) +def test_cblock_repr(): + assert str(CBlock(value="hi")) == "hi" + +# Integration — mocked backend, real project components +@pytest.mark.integration +def test_session_with_mock(mock_backend): + session = start_session(backend=mock_backend) + result = session.instruct("hello") + assert mock_backend.generate.called + +# E2E — real Ollama backend, deterministic assertion +pytestmark = [pytest.mark.e2e, pytest.mark.ollama] + +def test_structured_output(session): + result = session.format(Person, "Make up a person") + assert isinstance(json.loads(result.value), dict) + +# Qualitative — real backend, non-deterministic content check +pytestmark = [pytest.mark.e2e, pytest.mark.ollama] + +@pytest.mark.qualitative +def test_greeting_content(session): + result = session.instruct("Write a greeting") + assert "hello" in result.value.lower() + +# Heavy GPU e2e — resource predicate for gating +from test.predicates import require_gpu + +pytestmark = [pytest.mark.e2e, pytest.mark.huggingface, require_gpu(min_vram_gb=20)] +``` + +## Authoring guide + +### Naming and structure + +- File: `test_.py` in a directory mirroring the source (e.g. + `test/backends/test_ollama.py` for `mellea/backends/ollama.py`). +- Files must be named `test_*.py` so that pydocstyle ignores them. +- Function: `test___`, written so the name reads + as a sentence. +- One behavioural claim per test. If a test has `and` in the name, split it. 
+ +### Fixture discipline + +The global `test/conftest.py` provides: + +| Fixture | Scope | Use | +|---------|-------|-----| +| `gh_run` | session | Returns `1` when `CICD=1` is set; use for CI-conditional behaviour | +| `system_capabilities` | session | Detected hardware/service capabilities (GPU, Ollama, API keys) | + +Backend-specific fixtures (e.g. a pre-configured `session` against +`granite4:micro`, or a `mock_backend` for unit/integration tests) are defined +per test module or per-directory conftest — check the test files closest to +what you're adding before creating new fixtures. + +Rules: + +- Reuse existing fixtures before creating new ones. +- Do not create session-scoped fixtures that depend on real backends — they + prevent test isolation and make skip logic unreliable. +- For Ollama-backed tests, the conftest evicts models between test modules + automatically. Do not add `keep_alive` management in individual tests. + +### Mock discipline + +- Do not mock what you can replace with a real test double. +- Do not mock internal project components unless the test is explicitly testing + the boundary *around* that component. +- When you must mock a backend for a unit or integration test, mock at the + backend's public method boundary (`generate_from_chat_context`, + `generate_from_raw`), not by patching internal Mellea classes. + +### Assertions + +- Assert one observable outcome per test. +- Prefer specific assertions (`isinstance(result.value, str)`) over broad ones + (`result is not None`). +- Do not assert on `repr()` strings — they break on whitespace changes. + +### Slow tests -> **Deprecated:** The markers `requires_gpu`, `requires_heavy_ram`, `requires_api_key`, -> and `requires_gpu_isolation` are deprecated. Existing tests using them still work -> (conftest auto-skip handles them) but new tests must use predicates. Migrate legacy -> markers to predicates when touching those files. 
`require_gpu_isolation()` has been -> removed — use `--group-by-backend` for backend grouping instead. +Mark any test taking more than one minute with `@pytest.mark.slow`. Slow tests +are excluded from the default `pytest` invocation and from CI. Run them +explicitly with `pytest -m slow`. + +## Running tests + +### Environment variables + +| Variable | Effect | +|----------|--------| +| `CICD=1` | Skips qualitative tests (mirrors CI behaviour) | +| `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` | Helps with GPU memory fragmentation | + +### Common command lines + +```bash +# Fast loop — unit + integration + e2e, no qualitative (~2 min) +uv run pytest -m "not qualitative" + +# Default — all tiers including qualitative, skips slow +uv run pytest + +# Slow tests only +uv run pytest -m slow + +# Single backend +uv run pytest -m ollama +uv run pytest -m "e2e and ollama and not qualitative" + +# Specific file or test +uv run pytest test/backends/test_ollama.py +uv run pytest test/backends/test_ollama.py::test_structured_output_returns_valid_json + +# See why tests were skipped +uv run pytest -rs + +# CI mode locally (mirrors what PR CI does) +CICD=1 uv run pytest test + +# Nightly-style local run on a GPU host +./test/scripts/run_tests_with_ollama_and_vllm.sh --group-by-backend -v -s +``` + +### Scoping a test run + +A pytest run can be scoped along four independent axes; combine them as needed. 
+ +| Axis | Flag / form | Examples | +|------|-------------|---------| +| **By tier** | `-m ` | `-m unit`, `-m integration`, `-m e2e`, `-m qualitative`, `-m slow` | +| **By backend** | `-m ` | `-m ollama`, `-m huggingface`, `-m "openai or watsonx"` | +| **By compound expression** | `-m ""` | `-m "e2e and ollama and not qualitative"` | +| **By path / node id** | positional | `pytest test/backends/test_ollama.py`, `pytest test/foo.py::test_bar` | + +The `addopts` in `pyproject.toml` adds `-m "not slow"` to every invocation, so +slow tests are always excluded unless you pass `-m slow` yourself. Qualitative +tests run by default locally and are skipped only when `CICD=1` is set. + +### Auto-skip behaviour + +Tests skip automatically when requirements are not met: + +- **Ollama tests** skip at *collection time* if port 11434 is not reachable, + preventing fixture setup errors before the skip decision. +- **GPU/HuggingFace/vLLM tests** skip if no GPU is detected or VRAM is below + the test's requirement. +- **Cloud API tests** skip if required environment variables are unset. + +## CI pipeline + +| Tier | Trigger | Where | What runs | +|------|---------|-------|-----------| +| **Pre-commit** | Every commit (local) | Local hook | ruff, mypy, uv-lock, codespell, markdownlint | +| **PR CI** | Every push / merge group | GitHub Actions, Ubuntu | `pytest test/` on Python 3.11/3.12/3.13 with Ollama. `CICD=1` (qualitative skipped). `slow` excluded. | +| **Nightly** | Scheduled | IBM internal LSF cluster (GPU) | Full `pytest test/ --group-by-backend`, Ollama + vLLM, qualitative enabled. Failures file an auto-issue. | +| **On-demand nightly** | Not yet available | IBM internal LSF cluster | Comment-triggered nightly against a PR branch. Tracked in [#734](https://github.com/generative-computing/mellea/issues/734); ask a maintainer if you need pre-merge GPU validation today. 
| + +**PR CI** (`ci.yml` → `quality.yml`): pre-commit checks, then Ollama installed +and `granite4:micro` + `granite4:micro-h` pulled, then `uv run -m pytest -v +--junit-xml=... test`. `docs/examples/` is not collected in PR CI. + +**Nightly** (`test/scripts/run_tests_with_ollama_and_vllm.sh`): starts local +Ollama and (when GPU present) a local vLLM server, then runs +`pytest test/ --group-by-backend`. The `--group-by-backend` flag reorders tests +to run each backend as a contiguous group, reducing GPU memory fragmentation. ## Coverage -Coverage reports are generated in `htmlcov/` and `coverage.json`. +Branch coverage is enabled and runs automatically with every test invocation. +Reports are written to `htmlcov/` and `coverage.json`. + +```bash +uv run pytest +open htmlcov/index.html # macOS +xdg-open htmlcov/index.html # Linux +``` + +Coverage is measured over `mellea/` and `cli/`. Test files and `docs/` are +excluded. There is no enforced minimum threshold; use coverage locally to +identify untested paths. Uploading artifacts and trend reporting is an open gap +([#737](https://github.com/generative-computing/mellea/issues/737)). + +## Examples as tests + +Files in `docs/examples/` are not auto-collected. A file is only executed by +pytest if it has an opt-in comment near the top: + +```python +# pytest: e2e, ollama, qualitative +"""Greeting example — demonstrates session.instruct().""" +``` + +The comment lists comma-separated marker names (not `-m` expression syntax — +no `and`/`or`/`not`). Files without this comment are silently ignored and do +not appear in skip summaries or collection output. + +The same classification rules and marker conventions apply as for `test/` +files. Only add the `# pytest:` comment when the example has the necessary +dependencies documented and should be part of the regression suite. + +Parser: `docs/examples/conftest.py` (`_extract_markers_from_file`). 
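
As a rough illustration of the opt-in comment format described above, the sketch below parses a `# pytest:` line into marker names. It is not the project's parser: `parse_marker_comment` and the `max_lines=5` cutoff for "near the top" are hypothetical, and the real logic in `_extract_markers_from_file` may differ.

```python
import re

# Hypothetical helper for illustration only; the project's real parser is
# `_extract_markers_from_file` in docs/examples/conftest.py and may differ.
_MARKER_COMMENT = re.compile(r"^#\s*pytest:\s*(.+)$")


def parse_marker_comment(source: str, max_lines: int = 5) -> list[str]:
    """Return marker names from the first `# pytest:` comment near the top.

    `max_lines` is an assumed cutoff for what counts as "near the top".
    """
    for line in source.splitlines()[:max_lines]:
        match = _MARKER_COMMENT.match(line.strip())
        if match:
            # Comma-separated names, not `-m` expression syntax.
            return [name.strip() for name in match.group(1).split(",") if name.strip()]
    return []  # no opt-in comment: the file would be ignored


example = '# pytest: e2e, ollama, qualitative\n"""Greeting example."""\n'
markers = parse_marker_comment(example)
```

A file without the comment yields an empty list, which corresponds to the "silently ignored" case.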
+ +## Ollama model eviction + +When pytest orchestrates many Ollama-backed tests in sequence, the default 5-minute +keep-alive means models from earlier tests stay resident and accumulate, eventually +starving later tests of memory. + +Two mechanisms in `test/conftest.py` handle this: + +- **Per-module eviction** (`pytest_runtest_teardown`) — when crossing a file + boundary between Ollama-marked tests, queries `/api/ps` for all loaded models + and evicts them with `keep_alive=0`. Covers both `test/` and `docs/examples/`. + Always active, no flags required. +- **Group warm-up/eviction** (`pytest_runtest_setup`) — warms up a fixed set of CI + models (`keep_alive=-1`) when entering the Ollama backend group and evicts them + when leaving. Requires `--group-by-backend`. + +**Trade-off:** if two consecutive test files use the same model, it will be unloaded +and reloaded (~5–15 s overhead). Predictable memory behaviour is more important +than saving a reload, especially on constrained CI runners. Tests within a single +file share the loaded model with no overhead. + +**Caveat:** eviction targets *all* loaded Ollama models, not just those loaded by +the test. If you are using Ollama interactively while the suite runs, your model +will be evicted between test modules. + +## GPU testing on CUDA systems + +### The problem: CUDA EXCLUSIVE_PROCESS mode + +When running GPU tests on systems with `EXCLUSIVE_PROCESS` mode (common on HPC +clusters), you may encounter "CUDA device busy" errors. This happens because: + +1. The pytest parent process creates a CUDA context when running regular tests. +2. Example tests run in subprocesses (via `docs/examples/conftest.py`). +3. In `EXCLUSIVE_PROCESS` mode, only one process can hold a CUDA context per GPU. +4. Subprocesses fail with "CUDA device busy" when the parent still holds the context. 
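
To check whether a host is actually in `EXCLUSIVE_PROCESS` mode before debugging "CUDA device busy" errors, a small script can query `nvidia-smi`. This is a minimal sketch, not part of the test suite: the function names are hypothetical, and it assumes `nvidia-smi` is on `PATH` and reports the mode as `Exclusive_Process` through its `compute_mode` query field.

```python
import subprocess


def parse_compute_modes(nvidia_smi_output: str) -> list[str]:
    """Split nvidia-smi CSV output into one compute-mode string per GPU."""
    return [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]


def exclusive_process_gpus(modes: list[str]) -> list[int]:
    """Return indices of GPUs whose compute mode is Exclusive_Process."""
    return [i for i, mode in enumerate(modes) if mode.lower() == "exclusive_process"]


def query_compute_modes() -> list[str]:
    """Ask nvidia-smi for each GPU's compute mode; empty list if unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (FileNotFoundError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return []  # no NVIDIA driver or tool on this host
    return parse_compute_modes(result.stdout)


if __name__ == "__main__":
    gpus = exclusive_process_gpus(query_compute_modes())
    print(f"GPUs in EXCLUSIVE_PROCESS mode: {gpus}")
```

If the list is empty, the "CUDA device busy" errors likely have a different cause than the parent-process context described above.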
+ +### Solution 1: NVIDIA MPS (recommended) + +**NVIDIA Multi-Process Service (MPS)** allows multiple processes to share a GPU +in `EXCLUSIVE_PROCESS` mode. Enable it via your job scheduler configuration; +consult your HPC documentation for specific syntax. + +### Solution 2: run smaller test subsets + +If MPS is unavailable, run `test/` and `docs/examples/` in separate invocations: + +```bash +pytest -m huggingface test/ +pytest -m huggingface docs/examples/ +``` + +If conflicts persist, continue breaking down into smaller subsets. + +### Why this matters + +The test infrastructure runs examples in subprocesses to isolate execution and +capture stdout/stderr cleanly, but this creates the "parent trap": the parent +pytest process holds a CUDA context from running regular tests, blocking +subprocesses from accessing the GPU. + +**Approaches that do not work:** `torch.cuda.empty_cache()` (only affects the +PyTorch allocator, not the driver context), `cudaDeviceReset()` in subprocesses +(parent still holds the context), inter-example delays.