docs: add test strategy document #1167
Adds docs/docs/community/testing-strategy.md, the contributor guide covering philosophy, classification rules, authoring guide, CI tier map, local workflow, examples-as-tests opt-in, and backend/resource gating. Closes the pre-existing broken link from unit-test-generative-code to community/contributing-guide#testing. MARKERS_GUIDE.md is kept as the detailed marker reference and cross-linked from the new page.

Updates:
- docs.json (add page to Community group)
- AGENTS.md §3 and §10 (pointer to strategy doc)
- CONTRIBUTING.md (Test Markers section)
- test/README.md (remove duplicated predicates table)
- test/MARKERS_GUIDE.md (add "See also" header)

Closes generative-computing#853
Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Rewrites testing-strategy.md following a cold read as an outsider:
- Adds backend intro table so 'why ollama vs huggingface matters for testing' is explained upfront rather than assumed knowledge
- Expands tier definitions (unit/integration/e2e/qualitative) with positive/negative indicators and explicit decision rules; qualitative is now properly framed as a sub-tier of e2e
- Adds fixture reference table (session, mock_backend, gh_run, etc.)
- Adds coverage section (htmlcov/, branch coverage, no enforced threshold)
- Fixes CI accuracy: PR CI runs 'pytest test' not 'pytest', so docs/examples are not collected in CI
- Fixes '# pytest:' comment description: it takes marker names, not -m expressions
- Removes predicate table and auto-detection table (defer to MARKERS_GUIDE to avoid maintaining two identical tables); keeps the usage code example
- Clarifies qualitative tests run locally by default, skipped in CI only

CONTRIBUTING.md: trims the Testing Quick Reference to essential setup commands + pointer to strategy doc; fixes CI command (it was missing the 'test' path argument); removes duplicate tier/marker/workflow content.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
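To make the '# pytest:' point concrete, here is a minimal sketch of the difference between plain marker names and -m expressions. It is not the project's collection code: the comma separator, the function name parse_marker_comment, and the example marker names are all assumptions for illustration.

```python
import re

# Assumed format: "# pytest:" followed by comma-separated marker names.
_MARKER_COMMENT = re.compile(r"^#\s*pytest:\s*(?P<names>.+)$")
_VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def parse_marker_comment(line):
    """Return the marker names on a '# pytest:' line, or [] for any other line."""
    match = _MARKER_COMMENT.match(line.strip())
    if match is None:
        return []
    names = [part.strip() for part in match.group("names").split(",")]
    for name in names:
        if not _VALID_NAME.match(name):
            # A -m style expression such as "ollama and not slow" lands here.
            raise ValueError(f"not a plain marker name: {name!r}")
    return names


# Hypothetical marker names, used only to exercise the parser.
markers = parse_marker_comment("# pytest: ollama, e2e")
```

A line such as "# pytest: ollama and not slow" raises ValueError in this sketch, which is the failure mode the corrected description is meant to steer contributors away from.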
Comprehensive revision of testing-strategy.md following a cold-read review:

Accuracy fixes:
- Remove fabricated fixture table (session/mock_backend/clean_metrics_env/plugins are per-file, not global conftest; only gh_run and system_capabilities are confirmed global fixtures)
- Fix coverage claim: reports run in CI but are discarded; issue generative-computing#737 tracks surfacing them; CI job summary is JUnit counts only, not coverage
- Fix 'open htmlcov/' to show both macOS and Linux commands

Clarity fixes:
- Rename test_<unit>_... naming convention to test_<subject>_... (unit collided with the tier name)
- Integration positive indicators were exclusively OTel; added component-wiring example so contributors outside telemetry can recognise their tests
- Qualitative code example no longer repeats the e2e test from the previous section
- Philosophy: clarify all tiers run locally by default, not just qualitative
- Add note about generative-computing#729 e2e→integration migration for existing tests

Structural (progressive):
- Move 'About backends' inside Classification where it's needed
- Move 'Local dev workflow' before CI pipeline (contributors need commands before they need to understand automation)
- Move Coverage after CI (it is CI output)
- Reorder so readers can stop earlier: Philosophy → Classification → Authoring → Local dev → CI → Coverage → Advanced

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
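The macOS/Linux split for opening the coverage report can be sketched as a small helper. This is an illustration only: the commit does not say which Linux command the doc uses, so xdg-open is an assumption, as is the helper name coverage_open_command; index.html is the default entry page that coverage.py writes into htmlcov/.

```python
import sys


def coverage_open_command(platform=sys.platform, report_dir="htmlcov"):
    """Build the argv that would open the HTML coverage report on this platform."""
    index = f"{report_dir}/index.html"
    if platform == "darwin":
        # macOS ships the 'open' launcher.
        return ["open", index]
    if platform.startswith("linux"):
        # Assumption: a desktop Linux with xdg-utils installed.
        return ["xdg-open", index]
    raise RuntimeError(f"no opener known for platform {platform!r}")


mac_cmd = coverage_open_command("darwin")
linux_cmd = coverage_open_command("linux")
```

Passing the result to subprocess.run would launch the report; the helper only builds the command so it can be checked without a browser.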
Final review fixes:
- Fix backend method names in mock discipline: 'generate'/'astream' do not exist on backends; actual public methods are 'generate_from_chat_context' and 'generate_from_raw'
- Fix qualitative code example: 'result' was undefined (left over from trimming the duplicated e2e test). Now shows a single self-contained qualitative example with an explanatory comment about why the assertion is non-deterministic
- LiteLLM described as 'unified Python client' rather than 'proxy' (the proxy form exists but is not how Mellea uses LiteLLM)
- Reword 'SDK boundary assertions' (jargon) to 'assertions on what an external SDK actually emitted' in the per-tier table
- Remove redundant 'Auto-applied unit marker' subsection: the same fact is stated in Philosophy, Classification > Unit, and Authoring > Markers

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
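The method-name fix is the kind of drift a spec'd mock catches automatically. The sketch below assumes a stand-in class, HypotheticalBackend; only the two method names come from the commit message, while the parameter lists and return value are invented for illustration and are not Mellea's real signatures.

```python
from unittest.mock import create_autospec


class HypotheticalBackend:
    """Stand-in for a backend; parameter lists here are assumptions."""

    def generate_from_chat_context(self, action, ctx):
        raise NotImplementedError

    def generate_from_raw(self, actions, ctx):
        raise NotImplementedError


def make_mock_backend():
    # instance=True makes the mock behave like an instance, not the class itself.
    mock = create_autospec(HypotheticalBackend, instance=True)
    mock.generate_from_chat_context.return_value = "stubbed reply"
    return mock


mock_backend = make_mock_backend()
reply = mock_backend.generate_from_chat_context("prompt", ctx=None)
mock_backend.generate_from_chat_context.assert_called_once_with("prompt", ctx=None)

# A spec'd mock refuses attribute names that the real class does not define.
try:
    mock_backend.generate
    stale_name_rejected = False
except AttributeError:
    stale_name_rejected = True
```

An unspec'd MagicMock would accept mock_backend.generate without complaint, which is how a stale method name can survive in tests and documentation.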
Reduce em-dash count from 38 to 24 by substituting commas, colons, semicolons, full stops, and parentheses where they read as naturally or better. Em dashes remain where they're genuinely the right call (sharp asides, table-cell separators).

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The CI table previously marked nightly and pre-release as 'planned'. Nightlies actually run today on the Bluevela LSF cluster, orchestrated by an external nightly.py driver that invokes test/scripts/run_tests_with_ollama_and_vllm.sh --group-by-backend with Ollama and vLLM. Failures auto-file an issue (see generative-computing#985 for the format).

Updates:
- CI table now distinguishes Pre-commit / PR CI (GitHub Actions) / Nightly (Bluevela, scheduled, exists) / On-demand nightly (generative-computing#734, planned) / Pre-release (planned) / Local dev
- Adds 'Nightly in detail' section explaining the script, models, and --group-by-backend memory-fragmentation rationale
- Adds 'Scoping a test run' subsection with a four-axis table (tier / backend / compound expression / path) and a runtime resource-gating note, so contributors can see how to slice runs without reading every later section
- Common command lines now includes the local nightly-style invocation

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The doc explained concepts well but a developer wanting to actually do something had to synthesise from four different sections. Adds a short action-oriented section near the top covering:
- First-time setup (pointer to contributing-guide + the two essential Ollama commands)
- Running tests during development (three commands)
- Adding a new test (5-step procedure with links into Classification, Markers, and the rest of the doc for detail)
- Before opening a PR (the two commands CI runs, plus a pointer that GPU-backed paths only run in nightly so changes there should be validated via maintainer before merge)

Progressive structure preserved: this is a quick-reference top section, the deeper concept sections below it remain authoritative.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Three follow-ups uncovered while validating the doc:

Stale issue references:
- generative-computing#734 (on-demand nightlies) and generative-computing#737 (coverage reporting) last saw activity 2026-03-24, over two months ago. Saying 'tracked in generative-computing#734' implies active work. Reword to 'open as #N but not currently being worked on' so contributors aren't misled into expecting progress.
- Pre-release tier had no specific tracker; just point at parent epic generative-computing#726.

Cross-context links:
- 'https://generative-computing.github.io/mellea/community/testing-strategy' was used in test/MARKERS_GUIDE.md, test/README.md, and CONTRIBUTING.md but the URL returns 404: the canonical Mintlify domain isn't actually configured there. Replace with relative path to the source markdown, matching the pattern already used in AGENTS.md. Source-tree files read on GitHub get a working link in PR review and after merge.
- CONTRIBUTING.md anchor '#ci-pipeline-tiers' was stale (renamed to 'CI pipeline' / '#ci-pipeline').

Bluevela rename:
- Bluevela is an internal IBM environment, not mentioned anywhere else in the public repo. Replace with neutral 'IBM internal LSF cluster' so the published docs don't expose internal infra naming.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The audit-markers skill listed test/MARKERS_GUIDE.md as the authoritative source for marker conventions. Now that the test strategy doc owns classification decision rules and per-tier definitions, add it to the skill's Project References (above MARKERS_GUIDE.md) so the agent reads it for the 'is this unit / integration / e2e / qualitative?' question, then drops to MARKERS_GUIDE.md for marker mechanics.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
jakelorocco
left a comment
I think the changes look good / the documentation is accurate.
I don't quite know how I feel about this document being something that is hosted with the Mellea documentation but will default to others on that decision. Typically when I contribute to a repository, I expect the documentation for testing, etc... to be a regular *.md file with instructions, not a community doc.
Maybe we should start linking to some of these pages from our PR template? That way users don't have to go out of their way to figure out how to get their PRs in good shape?
Should we remove the testing section from https://docs.mellea.ai/community/contributing-guide#testing as well then? This info seems to duplicate most of that.
Summary
Closes #853 (parent epic #726).
Why
We have test/MARKERS_GUIDE.md as a marker reference and test/README.md for operational notes, but a new contributor landing on the repo has no single place that explains the strategy: what each tier means, how to decide between them, what runs in CI, how coverage works, where qualitative tests sit. They have to piece it together from pyproject.toml, the workflow YAML, three docs, and the conftest.

This PR adds docs/docs/community/testing-strategy.md as that missing top-level document. It is structured to be progressive: a short "What you need to do" action section first (first-time setup, the daily commands, a 5-step procedure for adding a test, the pre-PR checklist), then the philosophy and decision rules, per-tier definitions, authoring guidance, local workflow with explicit scoping axes, CI tiers, coverage, and finally advanced reference material on backend gating. A reader can stop wherever they have what they need rather than read the whole thing.

While writing it I also closed the broken anchor link from how-to/unit-test-generative-code.md, which has pointed at a non-existent community/contributing-guide#testing for some time. That link now resolves.

What's in scope
- docs/docs/community/testing-strategy.md. Sections: What you need to do, Philosophy, Classification (with backend overview, per-tier definitions, deprecated llm marker), Authoring guide, Local dev workflow (with scoping-axes table), CI pipeline (Pre-commit / PR CI / Nightly / planned tiers), Coverage, Examples as tests, Backend & resource gating.
- docs.json updated to register the new page in the Community group.
- Broken link in how-to/unit-test-generative-code.md retargeted at the new page.
- AGENTS.md §3 and §10, CONTRIBUTING.md Testing section, test/MARKERS_GUIDE.md and test/README.md all cross-link to the new page rather than duplicate strategy material. MARKERS_GUIDE.md stays the marker reference.
- Corrected backend method names (generate_from_chat_context/generate_from_raw, not the previously-stated generate/astream); fixed CONTRIBUTING.md CI replication command (was CICD=1 uv run pytest but PR CI runs pytest test).

CI tier accuracy
This was the hardest part to get right. The doc now reflects what actually runs:
- [...] quality.yml).
- [...] nightly.py driver that invokes test/scripts/run_tests_with_ollama_and_vllm.sh --group-by-backend. Failures auto-file a dated issue (e.g. "Failing nightlies in 2026-04-30-0f95f84" #985 was the most recent).

What's deliberately not here
- MARKERS_GUIDE.md rewrite or consolidation: it stays as the marker reference. Strategy doc covers principles; marker guide covers the lookup table.
- [...] CONTRIBUTING.md with Mintlify community/contributing-guide.md: out of scope; separate cleanup.

Where this fits in Epic #726
This PR closes work item #853 (test strategy document) under the parent epic. The doc is honest about what's implemented today versus what's planned, and points at the relevant child issues only where they're meaningfully tracking active work.
Testing
- [...] uv run pre-commit run markdownlint --files docs/docs/community/testing-strategy.md)
- [...] community/contributing-guide#testing anchor and stale #ci-pipeline-tiers anchor both fixed
- [...] mellea/backends/ollama.py; conftest behaviour and capability caching checked against test/conftest.py; CI behaviour checked against .github/workflows/quality.yml; pytest.ini_options claims checked against pyproject.toml; nightly script confirmed in test/scripts/run_tests_with_ollama_and_vllm.sh

Assisted-by: Claude Code