feat: expand test corpus from 2 to 7 formats with 1,794 documents by StephanMeijer · Pull Request #147 · docspec/documents

StephanMeijer · 2026-04-01T21:22:50Z

Summary

Expand the DocSpec test corpus from 2 formats (DOCX, ODT) and ~256 documents to 7 formats and 1,794 documents, a 7× increase in document count.

New formats added

Format	Files	Sources	License
Markdown	159	CommonMark spec, Rust Book, Docusaurus	CC-BY-SA-4.0, MIT, CC-BY-4.0
HTML	113	MDN Learning Area, W3C WPT, HTML5 Boilerplate	CC0-1.0, BSD-3-Clause, MIT
EPUB	101	Standard Ebooks	CC0-1.0
Legacy .doc	155	Apache POI HWPF test suite	Apache-2.0
RTF	118	LibreOffice Writer QA, Pandoc	MPL-2.0, GPL-2.0-or-later

Expanded existing format

Format	Before	After	New Sources
ODT	52	944	LibreOffice ODF QA (MPL-2.0), ODF Toolkit (Apache-2.0)

What's included

Scraper infrastructure (scrape/) — Standalone Python scripts per format with shared utilities (rate limiting, robots.txt compliance, magic-byte validation, attribution tracking)
ATTRIBUTION.json — 16 entries with verified SPDX license identifiers, all audited against actual source repository LICENSE files
Updated docs — README.md and CONTRIBUTING.md updated with new format examples
Updated tooling — check-duplicates.sh extended for new extensions
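
To illustrate the magic-byte validation mentioned among the shared utilities, here is a minimal sketch. The function name `has_expected_magic` and the extension-to-signature table are assumptions for illustration, not the scraper's actual code; the byte signatures themselves are the standard ones for ZIP, OLE2 compound files, and RTF.

```python
from pathlib import Path

# Leading byte signatures per extension. The mapping is an assumption about
# how the corpus is organised; it is not taken from the scrape/ scripts.
MAGIC = {
    ".docx": b"PK\x03\x04",                        # ZIP container
    ".odt": b"PK\x03\x04",                         # ZIP container
    ".epub": b"PK\x03\x04",                        # ZIP container
    ".doc": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",   # OLE2 compound file
    ".rtf": b"{\\rtf",                             # RTF header
}


def has_expected_magic(path: Path) -> bool:
    """Return True if the file starts with the signature for its extension."""
    expected = MAGIC.get(path.suffix.lower())
    if expected is None:
        # Plain-text formats such as .md or .html have no fixed signature.
        return True
    with path.open("rb") as fh:
        return fh.read(len(expected)) == expected
```

A check like this would catch a common scraping failure: an HTML error page saved under a `.doc` or `.odt` name.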

All CI checks pass

validate-licenses.sh ✅ (16/16 SPDX identifiers valid)
check-duplicates.sh ✅ (no duplicates)
check-zip-validity.sh ✅ (1,249 ZIP-based files validated)
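
For readers unfamiliar with what a duplicate check involves, the idea can be sketched in Python by grouping files on a content hash. This is not the contents of check-duplicates.sh (a shell script); `find_duplicates` is a hypothetical helper, and SHA-256 is an assumed choice of digest.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root by SHA-256 digest; return groups with >1 file."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

An empty result corresponds to the "no duplicates" outcome reported above.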

License audit

All 16 attribution entries verified against actual source repository LICENSE files. 4 initial mismatches were found and corrected before this PR (see commit e89f749).
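
The automated half of this audit (checking that each entry carries a recognised identifier) might look roughly like the sketch below. The ATTRIBUTION.json schema is not shown in this PR, so the `source` and `license` field names are assumptions, as is the function name. The allow-list contains only the identifiers named in the tables above, not the full SPDX list. Comparing entries against upstream LICENSE files remains a manual step.

```python
import json

# Identifiers that appear in this PR's tables; a real check would use the
# full SPDX license list instead of this hand-written subset.
KNOWN_SPDX = {
    "Apache-2.0", "BSD-3-Clause", "CC-BY-4.0", "CC-BY-SA-4.0",
    "CC0-1.0", "GPL-2.0-or-later", "MIT", "MPL-2.0",
}


def unknown_license_entries(attribution_json: str) -> list[str]:
    """Return the sources whose license is not in the allow-list.

    Assumes a JSON array of objects with hypothetical "source" and
    "license" keys.
    """
    entries = json.loads(attribution_json)
    return [
        entry.get("source", "?")
        for entry in entries
        if entry.get("license") not in KNOWN_SPDX
    ]
```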

- 80 HTML examples from MDN Learning Area (CC-BY-SA-2.5)
- 30 HTML test fixtures from W3C Web Platform Tests (W3C-20150513)
- 3 HTML5 Boilerplate templates (MIT)
- 113 total standalone HTML files in documents/html/
- Add ATTRIBUTION.json entries for all 3 sources
- Remove 3 pre-existing duplicate .doc files to fix check-duplicates.sh


Verified all 10 new attribution entries against actual source repositories:
- Rust Book: CC-BY-4.0 -> MIT (dual-licensed MIT/Apache-2.0)
- Docusaurus docs: MIT -> CC-BY-4.0 (per LICENSE-docs)
- MDN Learning Area: CC-BY-SA-2.5 -> CC0-1.0 (per repo LICENSE)
- W3C WPT: W3C-20150513 -> BSD-3-Clause (per LICENSE.md)
- LibreOffice RTF: added triple-license note (MPL-2.0/LGPL-3.0+/GPL-3.0+)

Add 892 ODT test fixtures from two new sources:
- LibreOffice Writer QA: 275 files (MPL-2.0, odfimport/odfexport/globalfilter)
- ODF Toolkit: 617 files (Apache-2.0, DOM/validation/spec compliance tests)

Includes odt_gather.py scraper for reproducibility. Removed 10 duplicates and 3 intentionally malformed test files.

…idation

unzip handles Unicode ZIP entry paths inconsistently across distros, so valid ODT files such as unicode-path.odt were wrongly reported as invalid on CI (ubuntu-latest). Python's zipfile module behaves the same on every platform.
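
A minimal sketch of ZIP validation with Python's `zipfile` module, along the lines this commit describes. The function name `is_valid_zip` is hypothetical and the actual check-zip-validity.sh logic is not shown in this PR; `ZipFile.testzip()` is the standard-library call that reads every member and checks its CRC.

```python
import zipfile
from pathlib import Path


def is_valid_zip(path: Path) -> bool:
    """Return True if path opens as a ZIP archive and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first corrupt member, or None.
            return zf.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False
```

Because entry names are decoded inside Python rather than by a system `unzip` binary, an archive with non-ASCII member paths is treated the same way regardless of the runner's locale.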

StephanMeijer added 14 commits April 1, 2026 17:21

fix(tooling): add .doc, .html, .rtf extensions to check-duplicates.sh

7411029

docs: add new format examples to README and CONTRIBUTING

373ec1a

feat(scrape): add scraper infrastructure with shared utilities

f0c11b7

feat(rtf): add RTF corpus from test suites and archives

419bbd4

feat(doc): add legacy .doc (WW8) corpus from test suites

5963745

feat(epub): add EPUB corpus from Standard Ebooks

bcbeb62

feat(markdown): add Markdown corpus from open-source repositories

4c2472b

chore: regenerate ATTRIBUTION.md from updated JSON

a720f81

fix(scrape): enforce 2s rate limit in html scraper; add main guard to…

9b4f1bc

… epub scraper

chore: remove scrape scripts from repository

31e5f8d

StephanMeijer merged commit 8064110 into main Apr 1, 2026
4 checks passed

StephanMeijer merged 14 commits into main from feature/new-formats
