feat: expand test corpus from 2 to 7 formats with 1,794 documents #147

Merged: StephanMeijer merged 14 commits into main from feature/new-formats on Apr 1, 2026

@StephanMeijer
Contributor

Summary

Expand the DocSpec test corpus from 2 formats (DOCX, ODT) and ~256 documents to 7 formats and 1,794 documents, roughly a 7× increase in document count.

New formats added

| Format | Files | Sources | License |
|---|---|---|---|
| Markdown | 159 | CommonMark spec, Rust Book, Docusaurus | CC-BY-SA-4.0, MIT, CC-BY-4.0 |
| HTML | 113 | MDN Learning Area, W3C WPT, HTML5 Boilerplate | CC0-1.0, BSD-3-Clause, MIT |
| EPUB | 101 | Standard Ebooks | CC0-1.0 |
| Legacy .doc | 155 | Apache POI HWPF test suite | Apache-2.0 |
| RTF | 118 | LibreOffice Writer QA, Pandoc | MPL-2.0, GPL-2.0-or-later |

Expanded existing format

| Format | Before | After | New Sources |
|---|---|---|---|
| ODT | 52 | 944 | LibreOffice ODF QA (MPL-2.0), ODF Toolkit (Apache-2.0) |

What's included

  • Scraper infrastructure (scrape/) — Standalone Python scripts per format with shared utilities (rate limiting, robots.txt compliance, magic-byte validation, attribution tracking)
  • ATTRIBUTION.json — 16 entries with verified SPDX license identifiers, all audited against actual source repository LICENSE files
  • Updated docs — README.md and CONTRIBUTING.md updated with new format examples
  • Updated tooling: check-duplicates.sh extended for new extensions
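As an illustration of the magic-byte validation mentioned above, here is a minimal sketch. The `MAGIC_BYTES` table and the `has_expected_magic` helper are hypothetical names, not taken from the scrape/ scripts, and the real scrapers may check different or additional signatures.

```python
from pathlib import Path

# Hypothetical signature table (an assumption, not the project's actual list).
MAGIC_BYTES = {
    ".docx": b"PK\x03\x04",  # ZIP container
    ".odt": b"PK\x03\x04",   # ZIP container
    ".epub": b"PK\x03\x04",  # ZIP container
    ".doc": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",  # OLE2 compound file
    ".rtf": b"{\\rtf",       # RTF files start with the literal text {\rtf
}


def has_expected_magic(path: Path) -> bool:
    """Return True if the file starts with the signature expected for its extension."""
    expected = MAGIC_BYTES.get(path.suffix.lower())
    if expected is None:
        # Text formats such as .md and .html have no fixed signature; accept them here.
        return True
    with path.open("rb") as handle:
        return handle.read(len(expected)) == expected
```

A check like this would catch, for example, an HTML error page saved under a `.doc` name by a failed download.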

All CI checks pass

  • validate-licenses.sh ✅ (16/16 SPDX identifiers valid)
  • check-duplicates.sh ✅ (no duplicates)
  • check-zip-validity.sh ✅ (1,249 ZIP-based files validated)
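For readers unfamiliar with this kind of check, duplicate detection can be sketched as grouping files by content hash. This is an assumption about the general approach, not a port of check-duplicates.sh; `find_duplicates` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Files with the same SHA-256 digest are treated as duplicates.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

An empty result corresponds to the "no duplicates" outcome reported above.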

License audit

All 16 attribution entries verified against actual source repository LICENSE files. 4 initial mismatches were found and corrected before this PR (see commit e89f749).

- 80 HTML examples from MDN Learning Area (CC-BY-SA-2.5)
- 30 HTML test fixtures from W3C Web Platform Tests (W3C-20150513)
- 3 HTML5 Boilerplate templates (MIT)
- 113 total standalone HTML files in documents/html/
- Add ATTRIBUTION.json entries for all 3 sources
- Remove 3 pre-existing duplicate .doc files to fix check-duplicates.sh

Verified all 10 new attribution entries against actual source repositories:
- Rust Book: CC-BY-4.0 -> MIT (dual-licensed MIT/Apache-2.0)
- Docusaurus docs: MIT -> CC-BY-4.0 (per LICENSE-docs)
- MDN Learning Area: CC-BY-SA-2.5 -> CC0-1.0 (per repo LICENSE)
- W3C WPT: W3C-20150513 -> BSD-3-Clause (per LICENSE.md)
- LibreOffice RTF: added triple-license note (MPL-2.0/LGPL-3.0+/GPL-3.0+)

Add 892 ODT test fixtures from two new sources:
- LibreOffice Writer QA: 275 files (MPL-2.0, odfimport/odfexport/globalfilter)
- ODF Toolkit: 617 files (Apache-2.0, DOM/validation/spec compliance tests)

Includes odt_gather.py scraper for reproducibility.
Removed 10 duplicates and 3 intentionally malformed test files.
…idation

unzip handles Unicode ZIP entry paths inconsistently across distros,
causing false negatives on CI (ubuntu-latest) for valid ODT files like
unicode-path.odt. Python's zipfile module is consistent everywhere.
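A minimal sketch of ZIP validation with Python's zipfile module, in the spirit of the commit message above. The `is_valid_zip` helper is a hypothetical name, and the actual check in check-zip-validity.sh may differ.

```python
import zipfile
from pathlib import Path


def is_valid_zip(path: Path) -> bool:
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False
```

Because zipfile decodes entry names itself, the result does not depend on which unzip build a given CI image ships.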
@StephanMeijer StephanMeijer merged commit 8064110 into main Apr 1, 2026
4 checks passed