feat: add PDF test corpus (1,905 files, 10 source groups) #148

Merged: StephanMeijer merged 11 commits into main from feature/pdf-corpus on Apr 3, 2026
@StephanMeijer

Summary

Adds 1,905 openly licensed PDFs to the DocSpec test corpus as a new format, drawn from 10 diverse source groups.

Source Groups

| Group | Files | License | Coverage |
| --- | --- | --- | --- |
| pdf-js | 711 | Apache-2.0 | Annotations, forms, CID fonts, Arabic/CJK, encryption, edge cases |
| qpdf | 649 | Apache-2.0 | Encryption (AES/RC4), page operations, compression, malformed PDFs |
| apache-pdfbox | 154 | Apache-2.0 | Forms/AcroForms, text extraction, image handling |
| pikepdf | 35 | MPL-2.0 | Color spaces, PDF/A compliance |
| arxiv | 65 | CC-BY-4.0 | Academic papers, equations, multi-column, diverse disciplines |
| openstax | 23 | CC-BY-4.0 | Textbook excerpts, tables, figures, equations |
| irs-forms | 144 | CC0-1.0 | Fillable forms, checkboxes, government structured forms |
| nasa-ntrs | 86 | CC0-1.0 | Technical reports, figures, tables, graphs |
| gutenberg | 10 | CC0-1.0 | Public domain literature (EN, DE, FR, ES, IT) |
| internet-archive | 28 | CC0-1.0 | Scanned/OCR books (Arabic, Chinese, French, German, Spanish, Russian, English) |
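
As a quick sanity check, the per-group counts above can be tallied against the headline figure of 1,905 files. The snippet below is only an illustrative sketch: the `GROUP_COUNTS` dictionary is transcribed by hand from the table and is not part of the repository.

```python
# Per-group file counts, transcribed from the Source Groups table.
# GROUP_COUNTS is a hypothetical name used only for this illustration.
GROUP_COUNTS = {
    "pdf-js": 711,
    "qpdf": 649,
    "apache-pdfbox": 154,
    "pikepdf": 35,
    "arxiv": 65,
    "openstax": 23,
    "irs-forms": 144,
    "nasa-ntrs": 86,
    "gutenberg": 10,
    "internet-archive": 28,
}

total_files = sum(GROUP_COUNTS.values())
total_groups = len(GROUP_COUNTS)

# Both figures should match the PR title: 1,905 files, 10 source groups.
print(f"{total_files} files across {total_groups} groups")
```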

Validation

  • ✅ All 1,905 PDFs start with `%PDF-` magic bytes
  • ✅ No file exceeds 10 MB
  • ✅ No duplicate files (`bin/check-duplicates.sh`)
  • ✅ All SPDX license identifiers valid (`bin/validate-licenses.sh`)
  • ✅ ATTRIBUTION.json updated with 10 entries (one per group)
  • ✅ ATTRIBUTION.md regenerated
  • ✅ All PDFs tracked via Git LFS (`.gitattributes` unchanged)
  • ✅ No download scripts committed (`scrape/` is gitignored)
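
The first three checks in the list above (magic bytes, size limit, duplicates) could be reproduced locally with something like the following. This is a minimal sketch under stated assumptions, not the repository's tooling: the function names are hypothetical, "10MB" is read as 10 MiB, and duplicates are detected by SHA-256 content hash, which may differ from what `bin/check-duplicates.sh` actually does.

```python
import hashlib
from pathlib import Path

PDF_MAGIC = b"%PDF-"
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" interpreted as 10 MiB


def check_pdf(path: Path) -> list[str]:
    """Return a list of problems found for a single file (empty if none)."""
    problems = []
    with path.open("rb") as fh:
        if fh.read(len(PDF_MAGIC)) != PDF_MAGIC:
            problems.append("missing %PDF- magic bytes")
    if path.stat().st_size > MAX_BYTES:
        problems.append("exceeds size limit")
    return problems


def find_duplicates(paths: list[Path]) -> dict[str, list[Path]]:
    """Group files by SHA-256 digest and keep only digests seen more than once."""
    by_digest: dict[str, list[Path]] = {}
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest.setdefault(digest, []).append(path)
    return {d: ps for d, ps in by_digest.items() if len(ps) > 1}


def validate_corpus(root: Path) -> tuple[dict[Path, list[str]], dict[str, list[Path]]]:
    """Check every *.pdf under root; return (per-file problems, duplicate groups)."""
    pdfs = sorted(root.rglob("*.pdf"))
    problems = {p: found for p in pdfs if (found := check_pdf(p))}
    return problems, find_duplicates(pdfs)
```

Calling `validate_corpus(Path("path/to/corpus"))` on a clean corpus should return two empty dictionaries; the path shown is a placeholder, since the PR does not state where the PDFs live in the repository.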

Feature Coverage

text-heavy · tables · forms · fillable-forms · images · equations · multilingual · rtl · cjk · scanned · ocr · annotations · encryption · test-fixtures · multi-column · public-domain · academic · government · educational · technical-reports

@StephanMeijer StephanMeijer merged commit 9ab3cdb into main Apr 3, 2026
2 checks passed