Bump version to 0.1.6 #1914

Merged
afourney merged 1 commit into main from bump_version
May 26, 2026

@afourney
Member

No description provided.

@afourney merged commit e144e0a into main on May 26, 2026
6 checks passed
@afourney deleted the bump_version branch on May 26, 2026 22:40
Ahoo-Wang added a commit to Ahoo-Wang/markitdown that referenced this pull request Jun 1, 2026
* Resolved an issue with linked images in docx [mammoth] (microsoft#1405)

* Fixed documentation typos in _base_converter.py (microsoft#1393)

* Ensure safe ExifTool usage: require >= 12.24 (microsoft#1399)

* feat: add version verification for ExifTool to ensure security compliance
* fix: improve ExifTool version verification

---------
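
The ExifTool version gate described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from microsoft#1399: the helper names (`parse_exiftool_version`, `is_safe_version`, `check_installed_exiftool`) are hypothetical, and it assumes `exiftool -ver` prints a plain `major.minor` string with a two-digit minor part, such as `12.24`.

```python
import subprocess

# Minimum version named in the commit title above.
MIN_EXIFTOOL_VERSION = (12, 24)


def parse_exiftool_version(text):
    """Parse output such as '12.24\n' into a tuple of ints, e.g. (12, 24)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_safe_version(text, minimum=MIN_EXIFTOOL_VERSION):
    """Return True when the reported version is at least the minimum."""
    return parse_exiftool_version(text) >= minimum


def check_installed_exiftool(exiftool_path="exiftool"):
    """Run `exiftool -ver` and raise if the installed binary is too old.

    `exiftool_path` is a hypothetical parameter for this sketch.
    """
    result = subprocess.run(
        [exiftool_path, "-ver"], capture_output=True, text=True, check=True
    )
    if not is_safe_version(result.stdout):
        raise RuntimeError(
            f"ExifTool {result.stdout.strip()} is older than the required 12.24"
        )
    return result.stdout.strip()
```

Comparing integer tuples rather than floats or raw strings avoids ordering surprises such as "9.99" sorting after "12.24" as text.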

* Bump actions/checkout from 4 to 5 (microsoft#1394)

Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.

* Add HTML support to DocumentIntelligenceConverter (microsoft#1352)

* fix: correctly pass custom llm prompt parameter (microsoft#1319)

* fix: correctly pass custom llm prompt parameter

* Update README.md (microsoft#1335)

Fix typo in README.md

* Update README.md (microsoft#1350)

ISSUE microsoft#1339

* Update README.md (microsoft#1191)

Fix: Subtle spelling mistake fixed.

* Adding support for data-src Attribute (microsoft#1226)

* supportfordata-src

* docs: correct minor typos (microsoft#1173)

* fix docx parse error(\n in alt) (microsoft#1163)

* Handle PPTX shapes where position is None (microsoft#1161)

* Handle shapes where position is None
* Fixed recursion error, and place no-coord shapes at front

* feat: add checkbox support to Markdown converter (microsoft#1208)

This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
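
As a rough illustration of the checkbox mapping described above, the sketch below uses only the standard library's `html.parser`. It is not the converter code from microsoft#1208; the class and function names are hypothetical, and it ignores every HTML element other than checkbox inputs.

```python
from html.parser import HTMLParser


class CheckboxToMarkdown(HTMLParser):
    """Hypothetical helper: keeps text and renders checkbox inputs as [ ] or [x]."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input" and (attr_map.get("type") or "").lower() == "checkbox":
            # A bare `checked` attribute arrives with value None,
            # so test for the key rather than its value.
            self.parts.append("[x]" if "checked" in attr_map else "[ ]")

    def handle_data(self, data):
        self.parts.append(data)


def checkboxes_to_markdown(html):
    """Convert an HTML fragment to text with Markdown-style checkboxes."""
    parser = CheckboxToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

For example, `checkboxes_to_markdown('<input type=checkbox checked> done')` yields `[x] done`.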

* Test if mammoth resolves rlinks. (microsoft#1451)

* Upgrade mammoth to 1.11.0 (microsoft#1452)

* Bump versions of mammoth and pdfminer.six (microsoft#1492)

* Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched.

* [MS] Update PDF table extraction to support aligned Markdown (microsoft#1499)

* Added PDF table extraction feature with aligned Markdown (microsoft#1419)

* Add PDF test files and enhance extraction tests

- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.

* fix: update dependencies for PDF processing and improve table extraction logic

* Bumped version of pdfminer.six
---------

Authored-by: Ashok <ashh010101@gmail.com>

* Fix: PDF parsing doesn't support partially numbered lists (microsoft#1525)

* Fix: PDF parsing doesn't support partially numbered lists

* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file

* Refactor: Improve assertion formatting in partial numbering tests

* [MS] Extend table support for wide tables (microsoft#1552)

* feat: enhance PDF table extraction to support complex forms and add new test cases
* feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases
* fix: correct formatting and improve assertions in PDF table tests

* Add text/markdown to Accept header (microsoft#1554)

* Remove onnxruntime<=1.20.1 Windows pin (microsoft#1551)

* Bump version for release. (microsoft#1564)

* [MS] Add OCR layer service for embedded images and PDF scans (microsoft#1541)

* Add OCR test data and implement tests for various document formats

- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

* Enhance OCR functionality and validation in document converters

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.

* Add support for scanned PDFs with full-page OCR fallback and implement tests

* Bump version to 0.1.6b1 in __about__.py

* Refactor OCR services to support LLM Vision, update README and tests accordingly

* Add OCR-enabled converters and ensure consistent OCR format across document types

* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters

* Refactor exception imports for consistency across converters and tests

* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling

* Bump version to 0.1.6b1 in __about__.py

* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing

* Add comprehensive OCR test suite for various document formats

- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.

* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.

* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs

* Revert

* Revert

* Update READMEs

* Refactor import statements for consistency and improve formatting in converter and test files

* Fix O(n) memory growth in PDF conversion by calling page.close() afte… (microsoft#1612)

* Fix O(n) memory growth in PDF conversion by calling page.close() after each page

* Refactor PDF memory optimization tests for improved readability and consistency

* Add memory benchmarking tests for PDF conversion with page.close() fix

* Remove unnecessary blank lines in PDF memory optimization tests for cleaner code

* Bump version to 0.1.6b2 in __about__.py

* Update PDF conversion tests to include mimetype in StreamInfo
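
The idea behind the page.close() fix can be sketched with a stand-in page class. `FakePage` and `extract_all_text` are hypothetical names for this illustration, and the real converter and PDF library objects may differ; the point is only that each page is released before the next one is processed.

```python
class FakePage:
    """Stand-in for a PDF page object that holds cached data until closed."""

    def __init__(self, text):
        self._text = text
        self.closed = False

    def extract_text(self):
        return self._text

    def close(self):
        self.closed = True


def extract_all_text(pages):
    """Extract text page by page, closing each page as soon as it is done."""
    chunks = []
    for page in pages:
        try:
            chunks.append(page.extract_text())
        finally:
            # Release per-page resources before moving on, so at most one
            # page's cached data is alive at a time, even if extraction fails.
            page.close()
    return "\n\n".join(chunks)
```

Only the extracted strings accumulate across the loop; the heavier per-page state is dropped on every iteration instead of growing with the page count.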

* Updated warning about binding to non-local interfaces. (microsoft#1653)

* fix: handle deeply nested HTML that triggers RecursionError (microsoft#1644)

* fix: handle deeply nested HTML that triggers RecursionError (microsoft#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously this RecursionError was caught by
the top-level _convert() dispatcher, which then fell through to
PlainTextConverter — silently returning the raw HTML as 'markdown'
with no warning.

This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.

Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML

* refactor: address review feedback on RecursionError fallback

- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering
  sys.setrecursionlimit(200) instead of relying on depth=500 being
  sufficient on all platforms; original limit restored in finally block
- Add strict=True keyword argument to opt out of the plain-text
  fallback and let RecursionError propagate to the caller
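
The fallback pattern described in this commit can be sketched with the standard library alone. This is a toy model under stated assumptions, not the HtmlConverter code: `render_recursive` stands in for markdownify's recursive traversal, `plain_text_iterative` stands in for BeautifulSoup's get_text(), and the `convert` signature here is hypothetical apart from the `strict` keyword named above. The toy parser also does not special-case void elements such as `<br>`.

```python
import warnings
from html.parser import HTMLParser


class Node:
    """Minimal DOM node; children are Node instances or plain strings."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds the tree with an explicit stack, so parsing never recurses."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def render_recursive(node):
    """Stand-in for a recursive renderer: one call per nesting level."""
    if isinstance(node, str):
        return node
    inner = "".join([render_recursive(child) for child in node.children])
    return f"*{inner}*" if node.tag == "em" else inner


def plain_text_iterative(root):
    """Depth-independent fallback: walk the tree with an explicit stack."""
    pieces, stack = [], [root]
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            pieces.append(item)
        else:
            stack.extend(reversed(item.children))
    return "".join(pieces)


def convert(html, strict=False):
    """Render HTML; on RecursionError fall back to plain text unless strict."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    try:
        return render_recursive(builder.root)
    except RecursionError:
        if strict:
            raise  # caller opted out of the plain-text fallback
        warnings.warn(
            "HTML nesting too deep; returning plain text instead of Markdown",
            stacklevel=2,
        )
        return plain_text_iterative(builder.root)
```

Catching RecursionError inside the converter, rather than in a generic dispatcher, keeps the failure visible: the caller gets either a warning with plain text or, with `strict=True`, the original exception.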

* test: use result.markdown instead of deprecated result.text_content

---------

Co-authored-by: jigangz <jigangz@github.com>

* Clarify security posture in READMEs (microsoft#1807)

* feat: Add Azure Content Understanding converter (microsoft#1865)

* initial version

* improve mime type detection

* prebuilt-image custom analyzer route to image

* enhance cu priority over di

* fix: apply black formatting

* update cache of known prebuilt name and README improvement

* add test cases, run black

* update readme and deriving content_type from the resolved file_type

* update readme

* Bump version to 0.1.6 (microsoft#1914)

---------

Co-authored-by: afourney <adamfo@microsoft.com>
Co-authored-by: JonahDelman <jonah.delman@gmail.com>
Co-authored-by: t3tra <admin@t3tra.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: safen0s <99965118+safen0s@users.noreply.github.com>
Co-authored-by: Stefan Rink <stefan-rink@users.noreply.github.com>
Co-authored-by: [W]DOS_ <40659600+W-DOS0@users.noreply.github.com>
Co-authored-by: Utkarsh kumar <m83610278@gmail.com>
Co-authored-by: Ebrahim Tayabali <47640402+ebrahimHakimuddin@users.noreply.github.com>
Co-authored-by: Noah Zhu <118643158+Noah-Zhuhaotian@users.noreply.github.com>
Co-authored-by: Dmitry <98899785+mdqst@users.noreply.github.com>
Co-authored-by: Yuzhong Zhang <141388234+BetterAndBetterII@users.noreply.github.com>
Co-authored-by: Richard Ye <33409792+richardye101@users.noreply.github.com>
Co-authored-by: Meirna <61427701+Meirna-kamal@users.noreply.github.com>
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
Co-authored-by: lesyk <lesyk@users.noreply.github.com>
Co-authored-by: Bas Nijholt <basnijholt@gmail.com>
Co-authored-by: jigangz <115519042+jigangz@users.noreply.github.com>
Co-authored-by: jigangz <jigangz@github.com>
Co-authored-by: Chien Yuan Chang <ds.chienyuanchang@gmail.com>
Ahoo-Wang added a commit to Ahoo-Wang/markitdown that referenced this pull request Jun 1, 2026
* chore(api): remove package metadata

---------

Co-authored-by: afourney <adamfo@microsoft.com>
Co-authored-by: JonahDelman <jonah.delman@gmail.com>
Co-authored-by: t3tra <admin@t3tra.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: safen0s <99965118+safen0s@users.noreply.github.com>
Co-authored-by: Stefan Rink <stefan-rink@users.noreply.github.com>
Co-authored-by: [W]DOS_ <40659600+W-DOS0@users.noreply.github.com>
Co-authored-by: Utkarsh kumar <m83610278@gmail.com>
Co-authored-by: Ebrahim Tayabali <47640402+ebrahimHakimuddin@users.noreply.github.com>
Co-authored-by: Noah Zhu <118643158+Noah-Zhuhaotian@users.noreply.github.com>
Co-authored-by: Dmitry <98899785+mdqst@users.noreply.github.com>
Co-authored-by: Yuzhong Zhang <141388234+BetterAndBetterII@users.noreply.github.com>
Co-authored-by: Richard Ye <33409792+richardye101@users.noreply.github.com>
Co-authored-by: Meirna <61427701+Meirna-kamal@users.noreply.github.com>
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
Co-authored-by: lesyk <lesyk@users.noreply.github.com>
Co-authored-by: Bas Nijholt <basnijholt@gmail.com>
Co-authored-by: jigangz <115519042+jigangz@users.noreply.github.com>
Co-authored-by: jigangz <jigangz@github.com>
Co-authored-by: Chien Yuan Chang <ds.chienyuanchang@gmail.com>

@koyulin24747-arch left a comment

Thank you for your response. Please return my stolen MIT license; the person who stole it from me is misusing it.

@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
How do I move the MIT license back into the README?