Skip to content

[MS] Add OCR layer service for embedded images and PDF scans#1541

Open
lesyk wants to merge 23 commits intomicrosoft:mainfrom
lesyk:u/vilesyk/inline_image
Open

[MS] Add OCR layer service for embedded images and PDF scans#1541
lesyk wants to merge 23 commits intomicrosoft:mainfrom
lesyk:u/vilesyk/inline_image

Conversation

@lesyk
Copy link
Contributor

@lesyk lesyk commented Jan 26, 2026

This pull request introduces the new markitdown-ocr plugin, which adds LLM Vision-based OCR capabilities to MarkItDown. The plugin enables extraction of text from images embedded in PDF, DOCX, PPTX, and XLSX files using any OpenAI-compatible client, without requiring additional ML libraries or binaries.

#1344

Output for testing files:
---

## docx_complex_layout.docx

```markdown
Complex Document

|  |  |
| --- | --- |
| Feature | Status |
| Authentication | Active |
| Encryption | Enabled |

Security notice:

*[Image OCR]
NOTICE: SSL Certificate Expires 2025-12-31
[End OCR]*

docx_image_end.docx

Report

Main findings of the report.

Details and analysis.

Recommendations.

*[Image OCR]
FOOTER: Document ID: DOC-2024-001
[End OCR]*

docx_image_middle.docx

# Introduction

This is the introduction section.

We will see an image below.

*[Image OCR]
FIGURE 1: System Architecture
[End OCR]*

# Analysis

This section comes after the image.

docx_image_start.docx

Document with Image at Start

*[Image OCR]
HEADER: Company Logo - ACME Corp
[End OCR]*

This is the main content after the header image.

More text content here.

docx_multipage.docx

# Page 1 - Mixed Content

This is the first paragraph on page 1.

BEFORE IMAGE: Important content appears here.

*[Image OCR]
DOCX PAGE 1: Section Title
[End OCR]*

AFTER IMAGE: This content follows the image.

More text on page 1.

# Page 2 - Image at End

Content on page 2.

Multiple paragraphs of text.

Building up to the image...

Final paragraph before image.

*[Image OCR]
DOCX PAGE 2: Footer Note
[End OCR]*

# Page 3 - Image at Start

*[Image OCR]
DOCX PAGE 3: Header Image
[End OCR]*

Content that follows the header image.

AFTER IMAGE: This text is after the image.

docx_multiple_images.docx

Multi-Image Document

First section

*[Image OCR]
Chart 1: Revenue Growth
[End OCR]*

Second section with another image

*[Image OCR]
Chart 2: Customer Satisfaction
[End OCR]*

Conclusion

pdf_complex_layout.pdf

## Page 1

Complex Layout Document

Table:

ItemQuantity

*[Image OCR]
WARNING: Handle with care
[End OCR]*

Widget A5

pdf_image_end.pdf

## Page 1

Main Content

This is the main text content.

The image will appear at the end.

Keep reading...

*[Image OCR]
END: Contact: support@example.com
[End OCR]*

pdf_image_middle.pdf

## Page 1

Section 1: Introduction

This document contains an image in the middle.

Here is some introductory text.

*[Image OCR]
MIDDLE: Product Code: ABC-12345
[End OCR]*

Section 2: Details

This text appears AFTER the image.

pdf_image_start.pdf

## Page 1

*[Image OCR]
START: This is the first image in PDF
[End OCR]*

This is text BEFORE the image.

The image should appear above this text.

This is more content after the image.

pdf_multiple_images.pdf

## Page 1

Document with Multiple Images

*[Image OCR]
Image 1: Serial Number SN-001
[End OCR]*

Text between first and second image.

*[Image OCR]
Image 2: Model Number M-2024
[End OCR]*

Final text after all images.

pdf_scanned_invoice.pdf

## Page 1

*[Image OCR]
# INVOICE

Company: TechCorp Industries

Invoice Number: INV-2024-001

Date: January 15, 2024

BILL TO:

Acme Corporation

123 Main Street

New York, NY 10001

DESCRIPTION:

Software Development Services

Professional Consulting

Technical Support

TOTAL AMOUNT DUE: $5,000.00
[End OCR]*

pdf_scanned_meeting_minutes.pdf

## Page 1

*[Image OCR]
# MEETING MINUTES

Date: March 10, 2024

Attendees: John Smith, Jane Doe, Bob Johnson

AGENDA ITEMS

1. Project Status Update

- Phase 1 completed successfully

- Phase 2 on track for Q2 delivery

2. Budget Review

- Current spend: 75% of allocated budget

- Forecast: Within budget

3. Action Items

- John: Finalize requirements document
[End OCR]*

pdf_scanned_minimal.pdf

## Page 1

*[Image OCR]
# NOTICE

This is a minimal test document

with just a few lines of text.

It should still be processed correctly.
[End OCR]*

pdf_scanned_report.pdf

## Page 1

*[Image OCR]
# TECHNICAL REPORT

# Page 1

EXECUTIVE SUMMARY

This document presents the findings of our

technical analysis conducted in Q1 2024.

Key highlights include:

- System performance improvements

- Security enhancements

- User experience updates

The following pages detail our methodology

and recommendations.
[End OCR]*

## Page 2

*[Image OCR]
# TECHNICAL REPORT

# Page 2

METHODOLOGY

Our analysis involved three phases:

1. Data Collection

    Gathered metrics from production systems

    over a 90-day period.

2. Performance Analysis

    Identified bottlenecks and optimization

    opportunities.

3. Security Review

    Conducted vulnerability assessment and
[End OCR]*

## Page 3

*[Image OCR]
# TECHNICAL REPORT

# Page 3

RECOMMENDATIONS

Based on our findings, we recommend:

1. Implement caching layer to improve

    response times by 40%.

2. Upgrade authentication system to

    support multi-factor authentication.

3. Optimize database queries to reduce

    server load by 30%.

CONCLUSION
[End OCR]*

pdf_scanned_sales_report.pdf

## Page 1

*[Image OCR]
# QUARTERLY SALES REPORT

Q1 2024 Performance Summary

REGIONAL BREAKDOWN

Region        Revenue        Growth
North America  $2.5M         +15%
Europe        $1.8M         +22%
Asia Pacific  $3.2M         +35%
Latin America $0.9M         +12%

TOTAL         $8.4M         +23%

Top performing products:

- Product A: $3.1M

- Product B: $2.7M
[End OCR]*

pptx_complex_layout.pptx

\n\n<!-- Slide number: 1 -->\n# Product Comparison\n\nOur products lead the market\n
*[Image OCR]
Market Share: 35%
[End OCR]*

pptx_image_end.pptx

\n\n<!-- Slide number: 1 -->\n# Presentation\n\n\n\n<!-- Slide number: 2 -->\n# Thank You\n\n
*[Image OCR]
Contact: info@techcorp.com
[End OCR]*

pptx_image_middle.pptx

\n\n<!-- Slide number: 1 -->\n# Introduction\n\n\n\n<!-- Slide number: 2 -->\n# Architecture\n\n
*[Image OCR]
Diagram: System Components
[End OCR]*\n\n<!-- Slide number: 3 -->\n# Conclusion\n\n

pptx_image_start.pptx

\n\n<!-- Slide number: 1 -->\n# Welcome\n\n
*[Image OCR]
Company: TechCorp Inc.
[End OCR]*

pptx_multiple_images.pptx

\n\n<!-- Slide number: 1 -->\n# \n
*[Image OCR]
Before: 50% Efficiency
[End OCR]*

*[Image OCR]
After: 95% Efficiency
[End OCR]*

xlsx_complex_layout.xlsx

## Complex Report

| Annual Report 2024 | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Month | Sales |
| Jan | 1000 |
| Feb | 1200 |
| NaN | NaN |
| Total | 2200 |

### Images in this sheet:

*[Image OCR]
Figure 1: Monthly Trend
[End OCR]*

*[Image OCR]
Figure 2: Year Overview
[End OCR]*

## Customers

| Customer Metrics | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| New Customers | 250 |
| Retention Rate | 92% |

### Images in this sheet:

*[Image OCR]
Customer Growth: +25% Year-over-Year
[End OCR]*

## Regions

| Regional Breakdown | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Region | Revenue |
| North | $800K |
| South | $600K |

### Images in this sheet:

*[Image OCR]
Regional Map: Top Perform
[End OCR]*

xlsx_image_end.xlsx

## Sheet

| Financial Summary | Unnamed: 1 |
| --- | --- |
| Total Revenue | $500,000 |
| Total Expenses | $300,000 |
| Net Profit | $200,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Signature: | NaN |

### Images in this sheet:

*[Image OCR]
Approved by: John Doe, CFO
[End OCR]*

## Budget

| Budget Allocation | Unnamed: 1 |
| --- | --- |
| Marketing | $100,000 |
| R&D | $150,000 |
| Operations | $50,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Approved: | NaN |

### Images in this sheet:

*[Image OCR]
viewed by: Jane Smith, CTO
[End OCR]*

xlsx_image_middle.xlsx

## Revenue

| Q1 Report | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Revenue | $50,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Profit Margin | 40% |

### Images in this sheet:

*[Image OCR]
Growth Trend: +15%
[End OCR]*

## Expenses

| Expense Breakdown | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Expenses | $30,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Savings | $5,000 |

### Images in this sheet:

*[Image OCR]
Cost Analysis: Optimized
[End OCR]*

xlsx_image_start.xlsx

## Sales Q1

| Product | Sales |
| --- | --- |
| Widget A | 100 |
| Widget B | 150 |

### Images in this sheet:

*[Image OCR]
Q1 Sales Chart
[End OCR]*

## Forecast Q2

| Projected Sales | Unnamed: 1 |
| --- | --- |
| Widget A | 120 |
| Widget B | 180 |

### Images in this sheet:

*[Image OCR]
Q2 Forecast: +20% Growth
[End OCR]*

xlsx_multiple_images.xlsx

## Overview

| Dashboard |
| --- |
| Status: Active |
| NaN |
| NaN |
| NaN |
| NaN |
| Performance Summary |

### Images in this sheet:

*[Image OCR]
KPI: 95% Success Rate
[End OCR]*

*[Image OCR]
Uptime: 99.9%
[End OCR]*

## Details

| Detailed Metrics |
| --- |
| System Health |

### Images in this sheet:

*[Image OCR]
Metric: Response Time 50ms
[End OCR]*

## Summary

| Quarter Summary |
| --- |
| Overall Performance |

### Images in this sheet:

*[Image OCR]
Q1 Results: Exceeded Goals
[End OCR]*

</details>

lesyk and others added 4 commits January 26, 2026 19:44
- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.
@lesyk lesyk marked this pull request as ready for review January 27, 2026 10:21
@lesyk lesyk changed the title Add OCR test data and implement tests for various document formats Add OCR service for embedded images and PDF scans Jan 27, 2026
@lesyk lesyk changed the title Add OCR service for embedded images and PDF scans Add OCR layer service for embedded images and PDF scans Jan 27, 2026
@lesyk lesyk changed the title Add OCR layer service for embedded images and PDF scans [MS] Add OCR layer service for embedded images and PDF scans Jan 27, 2026
lesyk and others added 19 commits February 12, 2026 09:55
…nctionality across DOCX, PDF, PPTX, and XLSX converters
- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.
… and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants