olmOCR-Bench
Unit-test style evaluation for PDF linearization quality
The Key Insight
olmOCR-Bench is not a character-level OCR accuracy benchmark like classic OCR evals. It's a benchmark for PDF linearization quality: “Given a whole PDF page, can your system output clean text/markdown that preserves meaning, structure, order, tables, and math, and doesn't hallucinate garbage?”
olmOCR-Bench Leaderboard
Benchmark results compiled from the olmOCR GitHub and the LightOnOCR arXiv paper. Scores are macro-averaged across document categories. Higher is better. Models marked with * use pipeline approaches.
| Model | Size | ArXiv | Old Scans Math | Tables | Old Scans | Headers & Footers | Multi Column | Tiny Text | Base | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| LightOnOCR-2-1B | 1B | 89.6 | 85.6 | 89 | 42.2 | — | 84.8 | 91.4 | 99.6 | 83.2±0.9 |
| Chandra OCR 0.1.0* | — | 82.2 | 80.3 | 88 | 50.4 | 90.8 | 81.2 | 92.3 | 99.9 | 83.1±0.9 |
| Infinity-Parser 7B* | — | 84.4 | 83.8 | 85 | 47.9 | 88.7 | 84.2 | 86.4 | 99.8 | 82.5 |
| olmOCR v0.4.0 | — | 83 | 82.3 | 84.9 | 47.7 | 96.1 | 83.7 | 81.9 | 99.7 | 82.4±1.1 |
| LightOnOCR-2-1B-ocr-soup | 1B | 86.8 | 81.2 | 89 | 45.4 | — | 84.2 | 90.3 | 99.7 | 82.4±0.9 |
| LightOnOCR-2-1B-base | 1B | 84.9 | 80.3 | 86.7 | 47 | — | 84.6 | 89.1 | 99.8 | 81.8±0.9 |
| Chandra-9B | 9B | 82.2 | 80.3 | 88 | 50.4 | — | 81.2 | 92.3 | 99.9 | 81.7±0.9 |
| LightOnOCR-2-1B-bbox-base | 1B | 84.6 | 78.6 | 84.7 | 46 | — | 83.8 | 88 | 99.8 | 80.8±0.9 |
| LightOnOCR-2-1B-bbox-soup | 1B | 86.1 | 77.9 | 88.2 | 41.2 | — | 85.4 | 87.3 | 99.7 | 80.8±0.9 |
| olmOCR-2-8B | 8B | 82.9 | 82.1 | 84.3 | 48.3 | — | 84.3 | 81.4 | 99.7 | 80.4±1.1 |
| LightOnOCR-2-1B-bbox | 1B | 86.9 | 74.7 | 88.6 | 39.7 | — | 85 | 86.4 | 99.8 | 80.2±0.9 |
| PaddleOCR-VL* | 0.9B | 85.7 | 71 | 84.1 | 37.8 | 97 | 79.9 | 85.7 | 98.5 | 80±1 |
| Mistral OCR 3 API | — | 85.6 | 69.7 | 85.5 | 43.5 | — | 81.2 | 88.5 | 99.7 | 79.1±1 |
| LightOnOCR-1B-1025-GRPO | 1B | 86.5 | 73.8 | 74.5 | 32.9 | — | 85.1 | 91.6 | 99.7 | 77.7±1 |
| dots.ocr | 3B | 82.1 | 64.2 | 88.3 | 40.9 | — | 82.4 | 81.2 | 99.5 | 76.9±1 |
| Marker 1.10.1 | — | 83.8 | 66.8 | 72.9 | 33.5 | 86.6 | 80 | 85.7 | 99.3 | 76.1±1.1 |
| olmOCR v0.3.0 | 8B | 78.6 | 79.9 | 72.9 | 43.9 | — | 77.3 | 81.2 | 98.9 | 76.1±1.1 |
| LightOnOCR-1B-1025 | 1B | 81.4 | 71.6 | 76.4 | 35.2 | — | 80 | 88.7 | 99.5 | 76.1±1.1 |
| DeepSeek-OCR | 3B | 77.2 | 73.6 | 80.2 | 33.3 | 96.1 | 66.4 | 79.4 | 99.8 | 75.7±1 |
| MinerU 2.5.4* | — | 76.6 | 54.6 | 84.9 | 33.7 | 96.6 | 78.2 | 83.5 | 93.7 | 75.2±1.1 |
| MonkeyOCR-pro-3B | 3B | 83.8 | 68.8 | 74.7 | 36.1 | — | 76.6 | 80.1 | 95.3 | 73.6±1 |
| Mistral OCR API | — | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72±1.1 |
| MinerU2.5 | 1.2B | 76.6 | 54.6 | 84.9 | 33.7 | — | 78.2 | 81.2 | 83.5 | 70.4±1 |
| Nanonets-OCR2-3B | — | 75.4 | 46.1 | 86.8 | 40.9 | 32.1 | 81.9 | 93 | 99.6 | 69.5±1.1 |
| MonkeyOCR-pro-1.2B | 1.2B | 80.5 | 62.9 | 71.1 | 32.9 | — | 68.3 | 74 | 92.6 | 68.9±1.1 |
| Qwen2.5-VL-8B | 8B | 63.1 | 65.7 | 67.3 | 38.6 | — | 68.3 | 49.1 | 98.3 | 64.3±1.2 |
| Gemini Flash 2 | — | 32.1 | 56.3 | 61.4 | 27.8 | — | 58.7 | 84.4 | 94 | 59.2±1.1 |
Data source: Allen AI olmOCR GitHub • HuggingFace Dataset • LightOnOCR arXiv Paper. We did not perform these benchmarks—results are reproduced from the original sources for informational purposes. Last updated: 2026-01-22. For our own reproduced runs and notes on methodology differences, read What We Learned Reproducing olmOCR-Bench.
Understanding the Score Categories
The overall score is a macro-average across 8 document categories. Each category tests a specific challenge where OCR pipelines commonly fail:
ArXiv
Born-digital academic papers with dense math notation, inline equations, and complex formatting. Tests formula preservation and LaTeX-style output quality.
Old Scans Math
Scanned historical math textbooks with degraded print quality. Combines OCR difficulty of old documents with math formula recognition.
Tables
Complex table structures with merged cells, nested headers, and spanning rows/columns. Tests whether cell relationships are preserved in output.
Old Scans
Historical letters and typewritten documents with noise, fading, and artifacts. The hardest category—most models score below 50% here.
Headers & Footers
Tests removal of repeated page elements (page numbers, running headers). High scores mean the model correctly strips these from output.
Multi Column
Documents with multiple columns where naive left-to-right reading produces garbage. Tests reading order preservation.
Tiny Text
Dense small fonts like dictionaries, reference tables, and fine print. Tests character-level accuracy under challenging visual conditions.
Base
Clean, well-formatted documents. Sanity check—if a model fails here (below 95%), something is fundamentally broken.
Key insight: The “Old Scans” category is where most models struggle (30-50% range). If your use case involves historical or degraded documents, pay extra attention to this column rather than the overall score.
What It Actually Measures
The benchmark evaluates PDF-to-text pipelines using 7,010 pass/fail “unit tests” across approximately 1,403 PDF pages. Instead of producing a perfect “gold text” transcript for every page (which is expensive and brittle), the benchmark authors write small, verifiable assertions about each page.
The core questions it answers:
- Did you preserve the important content?
- Did you skip the junk (headers/footers/page numbers)?
- Did the reading order make sense?
- Did you keep tables and math usable?
- Did you hallucinate or produce garbage output?
This makes it fundamentally different from traditional OCR benchmarks that measure character or word error rates. It tests downstream text usability for applications like RAG, search, summarization, and compliance extraction.
Document Categories
The benchmark authors explicitly chose document types where OCR/linearization tends to fail:
| Category | What It Tests |
|---|---|
| arXiv_math | Born-digital PDFs with heavy formula content |
| old_scans_math | Public domain scanned math textbooks |
| tables_tests | Complex table structures and cell relationships |
| old_scans | Historical letters and typewritten documents |
| headers_footers | Testing removal of repeated page elements |
| multi_column | Multi-column layouts and reading order |
| long_tiny_text | Dense tiny fonts like dictionaries/references |
Notably, 3,385 of the 7,010 tests (about 48%) are math formula tests. Math is heavily represented because it's where most pipelines break.
The 5 Test Types
1. Text Presence
Does a target snippet appear in the output? Supports fuzzy matching and position constraints (first/last N characters). Case-sensitive by default.
2. Text Absence
Ensures some snippet doesn't appear—used for headers, footers, and page numbers that should be stripped. Fuzzy matching with positional constraints. Not case-sensitive by default.
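The presence and absence checks described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the function names, the similarity threshold, and the use of `difflib` for fuzzy matching are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def best_window_similarity(snippet: str, text: str) -> float:
    """Highest similarity between snippet and any same-length window of text."""
    if not snippet:
        return 1.0
    n = len(snippet)
    if len(text) < n:
        return SequenceMatcher(None, snippet, text, autojunk=False).ratio()
    return max(
        SequenceMatcher(None, snippet, text[i:i + n], autojunk=False).ratio()
        for i in range(len(text) - n + 1)
    )


def text_present(snippet, text, threshold=0.9, case_sensitive=True,
                 first_n=None, last_n=None) -> bool:
    """Fuzzy presence check, optionally limited to the first/last N characters."""
    if first_n is not None:
        text = text[:first_n]
    if last_n is not None:
        text = text[max(0, len(text) - last_n):]
    if not case_sensitive:
        snippet, text = snippet.lower(), text.lower()
    return best_window_similarity(snippet, text) >= threshold


def text_absent(snippet, text, threshold=0.9, case_sensitive=False,
                first_n=None, last_n=None) -> bool:
    """Absence is the negation of presence; case-insensitive by default here."""
    return not text_present(snippet, text, threshold, case_sensitive, first_n, last_n)


page = "Annual Report 2023\nRevenue grew by 12 percent in the third quarter.\nPage 7"
body_only = "Revenue grew by 12 percent in the third quarter."
```

With this sketch, a one-character OCR slip such as `l2` for `12` still counts as present, while a footer like `Page 7` fails the absence check if it survives in the last few characters of the output.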
3. Natural Reading Order
Checks ordering between two spans—“before” must appear before “after” in the output. Critical for multi-column layouts where naive OCR often scrambles content.
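A reading-order check of this kind can be illustrated with plain string positions. The sketch below assumes exact (non-fuzzy) matching and a hypothetical helper name; the benchmark's own matching rules may differ.

```python
def in_reading_order(before: str, after: str, text: str) -> bool:
    """True if `before` occurs somewhere ahead of an occurrence of `after`."""
    first_before = text.find(before)
    last_after = text.rfind(after)
    return first_before != -1 and last_after != -1 and first_before < last_after


# A two-column page read correctly: the whole left column, then the right column.
good = ("Left column first paragraph. Left column second paragraph. "
        "Right column first paragraph.")

# The same page read line by line across both columns.
scrambled = ("Left column first paragraph. Right column first paragraph. "
             "Left column second paragraph.")
```

Here the test "`Left column second` comes before `Right column first`" passes on `good` and fails on `scrambled`, which is the failure mode a naive left-to-right reader produces.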
4. Table Accuracy
Checks that a table contains expected values and that neighbor relationships match (above/below/left/right). Supports both Markdown and HTML table formats. Harder cases with rowspan/colspan rely on HTML structure.
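To make the neighbor-relationship idea concrete, here is a minimal sketch for Markdown tables only. The parser and the function names are assumptions for illustration; it ignores HTML tables and rowspan/colspan, which the harder cases depend on.

```python
def parse_markdown_table(md: str) -> list[list[str]]:
    """Turn a simple pipe-delimited Markdown table into a list of rows."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the |---|---| separator row.
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    return rows


def cell_has_neighbor(rows, value: str, direction: str, expected: str) -> bool:
    """True if some cell equal to `value` has `expected` as its neighbor."""
    offsets = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = offsets[direction]
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell != value:
                continue
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(rows) and 0 <= nc < len(rows[nr]) and rows[nr][nc] == expected:
                return True
    return False


table = """
| Model | Tables | Overall |
|---|---|---|
| A | 89 | 83.2 |
| B | 88 | 83.1 |
"""
rows = parse_markdown_table(table)
```

A test such as "the cell `89` has `Tables` above it and `A` to its left" passes only if the output kept both the row and the column alignment, which is why a table flattened into loose text fails even when every value is present.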
5. Math Formula Accuracy
The sophisticated part: they render a reference equation with KaTeX, extract symbol bounding boxes, and verify whether the OCR output contains a matching symbol layout. Checks relative positions like “∫ left of x” or “3 above −3”.
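The relative-position idea can be sketched once symbol bounding boxes are available. The example below skips the KaTeX rendering step entirely and starts from hand-written boxes; the `Box` type, the relation names, and the subset comparison are assumptions for illustration, not the benchmark's actual algorithm.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Box:
    symbol: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward, as in screen coordinates)
    x1: float  # right edge
    y1: float  # bottom edge


def relations(boxes: list[Box]) -> set[tuple[str, str, str]]:
    """Collect (symbol, relation, symbol) triples for every ordered pair of boxes."""
    found = set()
    for a, b in permutations(boxes, 2):
        if a.x1 <= b.x0:
            found.add((a.symbol, "left_of", b.symbol))
        if a.y1 <= b.y0:
            found.add((a.symbol, "above", b.symbol))
    return found


def layout_matches(reference: list[Box], candidate: list[Box]) -> bool:
    """Every relation in the reference layout must also hold in the candidate."""
    return relations(reference) <= relations(candidate)


reference = [Box("∫", 0, 0, 8, 20), Box("x", 10, 5, 18, 15)]
scaled = [Box("∫", 0, 0, 16, 40), Box("x", 20, 10, 36, 30)]   # same layout, larger font
swapped = [Box("x", 0, 5, 8, 15), Box("∫", 10, 0, 18, 20)]    # symbols in wrong order
fraction = [Box("3", 0, 0, 10, 10), Box("7", 0, 14, 10, 24)]  # numerator over denominator
```

Comparing relations rather than absolute coordinates is what lets two differently written LaTeX strings pass as long as they render to the same arrangement of symbols.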
Important detail: Before comparison, they normalize strings heavily (whitespace, markdown formatting, quotes/hyphens, Unicode NFC). This reduces “format noise” so they test meaning more than exact typography.
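A rough sketch of that kind of normalization is shown below. The exact rules the benchmark applies are not listed here, so the character mapping and the Markdown patterns are assumptions chosen to cover the four areas mentioned (whitespace, markdown formatting, quotes/hyphens, Unicode NFC).

```python
import re
import unicodedata

# Map typographic quotes and dash variants to plain ASCII equivalents.
_PUNCT = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2010": "-", "\u2011": "-",   # hyphen, non-breaking hyphen
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2212": "-",                  # minus sign
})


def normalize(text: str) -> str:
    """Reduce formatting noise before two strings are compared."""
    text = unicodedata.normalize("NFC", text)               # compose combining marks
    text = text.translate(_PUNCT)                           # unify quotes and hyphens
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)  # drop heading markers
    text = re.sub(r"\*\*|\*|`", "", text)                   # drop bold/italic/code markers
    text = re.sub(r"\s+", " ", text)                        # collapse all whitespace
    return text.strip()
```

Applying `normalize` to both the expected snippet and the model output before matching means a model is not failed for choosing curly quotes, bold text, or a different line wrap.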
How Scoring Works
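Each test is pass/fail. Per category, you get the percentage of tests passed:
category score = (tests passed in the category ÷ total tests in the category) × 100
The final score is a macro-average across document categories:
overall score = (sum of the category scores) ÷ (number of categories)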
This means categories are weighted equally regardless of test count. A category with 100 tests counts the same as one with 1,000 tests. This avoids a “math dominates everything” outcome purely because math has the most tests.
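The difference between macro-averaging and pooling all tests together (micro-averaging) is easiest to see with numbers. The category names and counts below are made up for illustration.

```python
def category_score(passed: int, total: int) -> float:
    """Percentage of tests passed within one category."""
    return 100.0 * passed / total


def macro_average(results: dict[str, tuple[int, int]]) -> float:
    """Mean of per-category scores: every category weighs the same."""
    scores = [category_score(p, t) for p, t in results.values()]
    return sum(scores) / len(scores)


def micro_average(results: dict[str, tuple[int, int]]) -> float:
    """Pooled pass rate: categories with more tests weigh more."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return 100.0 * passed / total


# (passed, total) per category; hypothetical numbers.
results = {"math": (800, 1000), "old_scans": (40, 100)}
macro = macro_average(results)   # (80 + 40) / 2 = 60.0
micro = micro_average(results)   # 840 / 1100, roughly 76.4
```

In this made-up case the pooled score hides the weak `old_scans` result behind the larger math category, while the macro-average pulls the overall number down to reflect it.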
They also report confidence intervals via bootstrapping in their results.
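How the published intervals were computed is not detailed here, but a percentile bootstrap over the macro-average could look like the following sketch. The resampling scheme (resampling tests within each category), the number of resamples, and the function name are assumptions.

```python
import random


def bootstrap_ci(outcomes: dict[str, list[bool]], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the macro-averaged pass rate."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        category_scores = []
        for results in outcomes.values():
            # Resample this category's tests with replacement.
            sample = rng.choices(results, k=len(results))
            category_scores.append(100.0 * sum(sample) / len(sample))
        scores.append(sum(category_scores) / len(category_scores))
    scores.sort()
    low = scores[int((alpha / 2) * n_resamples)]
    high = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical pass/fail outcomes: 80% in one category, 40% in another.
outcomes = {
    "math": [True] * 80 + [False] * 20,
    "old_scans": [True] * 40 + [False] * 60,
}
low, high = bootstrap_ci(outcomes)
```

The width of the interval is a reminder that two models a few tenths of a point apart on the leaderboard may not be meaningfully different.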
What It's Good At
If your goal is LLM-ready document text (RAG, search, summarization, compliance extraction), olmOCR-Bench is genuinely useful because it rewards:
- Not hallucinating content that isn't there
- Not dropping important content
- Preserving sensible reading order
- Keeping tables and math usable
- Handling ugly scans and weird layouts
And since it's unit-test based, it's:
- Deterministic — same input always produces same score
- Cheap to run — just string matching, no GPU required
- Not “LLM-as-judge” — avoids bias toward the evaluator model
Limitations & Criticisms
LlamaIndex published a detailed critique that raises valid points:
Domain Skew
The benchmark is heavily skewed toward academic/scientific PDFs, not real business documents. If you process invoices, contracts, or forms, high olmOCR-Bench scores may not transfer to your use case.
Coarse Pass/Fail
Tests are binary—you pass or fail. This doesn't measure how wrong something is. A minor typo and a completely garbled output both count as failures.
Exactness Brittleness
Even with normalization, some checks can penalize harmless variations. Edge cases in string matching can create false failures.
Opinionated Header/Footer Removal
The benchmark assumes you want to remove headers/footers. But sometimes you need them—depends on your use case. This design choice is baked in.
Data Generation Bias
Some tests were created using LLM pipelines. Tables were filtered by what Gemini could parse. This can skew what “counts as hard” toward what existing models struggle with.
English-Focused
The benchmark is primarily English. Multilingual document processing isn't tested.
Context matters: LlamaIndex's incentives lean toward “benchmarks should match enterprise doc workflows,” while olmOCR is explicitly optimizing for PDF-to-LLM-text at scale. Different target, different “correct.”
How to Use It Intelligently
Use olmOCR-Bench as a strong signal, not the final word.
It's great for quickly telling:
- Does this model collapse on math?
- Does it scramble multi-column reading order?
- Does it destroy tables?
- Does it hallucinate or omit chunks?
But always add a “your docs” slice. The best practice:
- Keep olmOCR-Bench as a public baseline
- Add a private suite of your own PDFs + your own unit tests
That avoids optimizing for a leaderboard that doesn't match your actual pain points.
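A private suite does not need much machinery. The sketch below shows one possible shape: tests as plain dictionaries, run against your pipeline's text output and reported per category. The field names (`doc`, `category`, `type`, and so on) are hypothetical and are not the olmOCR-Bench test format; matching is exact for brevity.

```python
from collections import defaultdict


def run_test(test: dict, text: str) -> bool:
    """Evaluate one pass/fail assertion against a document's extracted text."""
    kind = test["type"]
    if kind == "present":
        return test["text"] in text
    if kind == "absent":
        return test["text"] not in text
    if kind == "order":
        before, after = text.find(test["before"]), text.rfind(test["after"])
        return before != -1 and after != -1 and before < after
    raise ValueError(f"unknown test type: {kind}")


def run_suite(tests: list[dict], outputs: dict[str, str]) -> dict[str, float]:
    """Return the pass rate (in percent) for each category."""
    passed, total = defaultdict(int), defaultdict(int)
    for test in tests:
        category = test["category"]
        total[category] += 1
        passed[category] += run_test(test, outputs[test["doc"]])
    return {c: 100.0 * passed[c] / total[c] for c in total}


# Hypothetical pipeline output and tests for one invoice.
outputs = {"invoice_001": "Invoice 1042\nTotal due: 1,250.00 EUR\nThank you"}
tests = [
    {"doc": "invoice_001", "category": "invoices", "type": "present",
     "text": "Total due: 1,250.00 EUR"},
    {"doc": "invoice_001", "category": "invoices", "type": "absent",
     "text": "Page 1 of 1"},
    {"doc": "invoice_001", "category": "invoices", "type": "order",
     "before": "Invoice 1042", "after": "Total due"},
    {"doc": "invoice_001", "category": "payment", "type": "present",
     "text": "IBAN"},
]
report = run_suite(tests, outputs)
```

Keeping the per-category report separate from the public leaderboard score makes it easy to see when a model that ranks well publicly still misses the fields you care about.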
What olmOCR-Bench Is Not
Don't expect these capabilities:
- Field extraction benchmark — no invoice totals, line-items, etc.
- Multilingual OCR quality — primarily English
- Fine-grained character accuracy — no CER/WER metrics
- Layout reconstruction fidelity — not about exact PDF structure preservation
It's a pragmatic eval for downstream text usability, not an exhaustive OCR quality assessment.