olmOCR-Bench

Unit-test style evaluation for PDF linearization quality

The Key Insight

olmOCR-Bench is not a character-level OCR accuracy benchmark like classic OCR evals. It's a benchmark for PDF linearization quality: “Given a whole PDF page, can your system output clean text/markdown that preserves meaning, structure, order, tables, and math, without hallucinating garbage?”

olmOCR-Bench Leaderboard

Benchmark results compiled from the olmOCR GitHub and the LightOnOCR arXiv paper. Scores are macro-averaged across document categories. Higher is better. Models marked with * use pipeline approaches.

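| Model | Size | ArXiv | Old Scans Math | Tables | Old Scans | Headers & Footers | Multi Column | Tiny Text | Base | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| LightOnOCR-2-1B | 1B | 89.6 | 85.6 | 89 | 42.2 | - | 84.8 | 91.4 | 99.6 | 83.2±0.9 |
| Chandra OCR 0.1.0* | - | 82.2 | 80.3 | 88 | 50.4 | 90.8 | 81.2 | 92.3 | 99.9 | 83.1±0.9 |
| Infinity-Parser 7B* | - | 84.4 | 83.8 | 85 | 47.9 | 88.7 | 84.2 | 86.4 | 99.8 | 82.5 |
| olmOCR v0.4.0 | - | 83 | 82.3 | 84.9 | 47.7 | 96.1 | 83.7 | 81.9 | 99.7 | 82.4±1.1 |
| LightOnOCR-2-1B-ocr-soup | 1B | 86.8 | 81.2 | 89 | 45.4 | - | 84.2 | 90.3 | 99.7 | 82.4±0.9 |
| LightOnOCR-2-1B-base | 1B | 84.9 | 80.3 | 86.7 | 47 | - | 84.6 | 89.1 | 99.8 | 81.8±0.9 |
| Chandra-9B | 9B | 82.2 | 80.3 | 88 | 50.4 | - | 81.2 | 92.3 | 99.9 | 81.7±0.9 |
| LightOnOCR-2-1B-bbox-base | 1B | 84.6 | 78.6 | 84.7 | 46 | - | 83.8 | 88 | 99.8 | 80.8±0.9 |
| LightOnOCR-2-1B-bbox-soup | 1B | 86.1 | 77.9 | 88.2 | 41.2 | - | 85.4 | 87.3 | 99.7 | 80.8±0.9 |
| olmOCR-2-8B | 8B | 82.9 | 82.1 | 84.3 | 48.3 | - | 84.3 | 81.4 | 99.7 | 80.4±1.1 |
| LightOnOCR-2-1B-bbox | 1B | 86.9 | 74.7 | 88.6 | 39.7 | - | 85 | 86.4 | 99.8 | 80.2±0.9 |
| PaddleOCR-VL* | 0.9B | 85.7 | 71 | 84.1 | 37.8 | 97 | 79.9 | 85.7 | 98.5 | 80±1 |
| Mistral OCR 3 API | - | 85.6 | 69.7 | 85.5 | 43.5 | - | 81.2 | 88.5 | 99.7 | 79.1±1 |
| LightOnOCR-1B-1025-GRPO | 1B | 86.5 | 73.8 | 74.5 | 32.9 | - | 85.1 | 91.6 | 99.7 | 77.7±1 |
| dots.ocr | 3B | 82.1 | 64.2 | 88.3 | 40.9 | - | 82.4 | 81.2 | 99.5 | 76.9±1 |
| Marker 1.10.1 | - | 83.8 | 66.8 | 72.9 | 33.5 | 86.6 | 80 | 85.7 | 99.3 | 76.1±1.1 |
| olmOCR v0.3.0 | 8B | 78.6 | 79.9 | 72.9 | 43.9 | - | 77.3 | 81.2 | 98.9 | 76.1±1.1 |
| LightOnOCR-1B-1025 | 1B | 81.4 | 71.6 | 76.4 | 35.2 | - | 80 | 88.7 | 99.5 | 76.1±1.1 |
| DeepSeek-OCR | 3B | 77.2 | 73.6 | 80.2 | 33.3 | 96.1 | 66.4 | 79.4 | 99.8 | 75.7±1 |
| MinerU 2.5.4* | - | 76.6 | 54.6 | 84.9 | 33.7 | 96.6 | 78.2 | 83.5 | 93.7 | 75.2±1.1 |
| MonkeyOCR-pro-3B | 3B | 83.8 | 68.8 | 74.7 | 36.1 | - | 76.6 | 80.1 | 95.3 | 73.6±1 |
| Mistral OCR API | - | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72±1.1 |
| MinerU2.5 | 1.2B | 76.6 | 54.6 | 84.9 | 33.7 | - | 78.2 | 81.2 | 83.5 | 70.4±1 |
| Nanonets-OCR2-3B | - | 75.4 | 46.1 | 86.8 | 40.9 | 32.1 | 81.9 | 93 | 99.6 | 69.5±1.1 |
| MonkeyOCR-pro-1.2B | 1.2B | 80.5 | 62.9 | 71.1 | 32.9 | - | 68.3 | 74 | 92.6 | 68.9±1.1 |
| Qwen2.5-VL-8B | 8B | 63.1 | 65.7 | 67.3 | 38.6 | - | 68.3 | 49.1 | 98.3 | 64.3±1.2 |
| Gemini Flash 2 | - | 32.1 | 56.3 | 61.4 | 27.8 | - | 58.7 | 84.4 | 94 | 59.2±1.1 |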

Data source: Allen AI olmOCR GitHub, HuggingFace Dataset, LightOnOCR arXiv Paper. We did not perform these benchmarks; results are reproduced from the original sources for informational purposes. Last updated: 2026-01-22. For our own reproduced runs and notes on methodology differences, read What We Learned Reproducing olmOCR-Bench.

Understanding the Score Categories

The overall score is a macro-average across 8 document categories. Each category tests a specific challenge where OCR pipelines commonly fail:

ArXiv

Born-digital academic papers with dense math notation, inline equations, and complex formatting. Tests formula preservation and LaTeX-style output quality.

Old Scans Math

Scanned historical math textbooks with degraded print quality. Combines OCR difficulty of old documents with math formula recognition.

Tables

Complex table structures with merged cells, nested headers, and spanning rows/columns. Tests whether cell relationships are preserved in output.

Old Scans

Historical letters and typewritten documents with noise, fading, and artifacts. The hardest category—most models score below 50% here.

Headers & Footers

Tests removal of repeated page elements (page numbers, running headers). High scores mean the model correctly strips these from output.

Multi Column

Documents with multiple columns where naive left-to-right reading produces garbage. Tests reading order preservation.

Tiny Text

Dense small fonts like dictionaries, reference tables, and fine print. Tests character-level accuracy under challenging visual conditions.

Base

Clean, well-formatted documents. Sanity check—if a model fails here (below 95%), something is fundamentally broken.

Key insight: The “Old Scans” category is where most models struggle (roughly the 30-50% range). If your use case involves historical or degraded documents, pay extra attention to this column rather than the overall score.

What It Actually Measures

The benchmark evaluates PDF-to-text pipelines using 7,010 pass/fail “unit tests” across approximately 1,403 PDF pages. Instead of producing a perfect “gold text” transcript for every page (which is expensive and brittle), the benchmark authors write small, verifiable assertions about each page.

The core questions it answers:

  • Did you preserve the important content?
  • Did you skip the junk (headers/footers/page numbers)?
  • Did the reading order make sense?
  • Did you keep tables and math usable?
  • Did you hallucinate or produce garbage output?

This makes it fundamentally different from traditional OCR benchmarks that measure character or word error rates. It tests downstream text usability for applications like RAG, search, summarization, and compliance extraction.

Document Categories

The benchmark authors deliberately chose document types where OCR/linearization tends to fail:

| Category | What It Tests |
|---|---|
| arXiv_math | Born-digital PDFs with heavy formula content |
| old_scans_math | Public domain scanned math textbooks |
| tables_tests | Complex table structures and cell relationships |
| old_scans | Historical letters and typewritten documents |
| headers_footers | Testing removal of repeated page elements |
| multi_column | Multi-column layouts and reading order |
| long_tiny_text | Dense tiny fonts like dictionaries/references |

Notably, 3,385 of the 7,010 tests are math formula tests alone. Math is heavily represented because it's where most pipelines break.

The 5 Test Types

1. Text Presence

Does a target snippet appear in the output? Supports fuzzy matching and position constraints (first/last N characters). Case-sensitive by default.
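A minimal sketch of what a presence check could look like, assuming a sliding-window similarity from Python's standard difflib. The function names, the 0.9 threshold, and the first_n/last_n parameters are illustrative assumptions, not the benchmark's actual implementation or API.

```python
from difflib import SequenceMatcher


def best_fuzzy_ratio(needle: str, haystack: str) -> float:
    """Best similarity (0..1) between needle and any needle-sized window of haystack."""
    if not needle:
        return 1.0
    if len(haystack) <= len(needle):
        return SequenceMatcher(None, needle, haystack, autojunk=False).ratio()
    # Quadratic in the worst case; acceptable for a sketch, slow for long pages.
    return max(
        SequenceMatcher(None, needle, haystack[i:i + len(needle)], autojunk=False).ratio()
        for i in range(len(haystack) - len(needle) + 1)
    )


def text_present(output, snippet, threshold=0.9, first_n=None, last_n=None,
                 case_sensitive=True):
    """Pass when snippet (approximately) occurs in the inspected region of output."""
    if first_n is not None:
        output = output[:first_n]      # only look at the first N characters
    elif last_n is not None:
        output = output[-last_n:]      # only look at the last N characters
    if not case_sensitive:
        output, snippet = output.lower(), snippet.lower()
    return best_fuzzy_ratio(snippet, output) >= threshold


page = "Introduction\nThe quick brown fox jumps over the lazy dog."
print(text_present(page, "quick brown fox"))          # exact hit
print(text_present(page, "lazy dog", first_n=20))     # outside the inspected region
```

The position constraint matters for things like titles: a heading that only shows up at the very end of the output was probably linearized in the wrong place.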

2. Text Absence

Ensures some snippet doesn't appear—used for headers, footers, and page numbers that should be stripped. Fuzzy matching with positional constraints. Not case-sensitive by default.
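As a rough illustration, an absence check can be sketched as the mirror image of a presence check. This version uses exact substring matching rather than fuzzy matching to keep it short; the function name and parameters are hypothetical.

```python
def text_absent(output, snippet, first_n=None, last_n=None, case_sensitive=False):
    """Pass when snippet does NOT occur in the inspected region of output."""
    region = output
    if first_n is not None:
        region = region[:first_n]
    elif last_n is not None:
        region = region[-last_n:]
    if not case_sensitive:
        # Case-insensitive by default, matching the behaviour described above.
        region, snippet = region.casefold(), snippet.casefold()
    return snippet not in region


raw = "Chapter 3\nResults improved by 12 percent.\nJournal of Examples 12"
cleaned = "Chapter 3\nResults improved by 12 percent."
print(text_absent(raw, "journal of examples"))      # footer leaked into the output
print(text_absent(cleaned, "journal of examples"))  # footer was stripped
```

Restricting the check to the start or end of the output helps when the same string (a number like "12", say) can legitimately appear in the body text.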

3. Natural Reading Order

Checks ordering between two spans—“before” must appear before “after” in the output. Critical for multi-column layouts where naive OCR often scrambles content.
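The idea can be sketched in a few lines, assuming exact substring matches for simplicity (the benchmark itself is described as using fuzzy matching). The function name and the two-column example are made up for illustration.

```python
def in_reading_order(output: str, before: str, after: str) -> bool:
    """Pass if some occurrence of `before` starts ahead of some occurrence of `after`."""
    first_before = output.find(before)
    last_after = output.rfind(after)
    return first_before != -1 and last_after != -1 and first_before < last_after


# A two-column page: the left column should be read in full before the right one.
column_wise = "Alpha one\nAlpha two\nBeta one\nBeta two"
row_wise = "Alpha one Beta one\nAlpha two Beta two"   # naive left-to-right scan

print(in_reading_order(column_wise, "Alpha two", "Beta one"))  # True
print(in_reading_order(row_wise, "Alpha two", "Beta one"))     # False
```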

4. Table Accuracy

Checks that a table contains expected values and that neighbor relationships match (above/below/left/right). Supports both Markdown and HTML table formats. Harder cases with rowspan/colspan rely on HTML structure.
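A neighbor check of this kind might look like the sketch below, which handles only simple Markdown tables (no rowspan/colspan, no HTML). The parser, the direction names, and the sample table are assumptions for illustration, not the benchmark's own code.

```python
def parse_markdown_table(md: str) -> list[list[str]]:
    """Turn a simple pipe table into a grid of stripped cell strings."""
    rows = []
    for line in md.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the |---|---| separator row (cells made only of dashes/colons).
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    return rows


OFFSETS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def neighbor_matches(grid, cell, direction, expected) -> bool:
    """True if any cell equal to `cell` has `expected` as its neighbor in `direction`."""
    dr, dc = OFFSETS[direction]
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value != cell:
                continue
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[nr]) and grid[nr][nc] == expected:
                return True
    return False


table = """
| Model | Tables | Base |
|-------|--------|------|
| A     | 88     | 99.9 |
| B     | 85     | 99.8 |
"""
grid = parse_markdown_table(table)
print(neighbor_matches(grid, "88", "left", "A"))       # True
print(neighbor_matches(grid, "88", "right", "99.8"))   # False: 99.8 is in B's row
```

Checking neighbors rather than the whole table means a model can pass even if it formats the table differently, as long as the cell relationships survive.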

5. Math Formula Accuracy

The sophisticated part: they render a reference equation with KaTeX, extract symbol bounding boxes, and verify whether the OCR output contains a matching symbol layout. Checks relative positions like “∫ left of x” or “3 above −3”.
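The relative-position step can be illustrated with a toy sketch. It assumes symbol bounding boxes have already been extracted from a rendered equation (the KaTeX rendering and box extraction are not shown), uses invented coordinates, and does not handle repeated symbols. None of these names come from the benchmark's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    symbol: str
    x0: float
    y0: float
    x1: float
    y1: float   # screen coordinates: y grows downward


def relations(boxes):
    """Collect (a, relation, b) triples for every ordered pair of distinct boxes."""
    found = set()
    for a in boxes:
        for b in boxes:
            if a is b:
                continue
            if a.x1 <= b.x0:
                found.add((a.symbol, "left_of", b.symbol))
            if a.y1 <= b.y0:
                found.add((a.symbol, "above", b.symbol))
    return found


def layout_matches(reference, candidate) -> bool:
    """Pass if every relation in the reference layout also holds in the candidate."""
    return relations(reference) <= relations(candidate)


# Reference: a fraction with 3 over x. Candidates: same layout at another scale, and "3x" inline.
reference = [Box("3", 0, 0, 10, 10), Box("x", 0, 12, 10, 22)]
stacked = [Box("3", 5, 0, 25, 20), Box("x", 5, 24, 25, 44)]
inline = [Box("3", 0, 0, 10, 10), Box("x", 10, 0, 20, 10)]

print(layout_matches(reference, stacked))  # True
print(layout_matches(reference, inline))   # False
```

Comparing relative positions rather than LaTeX source is what lets two differently written but visually equivalent formulas both pass.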

Important detail: Before comparison, they normalize strings heavily (whitespace, markdown formatting, quotes/hyphens, Unicode NFC). This reduces “format noise” so they test meaning more than exact typography.
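A rough sketch of that kind of normalization, using only the Python standard library. The exact rules the benchmark applies are not reproduced here; the character map and the crude markdown stripping below are assumptions chosen to show the idea.

```python
import re
import unicodedata

# Map typographic quotes and dash variants to plain ASCII equivalents.
_CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2010": "-", "\u2011": "-", "\u2013": "-", "\u2014": "-",
})


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # compose e + combining accent, etc.
    text = text.translate(_CHAR_MAP)
    text = re.sub(r"[*`]+", "", text)           # crude: drop bold/italic/code markers
    text = re.sub(r"\s+", " ", text)            # collapse all whitespace runs
    return text.strip()


print(normalize("**Hello**   \u201cworld\u201d\n cafe\u0301 1\u20132"))
```

Both the expected snippet and the model output would go through the same function before comparison, so that a curly quote or a doubled space does not fail a test on its own.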

How Scoring Works

Each test is pass/fail. Per category, you get the percentage of tests passed:

Score(category) = tests passed / total tests in category

The final score is a macro-average across document categories:

Overall = (1/N) × Σ Score(category)

This means categories are weighted equally regardless of test count. A category with 100 tests counts the same as one with 1,000 tests. This avoids “math dominates everything” purely because it has the most tests.
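A small worked example of the two formulas. The pass counts in the first part are invented to show how macro- and micro-averaging diverge; the second part plugs in the seven category scores listed for LightOnOCR-2-1B in the leaderboard above.

```python
def category_score(passed: int, total: int) -> float:
    """Percentage of tests passed within one category."""
    return 100.0 * passed / total


def macro_average(scores: list[float]) -> float:
    """Unweighted mean of per-category scores."""
    return sum(scores) / len(scores)


# Hypothetical counts: a large math category and a small tables category.
math = category_score(600, 1000)     # 60.0
tables = category_score(90, 100)     # 90.0

macro = macro_average([math, tables])            # every category counts equally
micro = 100.0 * (600 + 90) / (1000 + 100)        # every test counts equally

print(macro)             # 75.0
print(round(micro, 1))   # 62.7

# The seven category scores reported for LightOnOCR-2-1B.
lighton = [89.6, 85.6, 89.0, 42.2, 84.8, 91.4, 99.6]
print(round(macro_average(lighton), 1))   # 83.2
```

With micro-averaging, the large math category would drag the hypothetical score down to 62.7; macro-averaging keeps it at 75.0.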

They also report confidence intervals via bootstrapping in their results.
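One common way to get such an interval is a percentile bootstrap over the pass/fail outcomes. The sketch below resamples tests within each category and recomputes the macro-average. Whether the benchmark resamples in exactly this way is an assumption, and the data is invented.

```python
import random


def macro_score(results: dict[str, list[bool]]) -> float:
    """Macro-average of per-category pass percentages."""
    return sum(100.0 * sum(v) / len(v) for v in results.values()) / len(results)


def bootstrap_ci(results, n_resamples=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) via a percentile bootstrap."""
    rng = random.Random(seed)   # fixed seed keeps the sketch reproducible
    stats = []
    for _ in range(n_resamples):
        # Resample each category's outcomes with replacement, keeping its size.
        resampled = {cat: rng.choices(outcomes, k=len(outcomes))
                     for cat, outcomes in results.items()}
        stats.append(macro_score(resampled))
    stats.sort()
    k = round(n_resamples * alpha / 2)
    return macro_score(results), stats[k], stats[-k - 1]


results = {
    "math": [True] * 60 + [False] * 40,
    "tables": [True] * 9 + [False] * 1,
}
point, low, high = bootstrap_ci(results)
print(point, low, high)
```

Small categories widen the interval noticeably, which is one reason differences of a point or so between neighbouring leaderboard rows should not be over-read.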

What It's Good At

If your goal is LLM-ready document text (RAG, search, summarization, compliance extraction), olmOCR-Bench is genuinely useful because it rewards:

  • Not hallucinating content that isn't there
  • Not dropping important content
  • Preserving sensible reading order
  • Keeping tables and math usable
  • Handling ugly scans and weird layouts

And since it's unit-test based, it's:

  • Deterministic — same input always produces same score
  • Cheap to run — just string matching, no GPU required
  • Not “LLM-as-judge” — avoids bias toward the evaluator model

Limitations & Criticisms

LlamaIndex published a detailed critique that raises valid points:

Domain Skew

The benchmark is heavily skewed toward academic/scientific PDFs, not real business documents. If you process invoices, contracts, or forms, high olmOCR-Bench scores may not transfer to your use case.

Coarse Pass/Fail

Tests are binary—you pass or fail. This doesn't measure how wrong something is. A minor typo and a completely garbled output both count as failures.

Exactness Brittleness

Even with normalization, some checks can penalize harmless variations. Edge cases in string matching can create false failures.

Opinionated Header/Footer Removal

The benchmark assumes you want to remove headers/footers. But sometimes you need them—depends on your use case. This design choice is baked in.

Data Generation Bias

Some tests were created using LLM pipelines. Tables were filtered by what Gemini could parse. This can skew what “counts as hard” toward what existing models struggle with.

English-Focused

The benchmark is primarily English. Multilingual document processing isn't tested.

Context matters: LlamaIndex's incentives lean toward “benchmarks should match enterprise doc workflows,” while olmOCR is explicitly optimizing for PDF-to-LLM-text at scale. Different target, different “correct.”

How to Use It Intelligently

Use olmOCR-Bench as a strong signal, not the final word.

It's great for quickly telling:

  • Does this model collapse on math?
  • Does it scramble multi-column reading order?
  • Does it destroy tables?
  • Does it hallucinate or omit chunks?

But always add a “your docs” slice. The best practice:

  1. Keep olmOCR-Bench as a public baseline
  2. Add a private suite of your own PDFs + your own unit tests

That avoids optimizing for a leaderboard that doesn't match your actual pain points.
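A private suite does not need much machinery. The sketch below shows one possible shape: a list of test records and a scorer that reports the pass rate per category. The field names ("doc", "category", "type", and so on) and the sample documents are hypothetical and are not the olmOCR-Bench test format.

```python
from collections import defaultdict


def run_test(test: dict, output: str) -> bool:
    """Evaluate one pass/fail assertion against a document's OCR output."""
    kind = test["type"]
    if kind == "present":
        return test["text"] in output
    if kind == "absent":
        return test["text"] not in output
    if kind == "order":
        b, a = output.find(test["before"]), output.rfind(test["after"])
        return b != -1 and a != -1 and b < a
    raise ValueError(f"unknown test type: {kind}")


def score_suite(tests, outputs):
    """Return {category: percent passed}; outputs maps document id -> OCR text."""
    tally = defaultdict(lambda: [0, 0])   # category -> [passed, total]
    for t in tests:
        # Caveat: a missing or empty output trivially passes "absent" tests.
        passed = run_test(t, outputs.get(t["doc"], ""))
        tally[t["category"]][0] += passed
        tally[t["category"]][1] += 1
    return {cat: 100.0 * p / n for cat, (p, n) in tally.items()}


tests = [
    {"doc": "inv-001", "category": "invoices", "type": "present", "text": "Total due: 1,250.00"},
    {"doc": "inv-001", "category": "invoices", "type": "absent", "text": "Page 1 of 3"},
    {"doc": "con-007", "category": "contracts", "type": "order",
     "before": "1. Definitions", "after": "2. Term"},
    {"doc": "con-007", "category": "contracts", "type": "present", "text": "Governing Law"},
]
outputs = {
    "inv-001": "ACME Corp\nTotal due: 1,250.00",
    "con-007": "1. Definitions\n...\n2. Term\n...",
}
print(score_suite(tests, outputs))   # {'invoices': 100.0, 'contracts': 50.0}
```

Reporting per category, rather than one blended number, keeps the same property as the public benchmark: a weakness in one document type stays visible.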

What olmOCR-Bench Is Not

Don't expect these capabilities:

  • Field extraction benchmark — no invoice totals, line-items, etc.
  • Multilingual OCR quality — primarily English
  • Fine-grained character accuracy — no CER/WER metrics
  • Layout reconstruction fidelity — not about exact PDF structure preservation

It's a pragmatic eval for downstream text usability, not an exhaustive OCR quality assessment.