Document Processing Benchmarks Are a Mess. Here's How to Read Them.

The Confusion Nobody Talks About#

A few months ago, I needed to extract data from PDFs. Simple enough, I thought. I'd find the best OCR tool, run it, done.

So I did what anyone would do: I searched for benchmarks. And that's when the confusion started.

Tesseract has 95% accuracy. PaddleOCR claims 98%. Azure says 99%. GPT-4V is "state-of-the-art." A new model on Hugging Face just dropped claiming to beat everything. But beat everything on what? Receipts? Scientific papers? Handwriting? Tables? And wait, some of these numbers are for "OCR" and some are for "document understanding" and some are for "visual question answering." Are those the same thing?

They are not the same thing.

I spent weeks sorting this out. This guide is the mental model I wish someone had handed me at the start: what these benchmarks actually measure, and which ones matter for your actual problem.

The Core Insight: Document AI Has Layers#

"Document processing" isn't one task. It's a stack of tasks, and different benchmarks measure different layers.

The Document AI Stack

5. Reasoning: What's the answer to this question about the document?

4. Semantics: What is this field? Is this the invoice total or the subtotal?

3. Structure: Where are the tables? What's the reading order?

2. Perception: What text is on this page?

1. Pixels: The raw image

When someone says "our model achieves 98% accuracy," the first question you should ask is: on which layer?

A model can nail layer 2 (reading text perfectly) and completely fail at layer 4 (understanding that "Total: $542.00" is the invoice total, not just some text). These are different skills. Comparing their scores is like comparing a spelling test to a reading comprehension test. Both involve words, but they measure different things.

The golden rule: compare within a layer, never across layers.

Why "High OCR Accuracy" Can Be Meaningless#

Imagine an invoice with this text:

Subtotal:    $450.00
Tax:          $42.00
Shipping:     $50.00
Total:       $542.00

A good OCR engine will read every character perfectly. 100% accuracy. Perfect score on layer 2.

But your actual task is extracting the total amount. That requires layer 4: understanding that "Total:" is a label, "$542.00" is its value, and this specific number (not the subtotal, not the tax) is what you need.

I've seen pipelines with 99% OCR accuracy that still fail on 30% of invoices because the extraction logic breaks. The text is read perfectly, but the system can't figure out which number is which.

Layer 2, Perception (OCR): Perfect

ACME Corp Invoice
Invoice #: 2024-0847
---
Subtotal: $1,240.00
Tax (8%): $99.20
Total Due: $1,339.20
---
Previous Balance: $1,339.20

Layer 4, Semantics (KIE): Wrong Field

invoice_number: 2024-0847
vendor: ACME Corp
total_amount: $1,339.20

Grabbed "Previous Balance" instead of "Total Due": same value, wrong field.

The OCR is flawless. The extraction is wrong. This is why you need benchmarks at every layer.

This is why leaderboards can be misleading. A model can top the OCR benchmarks and still be useless for your invoice extraction task. Different layers, different skills.
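Here is a minimal sketch of how that failure happens in code. Both extraction functions are hypothetical illustrations I made up for this example, not any particular tool's logic, and the OCR lines are assumed to be read perfectly.

```python
import re

# OCR output for the sample invoice, assumed to be read perfectly (layer 2).
OCR_LINES = [
    "ACME Corp Invoice",
    "Invoice #: 2024-0847",
    "Subtotal: $1,240.00",
    "Tax (8%): $99.20",
    "Total Due: $1,339.20",
    "Previous Balance: $1,339.20",
]

AMOUNT = re.compile(r"\$[\d,]+\.\d{2}")


def naive_total(lines):
    """Hypothetical heuristic: take the last dollar amount on the page."""
    for line in reversed(lines):
        match = AMOUNT.search(line)
        if match:
            return line.split(":")[0].strip(), match.group(0)
    return None


def label_aware_total(lines, labels=("total due", "total")):
    """Hypothetical rule: only accept an amount whose label names the total."""
    for line in lines:
        label, _, value = line.partition(":")
        if label.strip().lower() in labels:
            match = AMOUNT.search(value)
            if match:
                return label.strip(), match.group(0)
    return None


print(naive_total(OCR_LINES))        # ('Previous Balance', '$1,339.20')
print(label_aware_total(OCR_LINES))  # ('Total Due', '$1,339.20')
```

Both functions return the same dollar value, so a check that only compares amounts would pass. Only a check on which label the value came from catches the layer 4 error.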

The Five Layers, Explained#

Layer 1 is just raw pixels, the image itself. Nothing to benchmark there. The interesting stuff starts at Layer 2.

Layer 2: Perception (OCR)#

On clean, printed English text, OCR is basically solved. Every major tool (Tesseract, PaddleOCR, cloud APIs) gets above 95% character accuracy. This is the layer people obsess over when they should be worrying about layers 3-5.

The interesting questions are about edge cases:

Rotated text: Tesseract's error rate climbs to about 98% at a 90-degree rotation. PaddleOCR handles it fine.
Handwriting: Still hard. Even the best models struggle with messy handwriting.
Non-Latin scripts: Arabic, Hindi, CJK characters. Performance varies wildly.
Degraded scans: Faxes, old photocopies, coffee stains on paper.

If you're processing clean business documents in English, OCR is probably not your bottleneck. If you're processing historical archives or multilingual forms, it very much is.

Concerning: Over 0.2 normalized error

Good: Under 0.05 normalized error on clean text

State of the art: Under 0.02 (clean print)
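To make "normalized error" concrete, here is a rough sketch: edit distance between the OCR output and the reference text, divided by the length of the longer string. Benchmarks differ in how they normalize (some divide by the reference length instead), so treat the denominator here as my assumption and check the benchmark's own evaluation script before comparing numbers.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # drop char_a
                current[j - 1] + 1,      # add char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def normalized_error(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the longer string's length, in [0, 1]."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 0.0
    return levenshtein(reference, hypothesis) / longest


reference = "Total Due: $1,339.20"
hypothesis = "Total Due: $1,389.20"  # one misread digit
print(round(normalized_error(reference, hypothesis), 3))  # 0.05
```

One wrong digit in a 20-character line already sits at the 0.05 boundary, and in this case the wrong digit is in the amount you care about.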

Layer 3: Structure (Layout & Tables)#

So your OCR reads the text perfectly. Now what? A document isn't just text. It's text arranged in space: two-column layouts, tables, headers, footnotes. Flatten everything into a single text stream and you lose information.

Table extraction is where this gets painful. A model can be great at finding tables but terrible at parsing them. These are two separate sub-problems (detection vs. structure recognition), and the metrics reflect that. TEDS (Tree-Edit-Distance-based Similarity) measures how close the predicted table structure is to ground truth, on a scale from 0 to 1. A TEDS of 0.95 means "almost perfect." Below 0.8 means broken tables.
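For reference, this is the TEDS definition as I understand it from the PubTabNet paper. Both tables are represented as HTML trees, with $T_a$ the predicted tree, $T_b$ the ground-truth tree, and $|T|$ the number of nodes in a tree:

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

A score of 1 means the trees are identical. Because the edit distance is normalized by tree size, one misplaced cell costs much more in a small table than in a large one.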

The trap: the most popular benchmark (PubLayNet) is entirely scientific papers. Models trained on it score beautifully on academic PDFs and then fall apart on tax forms. Always check what the benchmark dataset actually contains.

Good: Over 0.9 TEDS / 90% mAP

State of the art: Over 0.95 TEDS / 96% mAP

Scores on DocLayNet typically run about 15 mAP points lower than on PubLayNet.

Layer 4: Semantics (Key Information Extraction)#

This is where most real-world value lives, and where the numbers get humbling. Even state-of-the-art models (LayoutLMv3, etc.) hover around 90% F1 on complex forms. That sounds good until you realize it means roughly 1 in 10 fields has an error. For high-volume automation, that's a lot of manual review.

You usually don't want raw text. You want structured data: invoice numbers, dates, line items, totals. This requires understanding labels, relationships, and context. Not just "there's text that says $542.00" but "this is the invoice total."

Watch out for the entity-vs-document trap: a model might get 90% of fields right across all documents and still get at least one field wrong on 20% of them or more (typically the ones with unusual layouts). If you need every field correct on every document, your effective accuracy is much lower than the headline number.
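The arithmetic behind that trap is easy to check. The sketch below uses made-up results (10 documents, 10 fields each) chosen only to illustrate the gap between field-level and document-level accuracy.

```python
def field_accuracy(documents):
    """Share of individual fields that are correct, pooled over all documents."""
    flags = [ok for doc in documents for ok in doc]
    return sum(flags) / len(flags)


def document_accuracy(documents):
    """Share of documents in which every field is correct."""
    return sum(all(doc) for doc in documents) / len(documents)


# Made-up results: True means the field was extracted correctly.
# Eight documents are perfect; two unusual layouts lose five fields each.
documents = (
    [[True] * 10 for _ in range(8)]
    + [[True] * 5 + [False] * 5 for _ in range(2)]
)

print(field_accuracy(documents))     # 0.9
print(document_accuracy(documents))  # 0.8

# If the same errors were spread independently instead of clustered,
# a 10-field document would be fully correct with probability 0.9 ** 10.
print(round(0.9 ** 10, 3))           # 0.349
```

Clustered errors are the kinder case here. If errors are independent, a 90% field score means only about a third of 10-field documents come out fully correct.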

Baseline: About 60% F1 (BERT)

Good: About 85% F1

State of the art: About 92% F1 (LayoutLMv3)
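If F1 feels abstract, here is a simplified sketch of how it can be computed for field extraction, treating each (field, value) pair as one item and requiring an exact string match. Real benchmarks such as FUNSD score at the entity or token level with their own matching rules, so this is an approximation under my assumptions, not their evaluation code.

```python
def field_f1(predicted: dict, expected: dict) -> float:
    """F1 over (field, value) pairs, counting only exact matches as correct."""
    predicted_pairs = set(predicted.items())
    expected_pairs = set(expected.items())
    true_positives = len(predicted_pairs & expected_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_pairs)
    recall = true_positives / len(expected_pairs)
    return 2 * precision * recall / (precision + recall)


# Made-up example: one wrong value and one missing field.
expected = {
    "invoice_number": "2024-0847",
    "vendor": "ACME Corp",
    "total_amount": "$1,339.20",
    "tax": "$99.20",
}
predicted = {
    "invoice_number": "2024-0847",
    "vendor": "ACME Corp",
    "total_amount": "$1,240.00",  # grabbed the subtotal
}

print(round(field_f1(predicted, expected), 3))  # 0.571
```

The wrong total hurts precision and recall at the same time, and the missing tax field hurts recall only. That is why F1 drops faster than a simple "fields correct" count would suggest.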

Layer 5: Reasoning (Document QA)#

Layer 4 extracts predefined fields. Layer 5 answers open-ended questions: "What is the total amount due?" or "Who signed this contract?" The model must find relevant text and reason about it.

Human performance on DocVQA (the main benchmark) is about 94%. In 2020, the best models managed 50-60%. Today, they're at 80-85%. Vision-language models (GPT-4V, Claude, Gemini) are pushing this higher, but there's still a meaningful gap.

If you know exactly what fields you need, layer 4 extraction is more reliable. If you need to answer arbitrary questions across varied documents, you're in layer 5 territory.

Baseline: About 55% (2020 models)

Good: About 80-85% (current)

State of the art: About 85% (Qwen-VL 2.5)

Human performance is ~94%. Significant gap remains.
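DocVQA results are typically reported as ANLS (Average Normalized Levenshtein Similarity), which gives partial credit for near-miss answers and zero credit once an answer is too far off. Below is a minimal sketch of that metric. The 0.5 threshold and the lowercase-and-strip normalization reflect my understanding of the usual setup, so check the official evaluation script before comparing against leaderboard numbers.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Mean over questions of the best similarity against any accepted answer."""
    scores = []
    for prediction, answers in zip(predictions, ground_truths):
        best = 0.0
        for answer in answers:
            p, a = prediction.strip().lower(), answer.strip().lower()
            longest = max(len(p), len(a))
            distance = levenshtein(p, a) / longest if longest else 0.0
            if distance < threshold:  # too-distant answers score zero
                best = max(best, 1.0 - distance)
        scores.append(best)
    return sum(scores) / len(scores)


# Made-up predictions; each question may have several accepted answers.
predictions = ["$1,339.20", "acme corp", "2024"]
ground_truths = [["$1,339.20"], ["ACME Corp", "ACME Corporation"], ["2024-0847"]]

print(round(anls(predictions, ground_truths), 3))  # 0.667
```

The truncated invoice number scores zero rather than earning partial credit, because its normalized distance (5 edits over 9 characters) is above the threshold.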

The Benchmarks That Actually Matter#

Which benchmarks measure which layer? (For full details on each, see the Benchmark Atlas.)

Layer	What it tests	Benchmarks
Perception	OCR accuracy	● OmniDocBench, ◐ SROIE, ● IAM
Structure (Layout)	Page segmentation	● DocLayNet, ● PubLayNet
Structure (Tables)	Table parsing	● PubTabNet, ● FinTabNet
Semantics	Field extraction	● FUNSD, ● DocILE, ● CORD
Reasoning	Document QA	◐ DocVQA, ◐ InfoVQA

Legend: ● Fully reproducible, ◐ Leaderboard eval, ○ Private

Pick one benchmark per layer that matches your document type. Don't chase leaderboard positions across unrelated benchmarks.

The Reproducibility Problem#

Not all benchmark results are equally trustworthy.

Fully reproducible: Most academic benchmarks (FUNSD, CORD, PubTabNet) release both data and evaluation code. You can download them, run your model, and verify any claimed result.

Partially reproducible: Some benchmarks (SROIE, DocVQA test sets) withhold test labels, so you submit to a leaderboard for evaluation. This is fine for fair comparison but means you can't debug failures on the test set.

Not reproducible: Vendor benchmarks and many "state-of-the-art" claims use private datasets. When a company says "98% accuracy," ask: on what data? If they can't or won't share, treat the number as marketing, not science.

If a result can't be reproduced on public data with public evaluation code, discount it heavily.

What I Actually Do Now#

Step 1: Identify my layer. Am I trying to read text, parse tables, extract fields, or answer questions? This determines which benchmarks matter.

Step 2: Pick one public benchmark per relevant layer. I don't try to evaluate everything. One benchmark that matches my document type is enough for initial comparison.

Step 3: Run a baseline. Before trying anything fancy, I test a simple, well-documented tool (PaddleOCR for perception, LayoutLM for semantics). This gives me a floor to compare against.

Step 4: Test on my actual documents. Public benchmarks are proxies. A model might ace FUNSD and fail on my specific forms. I always reserve some real documents for final validation.

Step 5: Diagnose by layer. When things break, I figure out which layer failed. Is the text being read wrong (perception)? Is the layout being parsed wrong (structure)? Is the field mapping wrong (semantics)? This prevents chasing phantom problems.

Minimum Viable Eval Loop

Define the layer: OCR vs tables vs extraction vs QA

Select benchmarks: One public benchmark + one internal slice

Run a baseline: Record the metric before trying anything fancy

Diagnose by layer: Trace errors to a specific layer, not a model name
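The "diagnose by layer" step can be as simple as a tally. This sketch assumes you already have a pass/fail check per layer for each document (for example, OCR text vs. reference text, table cells vs. expected cells, fields vs. expected fields). The file names, layer names, and results below are all hypothetical.

```python
from collections import Counter

# Lower layers come first: a perception failure explains failures above it.
LAYER_ORDER = ["perception", "structure", "semantics"]


def first_failing_layer(checks: dict) -> str:
    """Return the lowest layer whose check failed, or 'ok' if all passed."""
    for layer in LAYER_ORDER:
        if not checks.get(layer, True):
            return layer
    return "ok"


# Hypothetical per-document check results from an internal test slice.
results = {
    "invoice_001.pdf": {"perception": True, "structure": True, "semantics": True},
    "invoice_002.pdf": {"perception": True, "structure": True, "semantics": False},
    "invoice_003.pdf": {"perception": False, "structure": False, "semantics": False},
    "invoice_004.pdf": {"perception": True, "structure": False, "semantics": False},
    "invoice_005.pdf": {"perception": True, "structure": True, "semantics": False},
}

failures = Counter(first_failing_layer(checks) for checks in results.values())
print(failures["semantics"])   # 2
print(failures["perception"])  # 1
```

Attributing each failure to the lowest broken layer keeps one bad scan from being counted as an OCR problem, a table problem, and an extraction problem all at once.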

The Uncomfortable Truths#

Generic models underperform specialized ones. GPT-4V is impressive for general document understanding, but specialized table parsers (RapidTable, TableFormer) still beat it on table structure tasks. For production systems, the unglamorous specialized tool often wins.

Benchmark datasets don't represent your data. FUNSD has 199 documents. CORD has 1,000 receipts. Your pipeline will see thousands of document variations these benchmarks never imagined. Treat benchmark scores as a starting point for comparison, not a prediction of production performance.

The "last 5%" is where all the cost lives. Going from 85% to 90% accuracy is engineering. Going from 95% to 99% is a different kind of problem entirely, one that benchmarks can't tell you much about because they measure the average case and you care about the worst case.

If You Only Remember Three Things#

Document AI is not one task. OCR, layout, extraction, and document QA are different layers with different failure modes.
High OCR accuracy can still produce a bad workflow. Reading the text correctly is not the same as extracting the right fields reliably.
Benchmark scores narrow the field. They do not replace testing on your own documents.

What I Would Do In Practice#

If I had to evaluate a document workflow tomorrow, this is the sequence:

Pick the exact task layer that matters most.
Choose one public benchmark that matches that layer.
Run one boring baseline before touching anything fancy.
Test on actual documents as early as possible.
Diagnose failures by stage, not by vague impressions like "the OCR is bad."

That is less exciting than screenshotting a leaderboard. It is also how you avoid wasting weeks.

Resources#

Benchmark Atlas — Full details on each benchmark: what it measures, performance ranges, gotchas, how to run evaluations
olmOCR-Bench — Our live benchmark data for PDF linearization quality (perception layer)
What We Learned Reproducing olmOCR-Bench — Our own benchmark reproduction notes, including where methodology changed the score
OmniDocBench — Comprehensive multi-layer evaluation suite
DocVQA Leaderboard — Current standings for document QA
PapersWithCode Document AI — Aggregated leaderboards