Document Processing Benchmarks Are a Mess. Here's How to Read Them.
A practical guide to evaluating document AI. Mental models for understanding OCR, table extraction, and document understanding benchmarks without drowning in metrics.
The Confusion Nobody Talks About
A few months ago, I needed to extract data from PDFs. Simple enough, I thought. I'd find the best OCR tool, run it, done.
So I did what anyone would do: I searched for benchmarks. And that's when the confusion started.
Tesseract has 95% accuracy. PaddleOCR claims 98%. Azure says 99%. GPT-4V is "state-of-the-art." A new model on Hugging Face just dropped claiming to beat everything. But beat everything on what? Receipts? Scientific papers? Handwriting? Tables? And wait, some of these numbers are for "OCR" and some are for "document understanding" and some are for "visual question answering." Are those the same thing?
They are not the same thing.
I spent weeks sorting this out. This guide is the mental model I wish someone had handed me at the start: what these benchmarks actually measure, and which ones matter for your actual problem.
The Core Insight: Document AI Has Layers
"Document processing" isn't one task. It's a stack of tasks, and different benchmarks measure different layers.
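The Document AI Stack:
- Layer 1: Pixels (the raw image)
- Layer 2: Perception (OCR)
- Layer 3: Structure (layout and tables)
- Layer 4: Semantics (key information extraction)
- Layer 5: Reasoning (document QA)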
When someone says "our model achieves 98% accuracy," the first question you should ask is: on which layer?
A model can nail layer 2 (reading text perfectly) and completely fail at layer 4 (understanding that "Total: $542.00" is the invoice total, not just some text). These are different skills. Comparing their scores is like comparing a spelling test to a reading comprehension test. Both involve words, but they measure different things.
The golden rule: compare within a layer, never across layers.
Why "High OCR Accuracy" Can Be Meaningless
Imagine an invoice with this text:
Subtotal: $450.00
Tax: $42.00
Shipping: $50.00
Total: $542.00
A good OCR engine will read every character perfectly. 100% accuracy. Perfect score on layer 2.
But your actual task is extracting the total amount. That requires layer 4: understanding that "Total:" is a label, "$542.00" is its value, and this specific number (not the subtotal, not the tax) is what you need.
I've seen pipelines with 99% OCR accuracy that still fail on 30% of invoices because the extraction logic breaks. The text is read perfectly, but the system can't figure out which number is which.
ACME Corp Invoice
Invoice #: 2024-0847
---
Subtotal: $1,240.00
Tax (8%): $99.20
Total Due: $1,339.20
---
Previous Balance: $1,339.20
Here the extractor grabbed "Previous Balance" instead of "Total Due": same value, wrong field.
The OCR is flawless. The extraction is wrong. This is why you need benchmarks at every layer.
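A minimal sketch of that failure, assuming the OCR output is exactly the invoice text above. The function names `naive_total` and `labeled_total` are hypothetical illustrations, not part of any library, and the "last amount on the page" heuristic is just one plausible way extraction logic goes wrong.

```python
import re

# OCR output for the invoice above, assumed to be read perfectly.
OCR_TEXT = """ACME Corp Invoice
Invoice #: 2024-0847
Subtotal: $1,240.00
Tax (8%): $99.20
Total Due: $1,339.20
Previous Balance: $1,339.20"""

AMOUNT = r"\$([\d,]+\.\d{2})"


def naive_total(text: str) -> tuple[str, str]:
    """Hypothetical extractor: treat the last dollar amount on the page as the total."""
    label, amount = "", ""
    for line in text.splitlines():
        match = re.search(AMOUNT, line)
        if match:
            label, amount = line.split(":")[0], match.group(1)
    return label, amount


def labeled_total(text: str) -> tuple[str, str]:
    """Hypothetical extractor: anchor on the 'Total Due' label instead of position."""
    for line in text.splitlines():
        match = re.match(r"(Total Due):\s*" + AMOUNT, line)
        if match:
            return match.group(1), match.group(2)
    return "", ""


print(naive_total(OCR_TEXT))    # label comes back as "Previous Balance"
print(labeled_total(OCR_TEXT))  # label comes back as "Total Due"
```

Both functions return the same amount, so a test that only compares values would pass. Checking which label the value came from is what exposes the layer 4 error.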
This is why leaderboards can be misleading. A model can top the OCR benchmarks and still be useless for your invoice extraction task. Different layers, different skills.
The Five Layers, Explained
Layer 1 is just raw pixels, the image itself. Nothing to benchmark there. The interesting stuff starts at Layer 2.
Layer 2: Perception (OCR)
On clean, printed English text, OCR is basically solved. Every major tool (Tesseract, PaddleOCR, cloud APIs) gets above 95% accuracy. This is the layer people obsess over when they should be worrying about layers 3-5.
The interesting questions are about edge cases:
- Rotated text: Tesseract's error rate reaches about 98% at 90-degree rotation, so it fails almost completely. PaddleOCR handles it fine.
- Handwriting: Still hard. Even the best models struggle with messy handwriting.
- Non-Latin scripts: Arabic, Hindi, CJK characters. Performance varies wildly.
- Degraded scans: Faxes, old photocopies, coffee stains on paper.
If you're processing clean business documents in English, OCR is probably not your bottleneck. If you're processing historical archives or multilingual forms, it very much is.
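To make an "accuracy" figure concrete, here is a sketch of character error rate (CER), assuming the headline number is character accuracy, that is, 1 minus CER. That is one common convention; a given vendor may mean word accuracy or something else entirely. The sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


reference = "Total: $542.00"
hypothesis = "Total: $542.0O"  # final zero misread as the letter O
cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.3f}, character accuracy: {1 - cer:.3f}")
```

One wrong character out of 14 still leaves character accuracy near 93%, yet the amount no longer parses as a number. A single aggregate score says nothing about which characters were wrong.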
Layer 3: Structure (Layout & Tables)
So your OCR reads the text perfectly. Now what? A document isn't just text. It's text arranged in space: two-column layouts, tables, headers, footnotes. Flatten everything into a single text stream and you lose information.
Table extraction is where this gets painful. A model can be great at finding tables but terrible at parsing them. These are two separate sub-problems (detection vs. structure recognition), and the metrics reflect that. TEDS (Tree-Edit-Distance-based Similarity) measures how close the predicted table structure is to ground truth, on a scale from 0 to 1. A TEDS of 0.95 means "almost perfect." Below 0.8 means broken tables.
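For reference, TEDS as it is commonly defined (the formulation introduced with PubTabNet, not something specific to this article) treats the predicted and ground-truth tables as HTML trees and normalizes the tree edit distance between them:

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Here $|T|$ is the number of nodes in tree $T$. As an illustrative case, if the larger tree has 50 nodes and 5 edits separate the two trees, TEDS is $1 - 5/50 = 0.9$.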
The trap: the most popular benchmark (PubLayNet) is entirely scientific papers. Models trained on it score beautifully on academic PDFs and then fall apart on tax forms. Always check what the benchmark dataset actually contains.
For comparison, scores on DocLayNet, which covers a broader mix of document types, run about 15 points lower than on PubLayNet.
Layer 4: Semantics (Key Information Extraction)
This is where most real-world value lives, and where the numbers get humbling. Even state-of-the-art models (LayoutLMv3, etc.) hover around 90% F1 on complex forms. That sounds good until you realize it means roughly 1 in 10 fields has an error. For high-volume automation, that's a lot of manual review.
You usually don't want raw text. You want structured data: invoice numbers, dates, line items, totals. This requires understanding labels, relationships, and context. Not just "there's text that says $542.00" but "this is the invoice total."
Watch out for the entity-vs-document trap: a model might get 90% of fields right across all documents but still fail completely on 20% of documents (the ones with unusual layouts). If you need every field correct on every document, your effective accuracy is much lower than the headline number.
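The gap between the two numbers is easy to see with a toy calculation. The results below are invented for illustration: 10 documents with 10 fields each, where the 2 documents with unusual layouts get half their fields wrong.

```python
# Hypothetical evaluation results: one list per document, one boolean per field
# (True means the field was extracted correctly).
results = [[True] * 10 for _ in range(8)]                 # 8 standard layouts: all fields right
results += [[True] * 5 + [False] * 5 for _ in range(2)]   # 2 unusual layouts: half the fields wrong

# Field-level (entity-level) accuracy: pool every field from every document.
total_fields = sum(len(doc) for doc in results)
correct_fields = sum(sum(doc) for doc in results)
field_accuracy = correct_fields / total_fields

# Document-level accuracy: a document counts only if every field is right.
perfect_docs = sum(all(doc) for doc in results)
document_accuracy = perfect_docs / len(results)

print(f"field-level accuracy:    {field_accuracy:.0%}")     # 90%
print(f"document-level accuracy: {document_accuracy:.0%}")  # 80%
```

In this example the errors are concentrated in a few documents. If they were instead spread independently across fields, a 10-field document would be fully correct only about 35% of the time (0.9^10 ≈ 0.349), so the same 90% headline can hide very different workloads.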
Layer 5: Reasoning (Document QA)
Layer 4 extracts predefined fields. Layer 5 answers open-ended questions: "What is the total amount due?" or "Who signed this contract?" The model must find relevant text and reason about it.
Human performance on DocVQA (the main benchmark) is about 94%. In 2020, the best models managed 50-60%. Today, they're at 80-85%. Vision-language models (GPT-4V, Claude, Gemini) are pushing this higher, but there's still a meaningful gap.
If you know exactly what fields you need, layer 4 extraction is more reliable. If you need to answer arbitrary questions across varied documents, you're in layer 5 territory.
The Benchmarks That Actually Matter
Which benchmarks measure which layer? (For full details on each, see the Benchmark Atlas.)
| Layer | What it tests | Benchmarks |
|---|---|---|
| Perception | OCR accuracy | olmOCR-Bench |
| Structure (Layout) | Page segmentation | PubLayNet, DocLayNet |
| Structure (Tables) | Table parsing | PubTabNet |
| Semantics | Field extraction | FUNSD, CORD, SROIE |
| Reasoning | Document QA | DocVQA |
Pick one benchmark per layer that matches your document type. Don't chase leaderboard positions across unrelated benchmarks.
The Reproducibility Problem
Not all benchmark results are equally trustworthy.
Fully reproducible: Most academic benchmarks (FUNSD, PubTabNet, the public splits of DocVQA) release both data and evaluation code. You can download them, run your model, and verify any claimed result.
Partially reproducible: Some benchmarks (SROIE, DocVQA test sets) withhold test labels, so you submit to a leaderboard for evaluation. This is fine for fair comparison but means you can't debug failures on the test set.
Not reproducible: Vendor benchmarks and many "state-of-the-art" claims use private datasets. When a company says "98% accuracy," ask: on what data? If they can't or won't share, treat the number as marketing, not science.
If a result can't be reproduced on public data with public evaluation code, discount it heavily.
What I Actually Do Now
Step 1: Identify my layer. Am I trying to read text, parse tables, extract fields, or answer questions? This determines which benchmarks matter.
Step 2: Pick one public benchmark per relevant layer. I don't try to evaluate everything. One benchmark that matches my document type is enough for initial comparison.
Step 3: Run a baseline. Before trying anything fancy, I test a simple, well-documented tool (PaddleOCR for perception, LayoutLM for semantics). This gives me a floor to compare against.
Step 4: Test on my actual documents. Public benchmarks are proxies. A model might ace FUNSD and fail on my specific forms. I always reserve some real documents for final validation.
Step 5: Diagnose by layer. When things break, I figure out which layer failed. Is the text being read wrong (perception)? Is the layout being parsed wrong (structure)? Is the field mapping wrong (semantics)? This prevents chasing phantom problems.
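A minimal sketch of a by-layer diagnosis loop, under loud assumptions: the `StageOutput` schema, the exact-match checks, and the sample data are all hypothetical. A real loop would likely replace exact equality with per-layer thresholds (for example, a character error rate limit for perception).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StageOutput:
    """What a pipeline produced for one document, stage by stage (hypothetical schema)."""
    text: str                # layer 2: OCR text
    table_rows: int          # layer 3: number of table rows recovered
    fields: dict[str, str]   # layer 4: extracted key-value fields


def first_failing_layer(predicted: StageOutput, expected: StageOutput) -> str:
    """Return the earliest layer whose output disagrees with ground truth."""
    if predicted.text != expected.text:
        return "perception"
    if predicted.table_rows != expected.table_rows:
        return "structure"
    if predicted.fields != expected.fields:
        return "semantics"
    return "ok"


def diagnose(pairs: list[tuple[StageOutput, StageOutput]]) -> Counter:
    """Tally outcomes by layer across (predicted, expected) pairs."""
    return Counter(first_failing_layer(pred, exp) for pred, exp in pairs)


truth = StageOutput("Total Due: $1,339.20", 3, {"total": "1,339.20"})
pairs = [
    (StageOutput("Total Due: $1,339.20", 3, {"total": "1,339.20"}), truth),  # all layers agree
    (StageOutput("Total Due: $1,339.2O", 3, {"total": "1,339.20"}), truth),  # misread character
    (StageOutput("Total Due: $1,339.20", 2, {"total": "1,339.20"}), truth),  # dropped table row
    (StageOutput("Total Due: $1,339.20", 3, {"total": "1,240.00"}), truth),  # wrong field mapped
    (StageOutput("Total Due: $1,339.20", 3, {}), truth),                     # field missing
]
counts = diagnose(pairs)
print(counts)
```

Attributing each failure to the earliest broken layer keeps the tally honest: a misread character will usually also corrupt the extracted field, but the fix belongs in perception, not in the extraction logic.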
The Uncomfortable Truths
Generic models underperform specialized ones. GPT-4V is impressive for general document understanding, but specialized table parsers (RapidTable, TableFormer) still beat it on table structure tasks. For production systems, the unglamorous specialized tool often wins.
Benchmark datasets don't represent your data. FUNSD has 199 documents. CORD has 1,000 receipts. Your pipeline will see thousands of document variations these benchmarks never imagined. Treat benchmark scores as a floor for comparison, not a prediction of production performance.
The "last 5%" is where all the cost lives. Going from 85% to 90% accuracy is engineering. Going from 95% to 99% is a different kind of problem entirely, one that benchmarks can't tell you much about because they measure the average case and you care about the worst case.
If You Only Remember Three Things
- Document AI is not one task. OCR, layout, extraction, and document QA are different layers with different failure modes.
- High OCR accuracy can still produce a bad workflow. Reading the text correctly is not the same as extracting the right fields reliably.
- Benchmark scores narrow the field. They do not replace testing on your own documents.
What I Would Do In Practice
If I had to evaluate a document workflow tomorrow, this is the sequence:
- Pick the exact task layer that matters most.
- Choose one public benchmark that matches that layer.
- Run one boring baseline before touching anything fancy.
- Test on actual documents as early as possible.
- Diagnose failures by stage, not by vague impressions like "the OCR is bad."
That is less exciting than screenshotting a leaderboard. It is also how you avoid wasting weeks.
Resources
- Benchmark Atlas — Full details on each benchmark: what it measures, performance ranges, gotchas, how to run evaluations
- olmOCR-Bench — Our live benchmark data for PDF linearization quality (perception layer)
- What We Learned Reproducing olmOCR-Bench — Our own benchmark reproduction notes, including where methodology changed the score
- OmniDocBench — Comprehensive multi-layer evaluation suite
- DocVQA Leaderboard — Current standings for document QA
- PapersWithCode Document AI — Aggregated leaderboards