Which Document AI Benchmarks Actually Matter?
A clear guide to document processing benchmarks. Understand what each evaluation actually measures, when to use it, and what the scores mean.
This atlas complements Document Processing Benchmarks: Making Sense of the Chaos, which explains the mental model. Here, we dig into each benchmark in detail.
The core organizing principle: benchmarks measure different layers of the document processing stack. When you hear "this model achieves 95% accuracy," the first question is always: on which layer?
| Layer | The Question It Answers | Example Benchmarks |
|---|---|---|
| Perception | Did we read the text correctly? | SROIE, OmniDocBench OCR, IAM |
| Structure | Did we understand the layout and tables? | PubLayNet, DocLayNet, PubTabNet |
| Semantics | Did we extract the right fields? | FUNSD, CORD, DocILE |
| Reasoning | Can we answer questions about the document? | DocVQA, InfographicVQA |
If you compare scores across layers, you're comparing apples to fruit salad. A model can nail OCR and completely fail at field extraction. These are different skills.
Perception: Can the Model Read?
The oldest and most mature category. OCR benchmarks test whether the model can transcribe text from images. For clean printed English, this problem is largely solved. The interesting questions now are about edge cases: rotation, handwriting, multilingual text, degraded scans.
SROIE: The Receipt Baseline
SROIE (Scanned Receipts OCR and Information Extraction) was introduced at ICDAR 2019 as a three-part challenge: detect text regions, recognize the text, and extract four key fields (company name, date, address, total).
It's a good starting point because receipts are messy but standardized. The dataset has about 1,000 receipt images, and the training portion is fully public. Scoring on the test set requires a leaderboard submission.
When to use it: If you're building receipt processing or just want a quick OCR sanity check. The small field count (only 4) means scores hit ceiling effects quickly, so don't over-interpret high numbers.
What "good" looks like: Over 95% word accuracy on OCR, over 95% F1 on field extraction. State-of-the-art is around 97%+ F1.
The gotcha: High SROIE scores don't transfer to other document types. Receipts are a narrow domain, and a model tuned for them may struggle with contracts or scientific papers.
OmniDocBench OCR: The Stress Test
OmniDocBench (CVPR 2025) is a newer, more comprehensive evaluation that tests OCR across nine document types: academic papers, textbooks, forms, financial reports, handwritten notes, magazines, newspapers, exams, and research reports. It's bilingual (English and Chinese) and includes attribute breakdowns for rotation, background complexity, and more.
This is where you discover which OCR tools actually generalize. The results are revealing: PaddleOCR handles 90-degree rotated text with near-zero error, while Tesseract essentially fails (0.98 normalized error). That's not a minor difference—it's the difference between working and broken.
When to use it: If you need to evaluate OCR across varied document types, especially with edge cases like rotation or non-Latin scripts.
What "good" looks like: Under 0.05 normalized edit distance on clean text. Over 0.2 means something is seriously wrong. Performance varies dramatically by attribute—always check the breakdown.
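To make the 0.05 and 0.2 thresholds concrete, here is a minimal sketch of a normalized edit distance. It assumes the common normalization of dividing the Levenshtein distance by the length of the longer string. The function names and sample strings are illustrative, and the benchmark's own evaluation code may normalize differently.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction, reference):
    """Assumed normalization: distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


# Hypothetical OCR outputs for one text line.
reference = "Total due: 1,250.00"
clean = "Total due: 1,250.00"
noisy = "Tota1 due: 1.250,00"

print(normalized_edit_distance(clean, reference))
print(round(normalized_edit_distance(noisy, reference), 3))
```

Three wrong characters in a 19-character line already put the noisy output well above the 0.05 mark, which is why small-looking differences in this metric matter.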
The gotcha: This is just the OCR module of OmniDocBench. The full benchmark also covers layout, tables, and end-to-end evaluation. Don't conflate the scores.
IAM: The Handwriting Challenge
The IAM Handwriting Database is the classic benchmark for cursive English handwriting recognition. About 1,500 pages of handwritten text from 657 different writers, with line-by-line ground truth.
Handwriting remains genuinely hard. Even state-of-the-art transformer-based models such as TrOCR land at roughly 3% character error rate on IAM's test set, and word error rates run higher because a single wrong character makes the whole word count as wrong. That's impressive progress from a decade ago, but it still trails the 99%+ accuracy we see on clean printed text.
When to use it: If your pipeline needs to handle handwritten input. Be realistic about expectations.
What "good" looks like: Under 5% WER is strong. Under 3% CER is state-of-the-art.
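CER and WER are the same calculation applied at different granularity, which is why the two thresholds above are not interchangeable. The sketch below assumes the usual definition (edit distance divided by reference length); the function names and example strings are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One dropped letter in a four-word line.
reference = "the quick brown fox"
hypothesis = "the quick brwn fox"

print(round(cer(reference, hypothesis), 3))
print(wer(reference, hypothesis))  # 0.25
```

A single missing character costs about 5% CER here but 25% WER, so always check which of the two a reported handwriting score refers to.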
The gotcha: IAM uses clean, high-resolution scans of mostly neat handwriting. Real-world handwriting on forms, sticky notes, and whiteboards is messier. Expect worse performance outside the benchmark.
Structure: Does the Model Understand the Page?
Structure benchmarks test whether the model can segment a page into meaningful regions (paragraphs, tables, figures) and, for tables, parse the internal structure (rows, columns, cells). This is where many pipelines break.
PubLayNet: The Academic Standard
PubLayNet (IBM, 2019) is the most widely used layout detection benchmark. About 360,000 pages from PubMed Central scientific articles, automatically labeled into five categories: text, title, list, figure, and table.
Models trained on PubLayNet achieve impressive scores. DocLayout-YOLO and RoDLA hit around 96% mAP. The problem isn't performance—it's domain.
When to use it: If you're processing scientific papers or similar well-structured documents.
What "good" looks like: Over 90% mAP. State-of-the-art is around 96%.
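mAP is built on intersection over union (IoU) between predicted and ground-truth regions. The sketch below computes IoU for one box pair and counts how many thresholds it clears. The 0.50 to 0.95 threshold sweep is the COCO-style convention, assumed here. The box coordinates are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical table region: the prediction runs 40 pixels too far down.
truth = (100, 100, 500, 300)
pred = (110, 100, 500, 340)

overlap = iou(truth, pred)

# COCO-style sweep (assumed): thresholds 0.50, 0.55, ..., 0.95.
thresholds = [(50 + 5 * i) / 100 for i in range(10)]
matched = sum(overlap >= t for t in thresholds)

print(round(overlap, 3))
print(f"counts as a match at {matched} of {len(thresholds)} thresholds")
```

A box that looks nearly right to the eye still fails the strictest thresholds, which is part of why mAP numbers in the mid-90s are hard to reach.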
The critical gotcha: PubLayNet is entirely scientific papers. Models that score 96% here often drop to 75-80% on tax forms, invoices, or contracts. If you're not processing academic PDFs, this benchmark will mislead you. Use DocLayNet instead.
DocLayNet: The Real-World Test
DocLayNet (IBM, 2022) addresses PubLayNet's domain problem. About 80,000 pages from diverse sources: financial documents, legal filings, government reports, scientific papers, patents, and manuals. Manually annotated into 11 categories.
Scores here are about 15 points lower than PubLayNet. That's not because models got worse—it's because the test got more realistic. DocLayout-YOLO hits around 82% mAP on DocLayNet versus 96% on PubLayNet. The difference tells you how much domain shift matters.
When to use it: If you're building a production document processing system that handles varied document types. This is the benchmark that predicts real-world performance.
What "good" looks like: Over 75% mAP is solid. Over 80% is strong.
The gotcha: Even 11 categories may miss domain-specific elements like signature blocks, stamps, or watermarks. For specialized applications, you may need custom evaluation.
PubTabNet: Table Structure Recognition
PubTabNet (IBM, 2020) focuses specifically on table structure: given a table image, can the model reconstruct the rows, columns, and cell boundaries? About 568,000 tables from scientific papers, with HTML structure as ground truth.
The standard metric is TEDS (Tree-Edit-Distance Similarity), which treats the predicted and ground-truth tables as HTML trees and scores them by how many tree edits it takes to turn one into the other. A TEDS of 0.95+ means near-perfect structure recovery. Below 0.8 means visibly broken tables with merged cells, split columns, or wrong row assignments.
When to use it: If table extraction matters for your pipeline. Most document processing involves some tables.
What "good" looks like: Over 0.9 TEDS is good. Over 0.95 is excellent.
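For intuition about those thresholds, TEDS is commonly defined as one minus the tree edit distance divided by the node count of the larger tree. The sketch below applies only that normalization. Computing the tree edit distance itself is left to a dedicated library, and the function name and the node counts are illustrative assumptions.

```python
def teds(tree_edit_distance, pred_nodes, true_nodes):
    """TEDS = 1 - distance / max(tree sizes); 1.0 means identical trees."""
    largest = max(pred_nodes, true_nodes)
    if largest == 0:
        return 1.0  # two empty trees are trivially identical
    return 1.0 - tree_edit_distance / largest


# Hypothetical table whose HTML tree has 40 nodes (rows, cells, and so on).
print(round(teds(2, 40, 40), 2))  # two edits: a small slip in one or two cells
print(round(teds(8, 38, 40), 2))  # eight edits: several cells or a row are wrong
```

On a table this size, two wrong nodes keep the score at 0.95, while eight pull it down to 0.8. Smaller tables are punished more heavily for each mistake.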
The gotcha: Scientific tables are relatively simple. Financial tables (complex spanning cells, merged headers) are harder. If you process financial documents, look at FinTabNet instead.
FinTabNet: Financial Tables
FinTabNet (IBM, 2021) is PubTabNet for financial documents. About 113,000 tables from S&P 500 annual reports, with their complex multi-level headers, currency alignments, and percentage columns.
State-of-the-art TEDS on FinTabNet is around 0.90—significantly lower than PubTabNet's 0.95. Financial tables are genuinely harder.
When to use it: If you process financial documents, invoices, or anything with complex tabular data.
Semantics: Can the Model Extract the Right Fields?
Semantic benchmarks test whether the model understands what text means in context. Not just "there's a number here" but "this number is the invoice total, not the subtotal." This is where most production value lives.
FUNSD: Form Understanding Basics
FUNSD (Form Understanding in Noisy Scanned Documents) is the entry-level benchmark for form extraction. Just 199 forms, annotated with four entity types: question, answer, header, and other—plus linking relations between questions and their answers.
The small size means high variance in results, but it's a useful quick test. LayoutLMv3 achieves about 92% entity F1, up from 60% for a basic BERT baseline. That 30+ point improvement came from models that jointly understand text and spatial layout.
When to use it: As a quick sanity check for form understanding. Not as a final benchmark for production systems—it's too small and too generic.
What "good" looks like: Over 85% entity F1 is solid. Over 90% is strong.
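Entity F1 is strict: a predicted entity usually counts only if both its type and its span match the ground truth exactly. The sketch below assumes that exact-match convention and represents each entity as a hypothetical `(type, start_token, end_token)` tuple. Real evaluation scripts may differ in detail.

```python
def entity_f1(predicted, gold):
    """Entity-level F1 where an entity is correct only on an exact (type, span) match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one form.
gold = {
    ("question", 0, 2),
    ("answer", 2, 5),
    ("header", 5, 7),
    ("answer", 9, 12),
}
predicted = {
    ("question", 0, 2),
    ("answer", 2, 4),   # right type, but the span is one token short
    ("header", 5, 7),
}

print(round(entity_f1(predicted, gold), 3))
```

The truncated answer earns no partial credit and the missed answer hurts recall, so two of four gold entities found yields an F1 well below what the "mostly right" output might suggest.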
The gotcha: The entity types are generic (question/answer). Real forms have domain-specific fields like "Invoice Number" or "Ship-To Address." High FUNSD scores don't guarantee good performance on your specific form type.
CORD: Receipts with Depth
CORD (Consolidated Receipt Dataset) goes deeper than SROIE. About 1,000 receipts with 30 entity types covering store information, menu items, prices, totals, and payment details. The receipts were collected from shops and restaurants in Indonesia, so much of the text is not English.
This is closer to what real receipt processing looks like. You need to extract not just "the total" but also individual line items, quantities, and unit prices. State-of-the-art is around 95% overall F1, but per-field accuracy varies—rare fields have lower accuracy.
When to use it: If you're building receipt or invoice processing that needs line-item extraction.
DocILE: The Business Document Benchmark
DocILE (Document Information Localization and Extraction, ICDAR 2023) is the most comprehensive open benchmark for business document extraction. About 6,700 documents with 55 field types, covering invoices, purchase orders, and other business forms.
Critically, DocILE includes Line Item Recognition (LIR)—extracting the tabular line items from an invoice. This is one of the hardest practical problems in document AI, and few models handle it well.
Baseline performance (LayoutLMv3) is around 70-80% F1 depending on field type. That's significantly lower than FUNSD or CORD. DocILE reflects the actual difficulty of business document extraction.
When to use it: If you're building production invoice processing or any system that needs to handle diverse business documents.
What "good" looks like: The field is still developing. 80%+ F1 on common fields is solid; rare fields remain challenging.
The gotcha: 55 field types means many fields are rare in the dataset, creating long-tail challenges. High aggregate F1 can mask poor performance on specific fields you need.
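The masking effect is easy to reproduce with a few counts. The sketch below compares a pooled (micro-averaged) F1 with per-field F1. The field names and the true positive, false positive, and false negative counts are invented for illustration and are not DocILE results.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts per field: (true positives, false positives, false negatives).
counts = {
    "invoice_number": (950, 30, 20),
    "total_amount": (930, 40, 30),
    "vendor_tax_id": (12, 10, 28),   # rare field with few examples
}

per_field = {name: f1(*c) for name, c in counts.items()}

# Micro average: pool the counts across fields, then compute one F1.
pooled = [sum(column) for column in zip(*counts.values())]
micro = f1(*pooled)

print(f"micro F1: {micro:.3f}")
for name, score in per_field.items():
    print(f"{name}: {score:.3f}")
```

The pooled score stays high because the two frequent fields dominate the counts, while the rare field is wrong more often than it is right. If that rare field is the one your workflow depends on, the aggregate number tells you very little.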
Reasoning: Can the Model Answer Questions?
Reasoning benchmarks test comprehension, not just extraction. Given a document and a question, can the model find and synthesize the answer?
DocVQA: The Main Event
DocVQA (Document Visual Question Answering) is the primary benchmark for document comprehension. About 50,000 questions across 12,000 document images, covering a wide range of question types from simple lookups to multi-step reasoning.
Human performance is around 94% exact match accuracy. When DocVQA launched in 2020, the best models achieved only 55%. Today, large vision-language models (Qwen-VL 2.5, GPT-4V) hit 80-85%. The gap is closing, but it's not closed.
When to use it: If your application involves open-ended questions about documents rather than predefined field extraction.
What "good" looks like: 80%+ accuracy puts you in the top tier of current models.
The gotcha: Questions vary wildly in difficulty. Some are trivial lookups ("What is the date?"), others require multi-step reasoning. The headline number averages over all of them. Also, DocVQA is single-page only—no cross-document or multi-page reasoning.
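One way to see past the averaged headline number is to break accuracy down by question category. The sketch below scores exact match after light normalization and reports both the overall figure and the per-category split. The category labels, the normalization rule, and the sample records are assumptions for illustration, not part of the benchmark's tooling.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def exact_match_by_category(records):
    """records: iterable of (category, prediction, accepted_answers) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, prediction, accepted in records:
        totals[category] += 1
        if normalize(prediction) in {normalize(a) for a in accepted}:
            hits[category] += 1
    total = sum(totals.values())
    overall = sum(hits.values()) / total if total else 0.0
    per_category = {c: hits[c] / totals[c] for c in totals}
    return overall, per_category


# Hypothetical predictions, tagged with a made-up difficulty category.
records = [
    ("lookup", "12 March 2019", ["12 March 2019"]),
    ("lookup", "acme corp", ["ACME Corp", "ACME Corporation"]),
    ("lookup", "$1,250.00", ["$1,250.00"]),
    ("multi_step", "3", ["4"]),
    ("multi_step", "Q2", ["Q3"]),
]

overall, per_category = exact_match_by_category(records)
print(overall)
print(per_category)
```

Here the overall score looks respectable only because lookups outnumber the multi-step questions, every one of which the model gets wrong.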
InfographicVQA: Visual Reasoning
InfographicVQA tests question answering on infographics: charts, diagrams, and mixed visual/text content. About 30,000 questions on 5,000 infographic images.
This is harder than DocVQA because pure text extraction won't work. The model needs to understand chart axes, legends, and visual relationships. Best models achieve only 50-60% accuracy—a significant gap to human performance.
When to use it: If you need to extract information from charts, graphs, or data visualizations.
End-to-End: Full Pipeline Evaluation
OmniDocBench Full: The Complete Picture
The full OmniDocBench evaluation tests the entire document parsing pipeline: OCR, layout analysis, table structure, formula recognition, and reading order. 981 pages across nine document types.
Key finding: Pipeline approaches with specialized components outperform general vision-language models. MinerU (with DocLayout-YOLO) leads, followed by Marker and Docling. General-purpose VLMs like GPT-4V perform lower on structured extraction tasks.
When to use it: If you need a single "can I ship this?" signal for your document parsing pipeline.
The gotcha: End-to-end scores hide which component failed. Use component-level evaluation to diagnose problems.
Quick Reference Tables
Metrics Decoder
| Metric | What It Measures | Good | Concerning |
|---|---|---|---|
| CER / WER | OCR error rate (lower is better) | Under 5% | Over 10% |
| Normalized Edit Distance | OCR error on a 0-1 scale (lower is better) | Under 0.05 | Over 0.2 |
| mAP | Region detection | Over 90% (matched domain) | Under 75% |
| TEDS | Table structure | Over 0.9 | Under 0.8 |
| Entity F1 | Field extraction | Over 90% | Under 80% |
| Exact Match | QA accuracy | 80%+ (SOTA range) | Under 60% |
Reproducibility Status
Most of these benchmarks are fully reproducible with public data and evaluation code:
| Benchmark | Public Data | Eval Code | Test Labels |
|---|---|---|---|
| SROIE | Yes | Yes | Leaderboard |
| OmniDocBench | Yes | Yes | Yes |
| PubLayNet | Yes | Yes | Yes |
| DocLayNet | Yes | Yes | Yes |
| PubTabNet | Yes | Yes | Yes |
| FUNSD | Yes | Yes | Yes |
| DocILE | Yes | Yes | Leaderboard |
| DocVQA | Yes | Yes | Leaderboard |
Choosing Your Benchmark
"I need to read text from documents" — Start with OmniDocBench OCR if you have varied documents, SROIE if receipts only. Test your edge cases (rotation, handwriting, languages).
"I need to extract tables" — PubTabNet for scientific documents, FinTabNet for financial. Use TEDS as your metric.
"I need to segment page layouts" — DocLayNet for diverse documents, PubLayNet only if you're processing academic papers.
"I need to extract form fields" — FUNSD for quick testing, DocILE for comprehensive evaluation. Watch out for domain mismatch.
"I need to answer questions about documents" — DocVQA. Expect 80-85% from current top models.
"I need a full pipeline evaluation" — OmniDocBench end-to-end.
If You Only Remember Three Things
- Most benchmark confusion is category confusion. OCR, layout, extraction, and QA scores are not interchangeable.
- Domain match beats leaderboard glamour. A benchmark that matches your actual documents is usually more useful than a famous one that does not.
- End-to-end scores are only the starting point. If the pipeline fails, you still need to know which component failed first.
How To Use This Atlas
Do not read this as a catalog to memorize. Use it as a decision aid.
Start with the task:
- text extraction
- layout understanding
- field extraction
- document question answering
- full-pipeline evaluation
Then pick the one benchmark here that is closest to your actual workload. One relevant benchmark beats five famous ones that test the wrong thing.
Resources
- OmniDocBench — Multi-stage evaluation suite
- DocILE — Business document benchmark
- DocVQA — Document question answering
- ICDAR Robust Reading Competition — Ongoing challenges
- PapersWithCode Document AI — Leaderboards