What does V-token mean in the OmniDocBench leaderboard?

V-token (visual tokens) is the maximum number of image patches the model processes per page. Think of it as the model's "retina resolution": each token is a learned embedding of a small image region.

Why it matters:
• Cost: more tokens mean higher API costs or more GPU time
• Latency: attention is O(n²), so 5× more tokens ≈ 25× more attention compute
• Context: multi-page docs exhaust context windows fast

DeepSeek-OCR 2 uses at most 1,120 tokens while most competitors use 6,000+, so it needs roughly 5× fewer visual tokens at comparable quality. This translates to significant cost savings at production scale.

What's the difference between OmniDocBench and olmOCR-Bench?

They test different aspects of document processing.

OmniDocBench v1.5 evaluates end-to-end document parsing with granular metrics:
• Text extraction (edit distance)
• Table structure (TEDS)
• Formula recognition (CDM)
• Reading order preservation

olmOCR-Bench uses pass/fail "unit tests" for PDF linearization. It is more like "did you preserve the important bits?" than a quality score.

When to use each:
• OmniDocBench gives more detailed diagnostics, which is useful for understanding where your model fails
• olmOCR-Bench is faster to run and covers more document types (arXiv, historical scans, etc.)

Both are valuable; they're complementary, not competing. See our olmOCR-Bench deep dive.

Why do pipeline models show '—' for V-token?

Pipeline systems (marked with *) use separate specialized components (a layout detector, an OCR engine, a table parser, a formula recognizer), each with its own vision encoder. There's no single "visual token budget" to report because:
• Each component may process images differently
• Some components don't use transformers at all
• The total compute isn't reducible to one number

The V-token metric only makes sense for unified VLM architectures that process the whole page in one forward pass.

Who created OmniDocBench?

OmniDocBench is from Shanghai AI Laboratory and academic collaborators, published at CVPR 2025.
• Dataset & toolkit: Hosted on OpenDataLab, their data platform
• Code: Available on GitHub
• Paper affiliations: Shanghai AI Laboratory, plus collaborators from Abaka AI and 2077AI

OpenDataLab is the data/benchmark platform project under Shanghai AI Lab. Think of it as their "data hub" for AI research.

Can I compare OmniDocBench v1.0 and v1.5 scores?

No. v1.5 was a major update with breaking changes:
• More pages in the dataset
• Higher-resolution images for some document types (newspapers, notes)
• Updated "hybrid matching" evaluation logic that merges/splits predictions to find best matches
• Different penalty behavior for minor formatting differences

A model that scored 88 on v1.0 might score 85 or 91 on v1.5. The benchmarks aren't directly comparable. Always check which version a paper used before comparing results. This page uses v1.5 results only.

What does '(paper)' mean next to DeepSeek models?

The DeepSeek rows are the paper authors' own evaluation runs, overlaid onto the OmniDocBench reference table.
• DeepSeek rows: Evaluated by the DeepSeek team for their paper
• Other rows: From the official OmniDocBench repository leaderboard

Both use the same evaluation protocol and code, so results are comparable. The label just indicates the source of the evaluation, not a different methodology.

Why isn't Reading Order included in the Overall score?

The Overall formula explicitly focuses on three core pillars:

```
Overall = ((1 - Text_Edit) × 100 + Table_TEDS + Formula_CDM) / 3
```

Reading order is tracked separately because:
• It's a distinct challenge from text/table/formula quality
• It matters a lot for multi-column documents but less for simple single-column pages
• Including it would dilute the signal for the core capabilities

If your use case involves complex layouts (academic papers, newspapers, financial reports), don't rely on Overall alone. Check the R-Order Edit column specifically.

Leaderboard

OmniDocBench v1.5

The driver's license exam for PDF parsing. Not “can you read text?” but “can you reconstruct the document?”

The Driver's License Exam for PDF Parsing

Think of OmniDocBench as a driving test, not an eye exam. It doesn't ask “can you see the letters?” It asks:

Can you reconstruct the page: paragraphs, headings, tables, formulas, figures?
Can you keep the structure intact, especially tables, which are where most models crash?
Can you keep the reading order intact? Multi-column pages are where models get drunk.
Can you output something machine-usable? The benchmark standardizes on Markdown output and scores it automatically.

The benchmark is from OpenDataLab / Shanghai AI Laboratory (CVPR 2025). 1,355 PDF pages across 9 document types, 4 layout types, 3 language types—each with rich annotations for boxes, content, reading order, and element affiliations.

Why You Should Care

In real products, “OCR” is rarely the end goal. You typically want one of these downstream outcomes:

RAG Ingestion

Chunking a PDF correctly depends on layout + reading order. Wrong order silently ruins your vector embeddings.

Compliance & Extraction

Tables and formulas matter, not just plain text. Financial reports with garbled tables are useless.

Doc-to-Structured

You want Markdown/HTML/JSON you can diff, store, render, or validate—not a blob of text.

OmniDocBench punishes the classic failure modes: wrong column order, table cells shuffled, formula mangled, captions attached to the wrong figure, paragraph splits in the middle of a sentence. If your parser passes here, it can probably handle your production documents.

OmniDocBench v1.5 Leaderboard

Results from the DeepSeek-OCR-2 paper. Models with * use pipeline approaches. V-token is the max visual-token budget per page. DeepSeek rows are their own runs; others are from the OmniDocBench repo.

Model	V-token	Text Edit↓	Formula CDM↑	Table TEDs↑	Table TEDSs↑	R-Order Edit↓	Overall↑
PaddleOCR-VL*	—	0.035	91.22	90.89	94.76	0.043	92.86
DeepSeek-OCR 2 (paper)	1,120	0.048	90.31	87.75	92.06	0.057	91.09
MinerU2.5*	—	0.047	88.46	88.22	92.38	0.044	90.67
Qwen3-VL-235B	>6000	0.069	88.14	86.21	90.55	0.068	89.15
MonkeyOCR-pro-3B*	—	0.075	87.25	86.78	90.63	0.128	88.85
OCRVerse	>6000	0.058	86.91	84.55	88.45	0.071	88.56
dots.ocr	>6000	0.048	83.22	86.78	90.62	0.053	88.41
Gemini-2.5 Pro	—	0.075	85.82	85.71	90.29	0.097	88.03
DeepSeek-OCR (9-crops) (paper)	1,156	0.073	84.14	85.25	89.01	0.085	87.36
MonkeyOCR-3B*	—	0.075	87.45	81.39	85.92	0.129	87.13
Qwen2.5-VL-72B	>6000	0.094	88.27	82.15	86.22	0.102	87.02
MonkeyOCR-pro-1.2B*	—	0.084	85.02	84.24	89.02	0.13	86.96
PP-StructureV3*	—	0.073	85.79	81.68	89.48	0.073	86.73
MinerU2-VLM	>7000	0.078	80.95	83.54	87.66	0.086	85.56
Nanonets-OCR-s	>7000	0.093	85.9	80.14	85.57	0.108	85.59
Dolphin-1.5*	—	0.092	80.78	78.06	84.1	0.08	83.21
InternVL3.5-241B	>7000	0.142	87.23	75	81.28	0.125	82.67
olmOCR	>6000	0.096	86.04	68.92	74.77	0.121	81.79
POINTS-Reader	>6000	0.134	79.2	77.13	81.66	0.145	80.98
InternVL3	>7000	0.131	83.42	70.64	77.74	0.113	80.33
GPT-4o	—	0.217	79.7	67.07	76.09	0.148	75.02
OCRFlux	>6000	0.193	68.03	75.75	80.23	0.202	74.82
Dolphin*	—	0.125	67.85	68.7	77.77	0.124	74.67
MinerU2-pp*	—	0.209	76.55	70.9	79.11	0.225	71.51
Marker-1.8.2*	—	0.206	76.66	57.88	71.17	0.25	71.3

Data source: DeepSeek-OCR-2 Paper • DeepSeek-OCR-2 HuggingFace • OmniDocBench GitHub. Last updated: 2026-01-28.

Visual Tokens: Why They Matter More Than You Think

A visual token is a learned embedding of a small image region: the model's “retina patches.” V-token^max, the V-token column in the leaderboard, tells you how many patches this model looked at per page, at most.

Tokens = Latency + Cost + Context Burn

Attention is O(n²). A model using 6,000 tokens versus DeepSeek's 1,120 processes ~5.4× more tokens, which means ~29× more attention work. At production scale, that's brutal for latency, API cost, and multi-page context windows.
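The quadratic claim is easy to check by hand. The sketch below assumes that only the pairwise self-attention term matters and that it scales exactly as n²; the function name `attention_cost_ratio` is illustrative, not part of any library. Real end-to-end cost also includes parts that grow linearly with token count, so the measured slowdown will likely land somewhere between the token ratio and the squared ratio.

```python
def attention_cost_ratio(tokens_a: int, tokens_b: int) -> float:
    """Ratio of pairwise attention work, assuming cost grows as n**2."""
    return (tokens_a / tokens_b) ** 2


dense_tokens = 6000    # lower bound listed for dense-grid models (">6000")
compact_tokens = 1120  # DeepSeek-OCR 2 per-page maximum

token_ratio = dense_tokens / compact_tokens
work_ratio = attention_cost_ratio(dense_tokens, compact_tokens)

print(round(token_ratio, 2))  # 5.36
print(round(work_ratio, 1))   # 28.7
```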

How DeepSeek Gets 1,120 While Others Need 6,000+

DeepSeek-OCR-2 uses a dynamic resolution scheme that mimics how humans look at documents:

1 global view at 1024×1024 → 256 tokens (the “glance”)
0–6 local crops at 768×768 → 144 tokens each (the “fixations” on dense regions)
Total: 256 + (0–6) × 144 = 256–1,120 tokens

One full-page scan, then zoomed fixations where dense information lives. Other models use dense grids without this compression—hence 6,000+ counts.
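The budget arithmetic above can be written out as a small sketch. The constants come straight from the numbers quoted in this section; the helper names `visual_tokens` and `pixels_per_token_side` are hypothetical. The second helper rests on an extra assumption that is not stated in the source: that each view is tiled by a square grid of equally sized tokens.

```python
import math

GLOBAL_TOKENS = 256  # one 1024x1024 global view
CROP_TOKENS = 144    # each 768x768 local crop
MAX_CROPS = 6


def visual_tokens(num_crops: int) -> int:
    """Per-page visual tokens for a given number of local crops."""
    if not 0 <= num_crops <= MAX_CROPS:
        raise ValueError(f"num_crops must be between 0 and {MAX_CROPS}")
    return GLOBAL_TOKENS + num_crops * CROP_TOKENS


def pixels_per_token_side(view_side_px: int, tokens: int) -> int:
    """Side length in pixels covered by one token (assumes a square grid)."""
    grid_side = math.isqrt(tokens)
    return view_side_px // grid_side


budgets = [visual_tokens(n) for n in range(MAX_CROPS + 1)]
print(budgets)                           # [256, 400, 544, 688, 832, 976, 1120]
print(pixels_per_token_side(1024, 256))  # 64
print(pixels_per_token_side(768, 144))   # 64
```

Under that square-grid assumption, both view types work out to one token per 64×64 pixel region, so the crops add coverage at the same density rather than a finer one.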

Quick Mental Model

High score + low V-token: Efficient eyes—good for production at scale.
High score + huge V-token: Brute-force eyesight—great quality, but you pay in latency and cost.
Pipeline (—): No V-token because they use specialized encoders per component, not a unified VLM.

The 5 Metrics: What They Actually Test

OmniDocBench breaks “document reading” into sub-skills. Arrows indicate direction: ↑ higher is better, ↓ lower is better.

Text Edit↓ — Did You Read the Words Right?

Normalized edit distance between OCR output and ground truth. A score of 0.035 means only 3.5% of characters needed correction. Lower is better; 0.0 would be perfect. Under 0.05 is good; over 0.1 means something is broken.
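To make the metric concrete, here is a minimal sketch of a normalized edit distance: plain Levenshtein distance divided by the length of the longer string. That normalization is an assumption made for illustration; OmniDocBench's own text preprocessing and normalization may differ, so treat this as a way to build intuition rather than a way to reproduce leaderboard numbers.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (char_a != char_b)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitute))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalization)."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 0.0
    return levenshtein(pred, truth) / longest


truth = "Total revenue: 1,200"
pred = "Total revenue: l,2OO"  # classic OCR confusions: 1 -> l, 0 -> O

print(levenshtein(pred, truth))               # 3
print(normalized_edit_distance(pred, truth))  # 0.15
```

Three wrong characters in a 20-character line already give 0.15, which shows how little room the 0.05 threshold leaves.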

Formula CDM↑ — Did You Preserve the Math?

Character Detection Matching for formulas. Tests whether your LaTeX output correctly represents the original—superscripts in the right place, fractions intact, Greek letters not garbled. Higher is better; over 85 is solid, over 90 is state-of-the-art.

Table TEDs↑ — Did You Parse the Table?

Tree-Edit-Distance-based Similarity. Tables are converted to HTML, and the predicted and ground-truth HTML trees are compared by tree edit distance. This version includes cell content, so you get dinged for both wrong structure and wrong text inside cells. Over 85 is good; under 75 means broken tables.

Table TEDSs↑ — Structure Only

Same as TEDs but ignoring cell content. Isolates “did you find the table skeleton?” from “did you OCR the cells correctly?” If TEDSs is high but TEDs is low, your table-structure recognition is good but your in-table OCR is failing.
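One rough way to apply this diagnostic is to look at the gap between the two table columns. The sketch below uses four rows copied from the leaderboard above; the `content_gap` helper and the idea of ranking by gap are illustrative heuristics, not something the benchmark defines.

```python
# (Table TEDs, Table TEDSs) pairs copied from the leaderboard above.
table_scores = {
    "PaddleOCR-VL": (90.89, 94.76),
    "DeepSeek-OCR 2": (87.75, 92.06),
    "PP-StructureV3": (81.68, 89.48),
    "Marker-1.8.2": (57.88, 71.17),
}


def content_gap(teds: float, teds_s: float) -> float:
    """Points lost once cell content is scored on top of structure."""
    return round(teds_s - teds, 2)


gaps = {name: content_gap(*scores) for name, scores in table_scores.items()}
ranked = sorted(gaps, key=gaps.get, reverse=True)

for name in ranked:
    print(f"{name}: {gaps[name]:.2f}")
```

A larger gap suggests that more of the table error comes from text inside cells than from the table skeleton, though a low structure score can hide content problems as well.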

R-Order Edit↓ — Did You Get Reading Order Right?

Normalized edit distance over the sequence of text blocks. Critical for multi-column layouts, sidebars, and footnotes. High scores mean the model scrambled content order—column 2 before column 1, or footnotes mixed into body text. Under 0.08 is good; over 0.15 is serious.

Note: R-Order Edit isn't included in the Overall formula. It's reported separately because reading order is its own special kind of pain. If your use case involves multi-column documents, don't rely on Overall alone.
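The column-scrambling failure is easy to simulate. The sketch below treats a page as a sequence of block IDs and computes an edit distance over that sequence, normalized by the longer sequence. The block IDs, the helper names, and the normalization are all assumptions for illustration; the benchmark's actual matching of predicted blocks to ground-truth blocks is more involved.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of block IDs."""
    prev = list(range(len(b) + 1))
    for i, block_a in enumerate(a, start=1):
        curr = [i]
        for j, block_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (block_a != block_b)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitute))
        prev = curr
    return prev[-1]


def reading_order_edit(truth: Sequence[str], pred: Sequence[str]) -> float:
    """Edit distance over block order, scaled by the longer sequence."""
    longest = max(len(truth), len(pred))
    return edit_distance(truth, pred) / longest if longest else 0.0


# Two-column page: read the left column top to bottom, then the right column.
truth = ["L1", "L2", "L3", "R1", "R2", "R3"]
# A parser that reads straight across the page interleaves the columns.
row_wise = ["L1", "R1", "L2", "R2", "L3", "R3"]

print(round(reading_order_edit(truth, truth), 3))     # 0.0
print(round(reading_order_edit(truth, row_wise), 3))  # 0.667
```

Every block is present in the interleaved output, so a text-only metric could still look healthy while the order score is far past the 0.15 mark.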

How “Overall” Is Calculated

The v1.5 Overall score isn't an average of all five metrics. It's explicitly defined as:

Overall = ((1 - Text_Edit) × 100 + Table_TEDS + Formula_CDM) / 3

It grades you on three pillars: text extraction, table parsing, and formula recognition. Reading order and structure-only table metrics are tracked separately.
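As a sanity check, the formula can be applied to rows from the leaderboard above. This is a minimal sketch; the function name `overall` is illustrative. Because the table's inputs are already rounded, a recomputed value can differ from the published one in the last digit (PaddleOCR-VL comes out at 92.87 against a listed 92.86).

```python
def overall(text_edit: float, table_teds: float, formula_cdm: float) -> float:
    """v1.5 Overall: mean of text accuracy, table TEDS, and formula CDM."""
    return ((1 - text_edit) * 100 + table_teds + formula_cdm) / 3


# Inputs copied from the leaderboard rows: Text Edit, Table TEDs, Formula CDM.
deepseek_ocr_2 = overall(0.048, 87.75, 90.31)
gpt_4o = overall(0.217, 67.07, 79.7)

print(round(deepseek_ocr_2, 2))  # 91.09
print(round(gpt_4o, 2))          # 75.02
```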

Pipeline vs End-to-End: The Architecture Split

The leaderboard shows two fundamentally different approaches:

Pipeline Systems (*)

PaddleOCR-VL, MinerU2.5, and MonkeyOCR chain specialized components: layout detector → OCR → table parser → formula recognizer. Each component is optimized for its specific task.

Pro: Best-in-class on individual capabilities. PaddleOCR-VL tops the leaderboard with an Overall of 92.86.
Con: Complex deployment, error propagation between stages, more moving parts.

End-to-End VLMs

DeepSeek-OCR 2, Qwen2.5-VL, and Gemini-2.5 Pro process the entire page in one forward pass. Simpler architecture, but the model must learn all tasks simultaneously.

Pro: Single model, simpler deployment, holistic understanding.
Con: Historically underperformed—until DeepSeek-OCR 2 showed you can hit 91+ with just 1,120 tokens.

Version Drift Warning

v1.5 was a major update. If you're comparing numbers across papers, note what changed:

More pages than v1.0
Higher-resolution images for some document types
Updated “hybrid matching” evaluation logic that merges/splits to find best matches (reduces unfair penalties from minor formatting differences)
Leaderboard positions shifted

A model that scored 88 on v1.0 might score 85 or 91 on v1.5; they're not directly comparable. This page uses v1.5 results only.

Resources

OmniDocBench GitHub — Code, configs, and Docker image to run your own evaluation
OpenDataLab — Dataset hosting (also mirrors to HuggingFace)
DeepSeek-OCR-2 — Model and paper that contributed the leaderboard data
olmOCR-Bench — Complementary benchmark for PDF linearization quality
Document Processing Benchmarks Guide — Mental models for understanding document AI evaluation