OmniDocBench v1.5

The driver's license exam for PDF parsing. Not “can you read text?” but “can you reconstruct the document?”

The Driver's License Exam for PDF Parsing

Think of OmniDocBench as a driving test, not an eye exam. It doesn't ask “can you see the letters?” It asks:

  • Can you reconstruct the page: paragraphs, headings, tables, formulas, figures?
  • Can you keep the structure intact, especially tables, which are where most models crash?
  • Can you keep the reading order intact? Multi-column pages are where models get drunk.
  • Can you output something machine-usable? The benchmark standardizes on Markdown output and scores it automatically.

The benchmark is from OpenDataLab / Shanghai AI Laboratory (CVPR 2025). It covers 1,355 PDF pages across 9 document types, 4 layout types, and 3 language types, each with rich annotations for boxes, content, reading order, and element affiliations.

Why You Should Care

In real products, “OCR” is rarely the end goal. You typically want one of these downstream outcomes:

RAG Ingestion

Chunking a PDF correctly depends on layout + reading order. Wrong order silently ruins your vector embeddings.

Compliance & Extraction

Tables and formulas matter, not just plain text. Financial reports with garbled tables are useless.

Doc-to-Structured

You want Markdown/HTML/JSON you can diff, store, render, or validate—not a blob of text.

OmniDocBench punishes the classic failure modes: wrong column order, table cells shuffled, formula mangled, captions attached to the wrong figure, paragraph splits in the middle of a sentence. If your parser passes here, it can probably handle your production documents.

OmniDocBench v1.5 Leaderboard

Results from the DeepSeek-OCR-2 paper. Models with * use pipeline approaches. V-token is the max visual-token budget per page. DeepSeek rows are their own runs; others are from the OmniDocBench repo.

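| Model | V-token | Text Edit↓ | Formula CDM↑ | Table TEDs↑ | Table TEDSs↑ | R-Order Edit↓ | Overall↑ |
|---|---|---|---|---|---|---|---|
| PaddleOCR-VL* | - | 0.035 | 91.22 | 90.89 | 94.76 | 0.043 | 92.86 |
| DeepSeek-OCR 2 (paper) | 1,120 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 | 91.09 |
| MinerU2.5* | - | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 | 90.67 |
| Qwen3-VL-235B | >6000 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 | 89.15 |
| MonkeyOCR-pro-3B* | - | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 | 88.85 |
| OCRVerse | >6000 | 0.058 | 86.91 | 84.55 | 88.45 | 0.071 | 88.56 |
| dots.ocr | >6000 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 | 88.41 |
| Gemini-2.5 Pro | - | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 | 88.03 |
| DeepSeek-OCR (9-crops) (paper) | 1,156 | 0.073 | 84.14 | 85.25 | 89.01 | 0.085 | 87.36 |
| MonkeyOCR-3B* | - | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 | 87.13 |
| Qwen2.5-VL-72B | >6000 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 | 87.02 |
| MonkeyOCR-pro-1.2B* | - | 0.084 | 85.02 | 84.24 | 89.02 | 0.13 | 86.96 |
| PP-StructureV3* | - | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 | 86.73 |
| MinerU2-VLM | >7000 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 | 85.56 |
| Nanonets-OCR-s | >7000 | 0.093 | 85.9 | 80.14 | 85.57 | 0.108 | 85.59 |
| Dolphin-1.5* | - | 0.092 | 80.78 | 78.06 | 84.1 | 0.08 | 83.21 |
| InternVL3.5-241B | >7000 | 0.142 | 87.23 | 75 | 81.28 | 0.125 | 82.67 |
| olmOCR | >6000 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 | 81.79 |
| POINTS-Reader | >6000 | 0.134 | 79.2 | 77.13 | 81.66 | 0.145 | 80.98 |
| InternVL3 | >7000 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 | 80.33 |
| GPT-4o | - | 0.217 | 79.7 | 67.07 | 76.09 | 0.148 | 75.02 |
| OCRFlux | >6000 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 | 74.82 |
| Dolphin* | - | 0.125 | 67.85 | 68.7 | 77.77 | 0.124 | 74.67 |
| MinerU2-pp* | - | 0.209 | 76.55 | 70.9 | 79.11 | 0.225 | 71.51 |
| Marker-1.8.2* | - | 0.206 | 76.66 | 57.88 | 71.17 | 0.25 | 71.3 |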

Data source: DeepSeek-OCR-2 Paper, DeepSeek-OCR-2 HuggingFace, OmniDocBench GitHub. Last updated: 2026-01-28.

Visual Tokens: Why They Matter More Than You Think

A visual token is a learned embedding of a small image region: the model's “retina patches.” V-token max tells you how many patches this model looked at per page, at most.

Tokens = Latency + Cost + Context Burn

Attention is O(n²). A model using 6,000 tokens vs DeepSeek's 1,120 means ~5.4× more tokens → ~29× more attention work (5.4² ≈ 29). At production scale, that's brutal for latency, API cost, and multi-page context windows.
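
The arithmetic behind that claim is easy to check yourself. Below is a minimal sketch that assumes attention cost scales purely with the square of the token count and ignores everything else in the model; the helper name `attention_cost_ratio` is made up for illustration.

```python
def attention_cost_ratio(tokens_a: int, tokens_b: int) -> tuple[float, float]:
    """Return (token ratio, quadratic attention ratio) of model A relative to model B."""
    token_ratio = tokens_a / tokens_b
    # Assumption: attention work grows as n^2, so the cost ratio is the token ratio squared.
    return token_ratio, token_ratio ** 2


token_ratio, attn_ratio = attention_cost_ratio(6000, 1120)
print(f"{token_ratio:.1f}x tokens, {attn_ratio:.0f}x attention work")
# prints "5.4x tokens, 29x attention work"
```

Real latency won't follow this curve exactly, since the non-attention parts of a transformer scale closer to linearly, but it gives a rough ceiling on how fast costs can grow.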

How DeepSeek Gets 1,120 While Others Need 6,000+

DeepSeek-OCR-2 uses a dynamic resolution scheme that mimics how humans look at documents:

  1. 1 global view at 1024×1024 → 256 tokens (the “glance”)
  2. 0-6 local crops at 768×768 → 144 tokens each (the “fixations” on dense regions)
  3. Total: 256 + (0–6) × 144 = 256–1,120 tokens

One full-page scan, then zoomed fixations where dense information lives. Other models use dense grids without this compression—hence 6,000+ counts.
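
To make the budget concrete, here is a small sketch of the token count as a function of how many local crops a page gets. The per-view token counts come from the list above; the constant and function names are illustrative, not taken from the DeepSeek code.

```python
GLOBAL_VIEW_TOKENS = 256   # one 1024x1024 global view
LOCAL_CROP_TOKENS = 144    # each 768x768 local crop
MAX_LOCAL_CROPS = 6


def visual_tokens(num_local_crops: int) -> int:
    """Total visual tokens for one page given the number of local crops used."""
    if not 0 <= num_local_crops <= MAX_LOCAL_CROPS:
        raise ValueError(f"expected 0..{MAX_LOCAL_CROPS} local crops, got {num_local_crops}")
    return GLOBAL_VIEW_TOKENS + num_local_crops * LOCAL_CROP_TOKENS


budgets = [visual_tokens(n) for n in range(MAX_LOCAL_CROPS + 1)]
print(budgets)  # [256, 400, 544, 688, 832, 976, 1120]
```

A sparse page with no crops costs 256 tokens; only the densest pages hit the 1,120 ceiling shown in the leaderboard.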

Quick Mental Model

  • High score + low V-token: Efficient eyes—good for production at scale.
  • High score + huge V-token: Brute-force eyesight—great quality, but you pay in latency and cost.
  • Pipeline (—): No V-token because they use specialized encoders per component, not a unified VLM.

The 5 Metrics: What They Actually Test

OmniDocBench breaks “document reading” into sub-skills. Arrows indicate direction: ↑ higher is better, ↓ lower is better.

Text Edit↓ — Did You Read the Words Right?

Normalized edit distance between OCR output and ground truth. A score of 0.035 means roughly 3.5% of characters needed an insertion, deletion, or substitution to match. Lower is better; 0.0 would be perfect. Under 0.05 is good; over 0.1 means something is broken.
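
If you want a feel for the metric, the sketch below computes a normalized edit distance between two strings. It assumes normalization by the longer string's length and skips whatever text cleanup and block matching the official evaluator does, so treat it as an illustration rather than the benchmark's scoring code.

```python
def levenshtein(a, b) -> int:
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(pred), len(truth))
    return 0.0 if longest == 0 else levenshtein(pred, truth) / longest


# One wrong character out of 13 gives about 0.077.
print(round(normalized_edit_distance("reading ordor", "reading order"), 3))
```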

Formula CDM↑ — Did You Preserve the Math?

Character Detection Matching for formulas. Tests whether your LaTeX output correctly represents the original—superscripts in the right place, fractions intact, Greek letters not garbled. Higher is better; over 85 is solid, over 90 is state-of-the-art.

Table TEDs↑ — Did You Parse the Table?

Tree-Edit-Distance-based Similarity. Tables are converted to HTML, then compared as trees: the fewer node edits needed to turn the predicted tree into the ground-truth tree, the higher the score. This version includes cell content, so you get dinged for both wrong structure and wrong text inside cells. Over 85 is good; under 75 means broken tables.

Table TEDSs↑ — Structure Only

Same as TEDs but ignoring cell content. Isolates “did you find the table skeleton?” from “did you OCR the cells correctly?” If TEDSs is high but TEDs is low, your layout detection is good but your in-table OCR is failing.

R-Order Edit↓ — Did You Get Reading Order Right?

Normalized edit distance over the sequence of text blocks. Critical for multi-column layouts, sidebars, and footnotes. High scores mean the model scrambled content order—column 2 before column 1, or footnotes mixed into body text. Under 0.08 is good; over 0.15 is serious.
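
Here is a toy version of the same idea applied to block order instead of characters. It assumes predicted blocks have already been matched to ground-truth blocks and are represented by made-up IDs like `col1-p1`; the real evaluator has its own matching step, which this sketch does not attempt to reproduce.

```python
from functools import lru_cache


def reading_order_edit(pred: list[str], truth: list[str]) -> float:
    """Normalized edit distance between two sequences of block IDs."""

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        if i == len(pred):
            return len(truth) - j   # insert the remaining truth blocks
        if j == len(truth):
            return len(pred) - i    # delete the remaining predicted blocks
        if pred[i] == truth[j]:
            return dist(i + 1, j + 1)
        return 1 + min(dist(i + 1, j), dist(i, j + 1), dist(i + 1, j + 1))

    longest = max(len(pred), len(truth))
    return 0.0 if longest == 0 else dist(0, 0) / longest


truth = ["col1-p1", "col1-p2", "col2-p1", "col2-p2"]
# A parser that reads straight across both columns instead of down each one:
pred = ["col1-p1", "col2-p1", "col1-p2", "col2-p2"]
print(reading_order_edit(pred, truth))  # 0.5
```

A single row-wise misread on a two-column page already lands far above the 0.15 "serious" threshold, which is why the best models on this metric sit below 0.06.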

Note: R-Order Edit isn't included in the Overall formula. It's reported separately because reading order is its own special kind of pain. If your use case involves multi-column documents, don't rely on Overall alone.

How “Overall” Is Calculated

The v1.5 Overall score isn't an average of all five metrics. It's explicitly defined as:

Overall = ((1 - Text_Edit) × 100 + Table_TEDS + Formula_CDM) / 3

It grades you on three pillars: text extraction, table parsing, and formula recognition. Reading order and structure-only table metrics are tracked separately.
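
As a sanity check, the formula can be run against leaderboard rows. The sketch below is a direct transcription of the definition above; the function name `overall_score` is just for illustration, and the inputs are the DeepSeek-OCR 2 and olmOCR rows.

```python
def overall_score(text_edit: float, table_teds: float, formula_cdm: float) -> float:
    """Overall = ((1 - Text_Edit) * 100 + Table_TEDS + Formula_CDM) / 3."""
    return ((1 - text_edit) * 100 + table_teds + formula_cdm) / 3


# DeepSeek-OCR 2: Text Edit 0.048, Table TEDs 87.75, Formula CDM 90.31
print(round(overall_score(0.048, 87.75, 90.31), 2))  # 91.09

# olmOCR: Text Edit 0.096, Table TEDs 68.92, Formula CDM 86.04
print(round(overall_score(0.096, 68.92, 86.04), 2))  # 81.79
```

Recomputing from the rounded columns can differ from the published Overall by a few hundredths, since the published figure was presumably computed from unrounded values.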

Pipeline vs End-to-End: The Architecture Split

The leaderboard shows two fundamentally different approaches:

Pipeline Systems (*)

PaddleOCR-VL, MinerU2.5, and MonkeyOCR chain specialized components: layout detector → OCR → table parser → formula recognizer. Each component is optimized for its specific task.

  • Pro: Best-in-class on individual capabilities. PaddleOCR-VL leads with 92.86.
  • Con: Complex deployment, error propagation between stages, more moving parts.

End-to-End VLMs

DeepSeek-OCR 2, Qwen2.5-VL, and Gemini-2.5 Pro process the entire page in one forward pass. The architecture is simpler, but the model must learn all tasks simultaneously.

  • Pro: Single model, simpler deployment, holistic understanding.
  • Con: Historically underperformed—until DeepSeek-OCR 2 showed you can hit 91+ with just 1,120 tokens.

Version Drift Warning

v1.5 was a major update. If you're comparing numbers across papers, note what changed:

  • More pages than v1.0
  • Higher-resolution images for some document types
  • Updated “hybrid matching” evaluation logic that merges/splits to find best matches (reduces unfair penalties from minor formatting differences)
  • Leaderboard positions shifted

A model that scored 88 on v1.0 might score 85 or 91 on v1.5; the two versions are not directly comparable. This page uses v1.5 results only.

Frequently Asked Questions

V-token (visual tokens) is the maximum number of image patches the model processes per page. Think of it as the model's "retina resolution" — each token is a learned embedding of a small image region.

Why it matters:

  • Cost — More tokens = higher API costs or GPU time
  • Latency — Attention is O(n²), so 5× more tokens ≈ 25× more compute
  • Context — Multi-page docs exhaust context windows fast

DeepSeek-OCR 2 uses only 1,120 tokens while most competitors use 6,000+, making it ~5× more efficient with comparable quality. This translates to significant cost savings at production scale.

Resources