Docling: What IBM's 37K-Star Toolkit Actually Does

You need to extract data from PDFs. Simple enough, right? Run some OCR, get the text, feed it to your LLM.

Then you try it. The multi-column layout becomes word salad. Tables turn into meaningless strings of numbers. Your RAG pipeline chunks a paragraph in the middle of a sentence because it couldn't tell where sections end. And when you ask "where did this fact come from?"—you have no idea which page it was on.

This is the problem Docling solves. It's not another OCR tool. It's not an LLM. It's the boring infrastructure that reconstructs document structure—the thing you don't realize you need until you've spent a week debugging why your extraction pipeline keeps failing.

The Mental Model: Docling is a Compiler for Documents

Think of it like a compiler. A compiler doesn't just read source code character by character—it parses structure (functions, classes, blocks), understands relationships, and produces a clean intermediate representation you can work with.

Docling does the same for documents:

Parse the source — Extract text, coordinates, and visual elements from PDFs, DOCX, PPTX, HTML, even audio (WAV, MP3)
Recover structure — Use specialized AI models to identify what's a header, what's a table, what's a footnote, and crucially, the reading order
Output a structured representation — Not raw text, but a DoclingDocument with hierarchy, bounding boxes, and provenance

The output isn't "here's all the text." It's "here's paragraph 3 of section 2.1, which belongs under heading 'Methods', located at coordinates (x, y) on page 4." That context is what makes downstream processing actually work.
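To make that concrete, here is a minimal sketch of walking a converted document and reporting where each element came from. It assumes the `iterate_items()` method and the `label`, `text`, and `prov` attributes (with `page_no` and `bbox`) found in recent Docling releases; `describe_items` and `print_provenance` are hypothetical helper names, and attribute names may differ in your installed version.

```python
from typing import Any, Dict, Iterator


def describe_items(doc: Any) -> Iterator[Dict[str, Any]]:
    """Yield one record per document element with its nesting level and location.

    `doc` is expected to behave like a DoclingDocument: iterate_items()
    yields (item, level) pairs, and each item may carry a `prov` list
    whose entries expose `page_no` and a `bbox` with l/t/r/b coordinates.
    """
    for item, level in doc.iterate_items():
        label = getattr(item.label, "value", item.label)  # enum member or plain string
        text = getattr(item, "text", "")  # tables and pictures may have no text
        prov_list = getattr(item, "prov", None) or []
        first = prov_list[0] if prov_list else None
        yield {
            "level": level,
            "label": str(label),
            "text": text,
            "page": first.page_no if first else None,
            "bbox": (first.bbox.l, first.bbox.t, first.bbox.r, first.bbox.b) if first else None,
        }


def print_provenance(path: str) -> None:
    """Convert a file with Docling and print each element's provenance."""
    # Imported here so describe_items stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(path).document
    for rec in describe_items(doc):
        indent = "  " * rec["level"]
        print(f"{indent}[{rec['label']}] page={rec['page']} bbox={rec['bbox']} {rec['text'][:60]!r}")
```

Calling `print_provenance("document.pdf")` would then list every element with its page and bounding box, which is the information you need to answer "where did this fact come from?".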

What Docling Gets You (The Non-Trivial Parts)

Is Docling just "glue code"? Mostly yes—but it's glue you'd be annoyed to recreate. Here's what's actually custom:

Layout Analysis Model — A specialized object detector (RT-DETR architecture) trained on IBM's DocLayNet dataset. It visually identifies document regions: paragraphs, headers, figures, footnotes. This is what preserves the parent-child hierarchy that simple text extraction destroys.

TableFormer — IBM's transformer model specifically for table structure recognition. It reconstructs rows, columns, and cell spans—historically one of the hardest problems in PDF conversion. When your table has merged headers or spanning cells, this is what keeps it from becoming garbage.

Custom PDF Parser — They wrote their own (docling-parse) because existing Python libraries were too slow or inaccurate. It extracts text plus coordinates, and can render pages as images for the vision models.

DoclingDocument Data Model — The "AST for documents." Every element has a type, position, parent relationship, and provenance (which page, which region). This is what lets you reliably export to Markdown, JSON, or HTML while keeping structure consistent.
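As an illustration of what table structure recognition buys you, the sketch below expands recognized cells into a plain row-by-column grid, repeating the text of merged cells across every position they span. It assumes each cell exposes `text`, `start_row_offset_idx`, `end_row_offset_idx`, `start_col_offset_idx`, and `end_col_offset_idx`, and that a table's `data` carries `num_rows`, `num_cols`, and `table_cells`, as in recent docling-core releases. `cells_to_grid` is a hypothetical helper, not part of Docling.

```python
from typing import Any, List


def cells_to_grid(table_data: Any) -> List[List[str]]:
    """Expand table cells (including merged ones) into a dense grid of strings."""
    grid = [["" for _ in range(table_data.num_cols)] for _ in range(table_data.num_rows)]
    for cell in table_data.table_cells:
        # End offsets are treated as exclusive, so a 1x1 cell covers one position.
        for r in range(cell.start_row_offset_idx, cell.end_row_offset_idx):
            for c in range(cell.start_col_offset_idx, cell.end_col_offset_idx):
                grid[r][c] = cell.text
    return grid
```

With Docling installed, you could call `cells_to_grid(table.data)` for each `table` in `result.document.tables`. For a ready-made alternative, table items also offer `export_to_dataframe()`.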

What Docling doesn't build custom:

OCR — Wraps EasyOCR, Tesseract, RapidOCR, or macOS OCR. OCR only runs when needed (scanned pages, bitmap regions).
Office formats — Uses python-docx, python-pptx, openpyxl
HTML parsing — Uses BeautifulSoup

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Not just text—a structured document object
doc = result.document

# Export preserves structure
markdown = doc.export_to_markdown()
json_data = doc.export_to_dict()

The 2025 Update: Granite-Docling

In September 2025, IBM released Granite-Docling-258M, a compact vision-language model that converts a document page end to end in a single pass. At only 258 million parameters, IBM reports accuracy that rivals systems several times its size.

This matters because it gives you a choice:

Pipeline mode (default): Multiple specialized models (layout → tables → OCR → assembly). Faithful extraction—text comes from the PDF, not generated. Best when accuracy and provenance matter.

VLM mode (optional): Single model looks at the page image and outputs structured text directly. Faster, potentially better for messy handwritten documents, but with hallucination risk.

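from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Use the VLM pipeline for end-to-end conversion of PDF inputs
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline),
    }
)
result = converter.convert("scanned_handwritten.pdf")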

The new model also adds experimental multilingual support (Arabic, Chinese, Japanese) and better equation handling. It became the #1 trending model on Hugging Face when released—one of the few small VLMs to hit that milestone.

Docling vs. Marker: Which One?

The other major open-source option is Marker, built on the Surya vision toolkit from Datalab.

Surya is the engine—detects text lines, layout boxes, reading order. Doesn't output Markdown directly.

Marker is the full pipeline—uses Surya internally, plus Texify for math equations, to produce clean Markdown or JSON.

| Aspect | Docling (IBM) | Marker (Datalab) |
| --- | --- | --- |
| License | MIT | GPL (code), separate terms for model weights |
| Strength | Structure preservation (parent-child hierarchy) | Math/LaTeX handling via Texify |
| OCR | EasyOCR/Tesseract (reliable) | Surya (faster, 90+ languages) |
| Best for | RAG pipelines needing structured chunking | Scientific papers with heavy math |
| New direction | Granite-Docling VLM | Chandra (end-to-end page model) |

The practical difference: Docling excels at preserving document hierarchy—knowing that "this paragraph belongs to section 2.3 under chapter 4." That's crucial for RAG chunking. Marker excels at rendering complex equations as LaTeX.

For most business documents (invoices, contracts, reports), Docling is likely the better fit. For scientific papers with lots of math, Marker has the edge.

The licensing difference may matter more than model quality: MIT (Docling) vs GPL (Marker) could determine which you can use commercially.

When to Use Docling

Use Docling when:

You need layout, tables, reading order, and provenance
You're building RAG and need intelligent chunking by document structure
You process multiple formats (PDF, DOCX, PPTX, HTML)
You want reproducible, non-hallucinated extraction
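Structure-aware chunking could look like the sketch below, which pairs each chunk's text with the heading path it sits under and the pages it came from. It assumes a Docling chunker such as `HierarchicalChunker` from `docling.chunking`, whose `chunk(dl_doc=...)` yields chunks with `text`, `meta.headings`, and `meta.doc_items`. `chunk_records` and `chunk_file` are hypothetical names, and the metadata fields may vary by version.

```python
from typing import Any, Dict, List


def chunk_records(chunker: Any, doc: Any) -> List[Dict[str, Any]]:
    """Turn chunker output into records ready for embedding and indexing."""
    records = []
    for chunk in chunker.chunk(dl_doc=doc):
        headings = list(getattr(chunk.meta, "headings", None) or [])
        # Collect every page that contributed an element to this chunk.
        pages = sorted(
            {
                prov.page_no
                for item in getattr(chunk.meta, "doc_items", None) or []
                for prov in getattr(item, "prov", None) or []
            }
        )
        records.append(
            {
                "heading_path": " > ".join(headings),
                "pages": pages,
                "text": chunk.text,
            }
        )
    return records


def chunk_file(path: str) -> List[Dict[str, Any]]:
    """Convert a file and chunk it along its section structure."""
    # Imported lazily so chunk_records can be exercised without Docling installed.
    from docling.chunking import HierarchicalChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(path).document
    return chunk_records(HierarchicalChunker(), doc)
```

Storing the heading path and page numbers alongside each chunk is what lets a retrieval hit be cited back to a section and page instead of an anonymous span of text.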

Docling is overkill when:

You only need raw text from simple, single-column documents
Your documents are already clean (born-digital, simple layouts)
You have a working pipeline and don't need structure

The Bottom Line

Docling is "mostly glue, but it's the kind of glue you'd be annoyed to recreate." It solves the frustrating edge cases in document conversion: reading order that doesn't scramble multi-column layouts, tables that stay tables, and provenance so you can trace extracted facts back to their source.

With 37k+ GitHub stars and a home in the LF AI & Data Foundation (part of the Linux Foundation), it has become a de facto open-source choice for structured document-to-data workflows. Whether that complexity is worth it depends on whether your documents actually need structure preservation, or whether raw text extraction would suffice.

For RAG applications where chunking quality matters, Docling is worth the dependency. For simple text extraction, it's more than you need.

If You Only Remember Three Things

Docling is not just OCR. The useful part is structure, provenance, and sane document-to-markdown conversion.
The right comparison is Docling vs. another document pipeline, not Docling vs. plain text extraction. If all you need is raw text, it is overkill.
Pipeline vs. VLM is now a real design choice. Granite-Docling makes that explicit instead of hiding it.

What I'd Do In Practice

If the workflow is RAG, search, or knowledge ingestion, I would start with Docling before trying to invent a document parsing stack from scratch.

If the workflow is invoice extraction, form processing, or anything with explicit schemas and validation, I would treat Docling as one component inside a larger pipeline, not the whole answer.
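One way to picture "one component inside a larger pipeline": Docling produces structured text and tables, a separate extraction step (not shown) maps them to fields, and a validation layer rejects records that break the schema. The sketch covers only that last layer. The `Invoice` fields and the `validate_invoice` helper are hypothetical and not part of Docling.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Invoice:
    """Hypothetical target schema for an extracted invoice."""
    number: str
    issued: date
    total: Decimal


def validate_invoice(fields: Dict[str, str]) -> Tuple[Optional[Invoice], List[str]]:
    """Check raw extracted strings against the schema; return (invoice, errors)."""
    errors: List[str] = []

    number = fields.get("number", "").strip()
    if not number:
        errors.append("missing invoice number")

    issued = None
    try:
        issued = date.fromisoformat(fields.get("issued", ""))
    except ValueError:
        errors.append("issued date is not in YYYY-MM-DD format")

    total = None
    try:
        total = Decimal(fields.get("total", ""))
    except InvalidOperation:
        errors.append("total is not a decimal number")
    else:
        if not total.is_finite() or total < 0:
            errors.append("total must be a finite, non-negative amount")

    if errors:
        return None, errors
    return Invoice(number=number, issued=issued, total=total), []
```

Records that fail validation can be routed to manual review together with the page and bounding box Docling recorded for the source region.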

That is the recurring theme in document AI: tools look more general in demos than they feel in production.

Resources

Docling Official Website — Documentation and getting started
GitHub Repository — Source code (37k+ stars)
Granite-Docling on Hugging Face — The new VLM
Technical Report (arXiv) — Architecture deep dive
Marker — The main alternative