Marker
by Datalab
End-to-end document conversion pipeline that transforms PDFs and images into clean Markdown, JSON, or HTML. Built on Surya with deterministic layout parsing.
Overview
Marker is a complete document conversion pipeline from Datalab that turns PDFs and images into structured Markdown, JSON, or HTML. It builds on Surya for OCR and adds deterministic parsing for tables, equations, code blocks, and other document elements.
Unlike general-purpose OCR tools, Marker is designed for knowledge pipelines: it produces clean, machine-readable output that works well with LLMs and RAG systems. Because parsing is deterministic rather than generative, the structured output preserves document hierarchy without hallucinated content.
With ~25,000 GitHub stars (combined with Surya), Marker has become a popular choice for teams building document processing infrastructure.
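One way a RAG pipeline might use the preserved hierarchy is to split the Markdown output on its headings, so each chunk carries the section path it came from. The sketch below is not part of Marker and does not call its API. It uses only the Python standard library, assumes the Markdown uses `#`-style headings, and the names `Chunk` and `chunk_markdown` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Matches Markdown headings such as "## Methods" (levels 1 to 6).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker; '#' lines inside fences are not headings


@dataclass
class Chunk:
    heading_path: List[str]  # e.g. ["Paper Title", "Methods", "Data"]
    text: str                # body text found under that heading


def chunk_markdown(markdown: str) -> List[Chunk]:
    """Split Markdown into chunks, one per heading, keeping the heading path."""
    chunks: List[Chunk] = []
    path: List[Tuple[int, str]] = []  # stack of (level, title)
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk([title for _, title in path], text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before pushing.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each resulting `Chunk` can then be embedded together with its `heading_path`, which gives the retriever section context that flat OCR text would not provide.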
Strengths
- Complete PDF-to-Markdown pipeline
- Handles tables, equations, and code blocks
- No hallucination (deterministic output)
- Well-suited for RAG and LLM workflows
- Large community and active development
Limitations
- GPL license imposes copyleft obligations that can restrict commercial use
- Requires GPU for reasonable performance
- Heavier than pure OCR tools
Best Use Cases
- PDF to Markdown conversion
- RAG document preprocessing
- Knowledge base ingestion
- Research paper digitization