Marker

by Datalab

Open Source, Self-Hosted, GPL-3.0

End-to-end document conversion pipeline that transforms PDFs and images into clean Markdown, JSON, or HTML. Built on Surya with deterministic layout parsing.

OCR, Layout Analysis, Table Extraction, Document Conversion

Overview

Marker is a complete document conversion pipeline from Datalab that turns PDFs and images into structured output formats. It builds on Surya for OCR and adds deterministic parsing for tables, equations, code blocks, and other document elements.

Unlike general-purpose OCR tools, Marker is designed for knowledge pipelines: it produces clean, machine-readable output that works well with LLMs and RAG systems. Because parsing is deterministic rather than generative, the structured output preserves document hierarchy without hallucinated content.
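As an illustration of why hierarchy-preserving Markdown suits RAG preprocessing, the sketch below splits a Markdown string into chunks that each carry their heading trail. It is not part of Marker: the names `Chunk` and `chunk_markdown` are hypothetical, the input is assumed to be Markdown text that a converter such as Marker has already produced, and only ATX-style (`#`) headings are handled.

```python
import re
from dataclasses import dataclass

# ATX headings only ("# Title" through "###### Title"); an assumption of this sketch.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    path: tuple[str, ...]  # heading trail, e.g. ("Paper", "Methods")
    text: str              # body text found under that heading


def chunk_markdown(markdown: str) -> list[Chunk]:
    """Split Markdown into one chunk per heading, keeping the heading trail."""
    chunks: list[Chunk] = []
    trail: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the text collected so far under the current heading trail.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(tuple(title for _, title in trail), text))
        body.clear()

    for line in markdown.splitlines():
        # Track fenced code blocks so "# comment" lines are not read as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing the new one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk's `path` can be stored as metadata next to its embedding, so a retrieved passage can be traced back to its section of the source document.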

With ~25,000 GitHub stars (combined with Surya), Marker has become a popular choice for teams building document processing infrastructure.

Strengths

Complete PDF-to-Markdown pipeline
Handles tables, equations, and code blocks
No hallucination (deterministic output)
Well-suited for RAG and LLM workflows
Large community and active development

Limitations

GPL-3.0 copyleft terms complicate commercial and closed-source use
Requires GPU for reasonable performance
Heavier than pure OCR tools

Best Use Cases

PDF to Markdown conversion
RAG document preprocessing
Knowledge base ingestion
Research paper digitization