MinerU

by OpenDataLab

Open Source · Self-Hosted · Apache-2.0

Open-source PDF extraction toolkit with high-quality conversion to Markdown and structured formats

OCR · Layout Analysis · Table Extraction · Document Conversion

Overview

MinerU is a comprehensive PDF extraction toolkit from OpenDataLab designed for converting documents into structured, machine-readable formats. It combines multiple models for OCR, layout analysis, and content understanding into a unified pipeline.

The 2.5 release (a 1.2B-parameter model) improved table extraction, formula recognition, and multi-column layout handling. MinerU outputs Markdown that preserves document structure, which makes it well suited to RAG pipelines and document processing workflows.
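To illustrate why structure-preserving Markdown helps with RAG preprocessing, here is a minimal sketch that splits a Markdown document into heading-delimited chunks. It does not call MinerU: the function name `chunk_markdown_by_heading` and the sample text are hypothetical, and the input is assumed to be Markdown that an extraction tool has already produced.

```python
import re
from typing import Dict, List

# Matches ATX-style Markdown headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading-delimited section.

    Hypothetical helper for illustration only. It does not handle '#'
    lines inside fenced code blocks, which a real pipeline should skip.
    """
    chunks: List[Dict[str, str]] = []
    title = ""            # heading of the section being collected
    body: List[str] = []  # lines belonging to that section

    def flush() -> None:
        text = "\n".join(body).strip()
        if title or text:  # skip the empty preamble before the first heading
            chunks.append({"title": title, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical sample standing in for extracted Markdown.
sample = "# Intro\nSome text.\n\n## Methods\n| a | b |\n|---|---|\n| 1 | 2 |\n"
chunks = chunk_markdown_by_heading(sample)
for chunk in chunks:
    print(chunk["title"], "->", len(chunk["text"]), "chars")
```

Because each chunk keeps its heading, a retriever can index the title alongside the text, and a table stays in the same chunk as the section that introduces it.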

It's positioned as an open-source alternative to commercial document AI services, with particular strength in academic and technical documents.

Strengths

  • High-quality PDF to Markdown conversion
  • Strong table and formula extraction
  • Multi-column layout support
  • Active development with frequent updates
  • Good for RAG preprocessing

Limitations

  • Requires GPU for optimal performance
  • Processing speed varies with document complexity
  • May struggle with heavily stylized documents

Best Use Cases

  • PDF to Markdown conversion
  • RAG pipeline preprocessing
  • Academic paper processing
  • Technical documentation extraction