olmOCR 2

by Allen AI

Open Source · Self-Hosted · Apache-2.0

Research-focused 7B OCR model from Allen AI optimized for scientific documents. Scores 80.4 on olmOCR-Bench.

OCR · Layout Analysis

Overview

olmOCR 2 is Allen AI's open OCR model for extracting text from scientific papers, technical documents, and research materials. Despite the name, it is not built on Allen AI's OLMo language models: it is a vision-language model fine-tuned from Qwen2.5-VL-7B-Instruct, combining visual understanding with strong text generation capabilities.

The model, listed variously as 7B or 8B parameters, handles the complex layouts common in academic papers: multi-column text, inline equations, figures with captions, and bibliographies. olmOCR-2-8B scores 80.4 on olmOCR-Bench, placing it among the top PDF linearization systems.
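To make "PDF linearization" concrete, here is a minimal sketch of the problem: turning blocks scattered across a multi-column page into a single reading-order sequence. This is a naive geometric heuristic for illustration only, not how olmOCR works (olmOCR generates text with a vision-language model). The `Block` type, the `linearize` function, and the fixed equal-width column assumption are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A piece of text with the position of its top-left corner (hypothetical type)."""
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, growing downward


def linearize(blocks, page_width, num_columns=2):
    """Return block texts in reading order: column by column, top to bottom.

    Assumes equal-width columns, which real papers often violate
    (full-width titles, figures spanning columns, footnotes).
    """
    col_width = page_width / num_columns

    def reading_key(block):
        # Assign each block to a column by its left edge, clamped to the last column.
        col = min(int(block.x // col_width), num_columns - 1)
        return (col, block.y)

    return [b.text for b in sorted(blocks, key=reading_key)]


page = [
    Block("right column, top", x=320, y=50),
    Block("left column, bottom", x=40, y=400),
    Block("left column, top", x=40, y=50),
]
print(linearize(page, page_width=600))
```

The heuristic breaks down on exactly the layouts named above (inline equations, captions, spanning figures), which is why benchmark scores for linearization systems vary so much on academic papers.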

Allen AI also created olmOCR-Bench, a benchmark for evaluating PDF-to-text quality. In keeping with its commitment to open research, olmOCR is fully open source under the permissive Apache-2.0 license.

Strengths

  • Optimized for scientific documents
  • Strong multi-column layout handling
  • Good equation and formula recognition
  • Built on the Qwen2.5-VL vision-language foundation
  • Fully open source under the permissive Apache-2.0 license

Limitations

  • 7B model requires significant GPU memory (about 14 GB for the weights alone at 16-bit precision)
  • May be overkill for simple documents
  • Less tested on non-academic content
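
As a rough guide to the GPU memory limitation, the weights alone take about (parameter count × bytes per parameter). The sketch below works through that arithmetic for the 7B and 8B figures quoted on this page. The function name is hypothetical, and the estimate deliberately ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to hold the model weights, in GiB.

    bytes_per_param: 2 for fp16/bf16, 1 for int8, 4 for fp32.
    Excludes activations, KV cache, and runtime overhead.
    """
    return num_params * bytes_per_param / 1024**3


for label, params in [("7B", 7e9), ("8B", 8e9)]:
    for dtype, nbytes in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]:
        gib = weight_memory_gib(params, nbytes)
        print(f"{label} @ {dtype}: ~{gib:.1f} GiB for weights")
```

At 16-bit precision a 7B model comes to roughly 13 GiB (14 GB) of weights, so a 16 GB card leaves little headroom for page images and generated text.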

Best Use Cases

  • Scientific paper digitization
  • Research document processing
  • Academic archive conversion
  • Technical report extraction