Surya

by Datalab

Open Source · Self-Hosted · GPL-3.0

Modern OCR toolkit with line-level text detection, text recognition in 90+ languages, and layout analysis. Reported to outperform Tesseract on accuracy and inference speed.

OCR · Layout Analysis

GitHub

Overview

Surya is a Python-based OCR toolkit designed for modern document understanding. It provides line-level text detection and recognition with support for over 90 languages, alongside layout analysis that identifies document elements like tables, images, and headers.

Reported benchmarks show Surya outperforming Tesseract in both accuracy and inference time, though results depend on hardware and document type. It is particularly strong on varied fonts, kerning, and document layouts where traditional OCR struggles.

Surya serves as the OCR backbone for Marker (also from Datalab), making it a key component in the open-source document processing ecosystem.

Strengths

90+ languages with strong multilingual support
Superior accuracy on varied fonts and layouts
Faster inference than Tesseract
Built-in layout analysis for document structure
Active development with frequent updates

Limitations

Needs a GPU for best performance; CPU inference works but is slower
GPL-3.0 copyleft terms may restrict use in closed-source commercial products
Higher memory footprint than lightweight OCR engines

Best Use Cases

Modern document digitization
Layout-aware text extraction
Preprocessing for document conversion pipelines
Academic and research document processing
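
To make the layout-aware text extraction use case more concrete, here is a minimal sketch of how line-level OCR results might be combined with layout regions. It does not call Surya and does not mirror its API: the `Box`, `TextLine`, and `Region` types, the `group_lines_by_region` helper, and the sample coordinates are all hypothetical stand-ins for whatever detection, recognition, and layout outputs a real pipeline produces.

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Hypothetical axis-aligned box in pixels: (x0, y0) top-left, (x1, y1) bottom-right.
    x0: float
    y0: float
    x1: float
    y1: float

    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass
class TextLine:
    # One recognized line of text with its bounding box (assumed OCR output shape).
    text: str
    box: Box


@dataclass
class Region:
    # One layout element such as a header or text block (assumed layout output shape).
    label: str
    box: Box


def group_lines_by_region(lines, regions):
    """Assign each line to the first region containing its center point.

    Returns (grouped, unassigned): grouped is a list of (label, [text, ...])
    in region order with lines sorted top-to-bottom, then left-to-right;
    unassigned holds the text of lines that fall outside every region.
    """
    buckets = [[] for _ in regions]
    unassigned = []
    for line in lines:
        cx, cy = line.box.center()
        for i, region in enumerate(regions):
            if region.box.contains(cx, cy):
                buckets[i].append(line)
                break
        else:
            unassigned.append(line.text)

    grouped = []
    for region, bucket in zip(regions, buckets):
        ordered = sorted(bucket, key=lambda l: (l.box.y0, l.box.x0))
        grouped.append((region.label, [l.text for l in ordered]))
    return grouped, unassigned


# Made-up sample data for illustration only.
regions = [
    Region("Header", Box(0, 0, 600, 80)),
    Region("Text", Box(0, 100, 600, 400)),
]
lines = [
    TextLine("second paragraph line", Box(20, 160, 580, 190)),
    TextLine("Quarterly Report", Box(20, 20, 400, 60)),
    TextLine("first paragraph line", Box(20, 120, 580, 150)),
    TextLine("page 3", Box(280, 760, 320, 780)),
]

grouped, unassigned = group_lines_by_region(lines, regions)
for label, texts in grouped:
    print(label, texts)
print("unassigned", unassigned)
```

Matching on the line's center point keeps the sketch short; a real pipeline would more likely use overlap area between boxes and a reading-order model for multi-column pages.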