HunyuanOCR
by Tencent
1B-parameter end-to-end OCR VLM with native-resolution support, achieving SOTA on OCRBench among models under 3B parameters
Overview
HunyuanOCR is Tencent's expert OCR vision-language model that achieves state-of-the-art performance with only about 1 billion parameters. It combines a 0.4B native-resolution ViT with a 0.5B Hunyuan language model through an MLP adapter.
The model follows an end-to-end design: it achieves top-tier results with a single instruction and a single inference pass, which is more efficient than traditional cascade solutions. Its native-resolution encoder supports arbitrary input resolutions through adaptive patching, preserving original aspect ratios in challenging scenarios such as long-text documents.
Trained on 200M+ image-text pairs across 9 scenarios (documents, street views, handwriting, invoices, etc.), HunyuanOCR won first place in Track 2.2 of the ICDAR 2025 DIMT competition and achieves results comparable to Qwen3-VL-235B on photo translation tasks.
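To make the idea of adaptive patching concrete, the sketch below computes a patch grid that follows an image's own width and height instead of resizing it to a fixed square. It is an illustration of the general technique only, not HunyuanOCR's actual preprocessing: the function name `patch_grid`, the patch size of 16 pixels, and the budget of 4096 patches are all assumptions chosen for the example.

```python
import math


def patch_grid(width, height, patch=16, max_patches=4096):
    """Return (cols, rows, scale) for an aspect-ratio-preserving patch grid.

    `patch` and `max_patches` are illustrative assumptions, not values
    taken from HunyuanOCR.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if patch <= 0 or max_patches <= 0:
        raise ValueError("patch and max_patches must be positive")

    # Cover the image with patches at its native resolution.
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    scale = 1.0

    if cols * rows > max_patches:
        # Shrink both sides by the same factor so the aspect ratio is kept.
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, min(math.floor(rows * scale), max_patches // cols))
        # Guard for extreme aspect ratios where one side was clamped to 1.
        cols = min(cols, max_patches // rows)

    return cols, rows, scale


if __name__ == "__main__":
    # A small photo fits the budget and keeps its native grid.
    print(patch_grid(640, 480))
    # A long, narrow document page is scaled down uniformly.
    print(patch_grid(512, 4096))
```

In this sketch a long page keeps its tall, narrow grid (more rows than columns) rather than being squashed into a square, which is the property that matters for long-text documents.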
Strengths
- SOTA on OCRBench for models under 3B parameters (score: 860)
- 94.1 on OmniDocBench for complex document parsing
- 100+ language support including handwriting and art text
- Native resolution support preserves aspect ratios
- End-to-end photo translation for 14 languages
- Lightweight 1B-parameter size enables efficient deployment
Limitations
- Newer model with a smaller community than established tools
- May require fine-tuning for specialized domains
- Subject to the usual requirements of VLM inference
Best Use Cases
- Complex document parsing with tables and formulas
- Street view and scene text recognition
- Invoice and business card extraction
- Video subtitle extraction
- Multi-language document translation