HunyuanOCR

Overview

HunyuanOCR is Tencent's OCR-specialized vision-language model, which achieves state-of-the-art performance with only about 1 billion parameters. It combines a 0.4B native-resolution ViT with a 0.5B Hunyuan language model through an MLP adapter.

The model is designed end to end: a single instruction and a single inference pass produce the final result, which makes it more efficient than traditional cascaded pipelines. Its native-resolution encoder supports arbitrary input resolutions through adaptive patching, preserving the original aspect ratio in challenging scenarios such as long-text documents.
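To make the idea of adaptive patching concrete, here is a minimal sketch of how a patch grid could be chosen so that the aspect ratio is kept. It is an illustration only, not HunyuanOCR's actual preprocessing: the function name `patch_grid`, the 16-pixel patch size, and the 4096-patch budget are all assumptions made for this example.

```python
import math

def patch_grid(width, height, patch=16, max_patches=4096):
    """Return (cols, rows) of a patch grid that follows the image's own shape.

    Hypothetical helper: `patch` and `max_patches` are illustrative defaults,
    not values taken from HunyuanOCR.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    # Patches needed to cover the image at its original resolution.
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    if cols * rows <= max_patches:
        return cols, rows

    # Too many patches: shrink both sides by the same factor so the
    # aspect ratio is (approximately) unchanged.
    scale = math.sqrt(max_patches / (cols * rows))
    cols = max(1, math.floor(cols * scale))
    rows = max(1, math.floor(rows * scale))

    # Extremely thin images can still exceed the budget after the
    # one-patch minimum is applied, so clamp each side once more.
    cols = max(1, min(cols, max_patches // rows))
    rows = max(1, min(rows, max_patches // cols))
    return cols, rows
```

With these assumed defaults, a 640x480 photo keeps its full 40x30 grid, while a long 320x6400 document page is scaled down uniformly to a tall, narrow grid instead of being squashed into a fixed square input.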

Trained on more than 200M image-text pairs across 9 scenarios (documents, street views, handwriting, invoices, etc.), HunyuanOCR won first place in Track 2.2 of the ICDAR 2025 DIMT competition and achieves results comparable to Qwen3-VL-235B on photo translation tasks.

Strengths

SOTA on OCRBench for models under 3B parameters (score: 860)

94.1 on OmniDocBench for complex document parsing

Support for 100+ languages, including handwriting and art text

Native resolution support preserves aspect ratios

End-to-end photo translation for 14 languages

Lightweight 1B-parameter design enables efficient deployment
