HunyuanOCR

by Tencent

Open SourceSelf-HostedApache-2.0

1B parameter end-to-end OCR VLM achieving SOTA on OCRBench with native resolution support

OCRTable ExtractionData Extraction

Overview

HunyuanOCR is Tencent's expert OCR vision-language model that achieves state-of-the-art performance with only 1 billion parameters. It combines a 0.4B native resolution ViT with a 0.5B Hunyuan language model through an MLP adapter.

The model embraces end-to-end design, achieving top-tier results with a single instruction and inference—superior efficiency over traditional cascade solutions. Its native resolution encoder supports arbitrary input resolutions through adaptive patching, preserving original aspect ratios for challenging scenarios like long-text documents.

Trained on 200M+ image-text pairs across 9 scenarios (documents, street views, handwriting, invoices, etc.), HunyuanOCR won first place in Track 2.2 of the ICDAR 2025 DIMT competition and achieves comparable results to Qwen3-VL-235B in photo translation tasks.

Strengths

  • SOTA on OCRBench for models under 3B parameters (score: 860)
  • 94.1 on OmniDocBench for complex document parsing
  • 100+ language support including handwriting and art text
  • Native resolution support preserves aspect ratios
  • End-to-end photo translation for 14 languages
  • Lightweight 1B parameters enables efficient deployment

Limitations

  • Newer model with smaller community than established tools
  • May require fine-tuning for specialized domains
  • Limited to VLM inference requirements

Best Use Cases

  • Complex document parsing with tables and formulas
  • Street view and scene text recognition
  • Invoice and business card extraction
  • Video subtitle extraction
  • Multi-language document translation