InternVL 3.5
by OpenGVLab
Leading open-source multimodal model with competitive document, chart, and OCR understanding
Overview
InternVL 3.5 is OpenGVLab's open-source multimodal model series, positioned as an alternative to GPT-4o for multimodal perception and reasoning. The series spans 2B to 78B parameters, giving options for a range of deployment scenarios.
The model reports competitive results across nine document understanding benchmarks (AI2D, ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench, SEED-2-Plus, CharXiv, VCR), outperforming other open-source models and many closed-source ones.
InternVL3 integrates Variable Visual Position Encoding (V2PE), which advances the position index by a smaller step for visual tokens than for text tokens, for better long-context understanding. The pre-training corpus covers diverse domains including OCR, charts, documents, mathematics, knowledge grounding, and multi-turn dialogue.
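The idea behind V2PE can be illustrated with a small sketch. It assumes that text tokens advance the position index by 1 while visual tokens advance it by a smaller step `delta`. The function name, the `"text"`/`"image"` labels, and the value of `delta` are hypothetical and chosen for illustration only; this is not the model's actual implementation.

```python
from fractions import Fraction


def v2pe_position_ids(token_types, delta=Fraction(1, 4)):
    """Illustrative position assignment: text tokens step by 1, visual tokens by `delta`.

    `token_types` is a sequence of "text" or "image" labels (hypothetical names).
    The first token always gets position 0.
    """
    positions = []
    current = Fraction(0)
    for i, kind in enumerate(token_types):
        if i > 0:
            # The step size depends on the type of the token being placed.
            current += 1 if kind == "text" else delta
        positions.append(current)
    return positions


# Two text tokens, four visual tokens, then one more text token.
tokens = ["text", "text", "image", "image", "image", "image", "text"]
positions = v2pe_position_ids(tokens)
print([float(p) for p in positions])
```

With a step of 1/4, the four visual tokens together consume one unit of position range, so the final text token sits at position 3 instead of 6. Image-heavy inputs therefore use up the positional range more slowly, which is what helps with long contexts.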
Strengths
- Competitive with GPT-4o on document understanding
- 92.7 on DocVQA (8B model)
- Strong chart and table understanding
- Variable Visual Position Encoding for long contexts
- Multiple sizes from 2B to 78B
- MIT license for commercial use
Limitations
- Struggles with rarely seen text types (dot-matrix print, CAPTCHAs)
- Larger models require significant resources
- Complex mathematical reasoning remains challenging
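To put the resource requirement in rough numbers, the sketch below estimates the memory needed for the model weights alone at different precisions. It is a back-of-the-envelope calculation: the helper name is hypothetical, and the figures exclude activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(params_billion, bytes_per_param=2.0):
    """Approximate memory for weights only, in GB (1 GB = 10^9 bytes).

    One billion parameters at one byte each is 1 GB, so the estimate is
    simply parameter count (in billions) times bytes per parameter.
    """
    return params_billion * bytes_per_param


# Sizes mentioned on this page: the 2B and 78B ends of the range, plus 8B.
for size in (2, 8, 78):
    fp16 = weight_memory_gb(size, 2.0)   # 16-bit weights (fp16 / bf16)
    int4 = weight_memory_gb(size, 0.5)   # 4-bit quantized weights
    print(f"{size}B parameters: ~{fp16:.0f} GB at 16-bit, ~{int4:.0f} GB at 4-bit")
```

By this estimate the 78B model needs about 156 GB for 16-bit weights, which implies multiple accelerators, while the 2B and 8B models fit on a single GPU.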
Best Use Cases
- Document question answering
- Chart and table extraction
- Multi-image understanding
- GUI agents and tool usage
- Industrial image analysis