DocVQA
26 Models
96.4 Top score
93.2 Median
DocVQA is a Visual Question Answering dataset of 50,000 questions defined on 12,000+ document images. The benchmark tests a model's ability to understand both the structure and the content of a document: to answer a question, the model must interpret the page layout and retrieve the relevant information from the document image.
State of the art over time
Chart: each point is a model at its release date; the line traces the best score to date.
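The "best score to date" line is a running maximum over models ordered by release date. The following is a minimal sketch of that computation in Python, not the page's actual plotting code. The source lists no release dates, so the points below are invented placeholders (not real models or DocVQA scores), and `best_to_date` is a hypothetical helper name.

```python
from datetime import date
from itertools import accumulate


def best_to_date(points):
    """Given (release_date, score) pairs, return (release_date, best_so_far)
    pairs in chronological order."""
    ordered = sorted(points)  # sort by date (ties broken by score)
    # Running maximum of the scores in chronological order.
    running = accumulate((score for _, score in ordered), max)
    return [(day, best) for (day, _), best in zip(ordered, running)]


# Placeholder data: NOT real release dates or DocVQA scores.
points = [
    (date(2024, 3, 1), 85.0),
    (date(2024, 1, 1), 80.0),
    (date(2024, 6, 1), 83.0),
    (date(2024, 9, 1), 90.0),
]

frontier = best_to_date(points)
for day, best in frontier:
    print(day.isoformat(), best)
```

A model that scores below the current best (the third placeholder point in chronological order) still appears as a point on the chart but does not move the line.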
Ranking
| Rank | Model | Score |
|---|---|---|
| 1 | Qwen2.5 VL 72B Instruct | 96.4 |
| 2 | Qwen2.5 VL 7B Instruct | 95.7 |
| 3 | Qwen2.5-Omni-7B | 95.2 |
| 4 | Claude 3.5 Sonnet | 95.2 |
| 5 | Mistral Small 3.2 24B Instruct | 94.9 |
| 6 | Qwen2.5 VL 32B Instruct | 94.8 |
| 7 | Llama 4 Scout | 94.4 |
| 8 | Llama 4 Maverick | 94.4 |
| 9 | Grok-2 | 93.6 |
| 10 | Nova Pro | 93.5 |
| 11 | DeepSeek VL2 | 93.3 |
| 12 | Pixtral Large | 93.3 |
| 13 | Phi-4-multimodal-instruct | 93.2 |
| 14 | Grok-2 mini | 93.2 |
| 15 | GPT-4o | 92.8 |
| 16 | Nova Lite | 92.4 |
| 17 | DeepSeek VL2 Small | 92.3 |
| 18 | Pixtral-12B | 90.7 |
| 19 | Llama 3.2 90B Instruct | 90.1 |
| 20 | DeepSeek VL2 Tiny | 88.9 |
| 21 | Llama 3.2 11B Instruct | 88.4 |
| 22 | Gemma 3 12B | 87.1 |
| 23 | Gemma 3 27B | 86.6 |
| 24 | Grok-1.5V | 85.6 |
| 25 | Grok-1.5 | 85.6 |
| 26 | Gemma 3 4B | 75.8 |
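The summary figures at the top of the page (26 models, top score 96.4, median 93.2) can be checked against the ranking. The sketch below uses only Python's standard library. The scores are copied from the table in rank order, and `summarize` is a hypothetical helper name introduced for illustration.

```python
from statistics import median

# Scores copied from the ranking table, in rank order.
scores = [
    96.4, 95.7, 95.2, 95.2, 94.9, 94.8, 94.4, 94.4, 93.6, 93.5,
    93.3, 93.3, 93.2, 93.2, 92.8, 92.4, 92.3, 90.7, 90.1, 88.9,
    88.4, 87.1, 86.6, 85.6, 85.6, 75.8,
]


def summarize(values):
    """Return (count, top score, median) for a list of scores."""
    return len(values), max(values), round(median(values), 1)


count, top, mid = summarize(scores)
print(f"{count} models, top score {top}, median {mid}")
```

With an even number of entries, the median is the mean of the two middle scores (ranks 13 and 14, both 93.2).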