DocVQA

Models: 26
Top score: 96.4
Median: 93.2

DocVQA is a dataset for visual question answering on document images, containing 50,000 questions defined on more than 12,000 document images. The benchmark tests a model's ability to understand document structure and content: answering a question requires interpreting the page layout and retrieving the relevant information from the image.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: DocVQA score by model release date, 2024 to 2025.

Models that set a new best score to date:
Grok-1.5: 85.6 (2024-03-28)
GPT-4o: 92.8 (2024-05-13)
Claude 3.5 Sonnet: 95.2 (2024-06-20)
Qwen2.5 VL 7B Instruct: 95.7 (2025-01-26)
Qwen2.5 VL 72B Instruct: 96.4 (2025-02-01)

Other models:
Grok-1.5V: 85.6 (2024-04-12)
Grok-2: 93.6 (2024-08-13)
Grok-2 mini: 93.2 (2024-08-13)
Pixtral-12B: 90.7 (2024-09-17)
Llama 3.2 90B Instruct: 90.1 (2024-09-25)
Llama 3.2 11B Instruct: 88.4 (2024-09-25)
Pixtral Large: 93.3 (2024-11-19)
Nova Pro: 93.5 (2024-11-20)
Nova Lite: 92.4 (2024-11-20)
DeepSeek VL2: 93.3 (2024-12-13)
DeepSeek VL2 Small: 92.3 (2024-12-13)
DeepSeek VL2 Tiny: 88.9 (2024-12-13)
Phi-4-multimodal-instruct: 93.2 (2025-02-01)
Qwen2.5 VL 32B Instruct: 94.8 (2025-02-28)
Gemma 3 27B: 86.6 (2025-03-12)
Gemma 3 12B: 87.1 (2025-03-13)
Gemma 3 4B: 75.8 (2025-03-13)
Qwen2.5-Omni-7B: 95.2 (2025-03-27)
Llama 4 Scout: 94.4 (2025-04-05)
Llama 4 Maverick: 94.4 (2025-04-05)
Mistral Small 3.2 24B Instruct: 94.9 (2025-06-20)
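
The "best score to date" line can be read as a running maximum over models sorted by release date. The sketch below is one plausible way to compute it, not the site's actual implementation. It uses a subset of the scores and dates listed on this page. The function name `best_to_date` and the rule that only a strict improvement counts as a new record are assumptions made for illustration.

```python
from datetime import date

# (model, score, release date) triples taken from this page; a subset for brevity.
points = [
    ("Grok-1.5", 85.6, date(2024, 3, 28)),
    ("Grok-1.5V", 85.6, date(2024, 4, 12)),
    ("GPT-4o", 92.8, date(2024, 5, 13)),
    ("Claude 3.5 Sonnet", 95.2, date(2024, 6, 20)),
    ("Grok-2", 93.6, date(2024, 8, 13)),
    ("Qwen2.5 VL 7B Instruct", 95.7, date(2025, 1, 26)),
    ("Qwen2.5 VL 72B Instruct", 96.4, date(2025, 2, 1)),
    ("Qwen2.5-Omni-7B", 95.2, date(2025, 3, 27)),
]


def best_to_date(points):
    """Return the models that strictly improved on the best score so far.

    Assumption: a tie with the current best does not count as a new record.
    """
    frontier = []
    best = float("-inf")
    # Walk the models in release order and keep a running maximum.
    for model, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            frontier.append((model, score, released))
    return frontier


frontier = best_to_date(points)
for model, score, released in frontier:
    print(f"{released.isoformat()}  {score:5.1f}  {model}")
```

Under the strict-improvement assumption, Grok-1.5V (which ties Grok-1.5 at 85.6) and later models that score below the current record do not appear on the line.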

Ranking

1. Qwen2.5 VL 72B Instruct: 96.4
2. Qwen2.5 VL 7B Instruct: 95.7
3. Qwen2.5-Omni-7B: 95.2
4. Claude 3.5 Sonnet: 95.2
5. Mistral Small 3.2 24B Instruct: 94.9
6. Qwen2.5 VL 32B Instruct: 94.8
7. Llama 4 Scout: 94.4
8. Llama 4 Maverick: 94.4
9. Grok-2: 93.6
10. Nova Pro: 93.5
11. DeepSeek VL2: 93.3
12. Pixtral Large: 93.3
13. Phi-4-multimodal-instruct: 93.2
14. Grok-2 mini: 93.2
15. GPT-4o: 92.8
16. Nova Lite: 92.4
17. DeepSeek VL2 Small: 92.3
18. Pixtral-12B: 90.7
19. Llama 3.2 90B Instruct: 90.1
20. DeepSeek VL2 Tiny: 88.9
21. Llama 3.2 11B Instruct: 88.4
22. Gemma 3 12B: 87.1
23. Gemma 3 27B: 86.6
24. Grok-1.5V: 85.6
25. Grok-1.5: 85.6
26. Gemma 3 4B: 75.8
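
The headline figures for this benchmark (model count, top score, median) can be recomputed from the ranking as a sanity check. This is a minimal sketch using only Python's standard library; the `summary` dictionary and its keys are illustrative names, not part of any leaderboard API.

```python
from statistics import median

# Scores copied from the ranking, in rank order.
scores = [
    96.4, 95.7, 95.2, 95.2, 94.9, 94.8, 94.4, 94.4, 93.6, 93.5,
    93.3, 93.3, 93.2, 93.2, 92.8, 92.4, 92.3, 90.7, 90.1, 88.9,
    88.4, 87.1, 86.6, 85.6, 85.6, 75.8,
]

summary = {
    "models": len(scores),
    "top_score": max(scores),
    # With an even number of scores, the median is the mean of the two middle values.
    "median": median(scores),
}
print(summary)
```

With 26 entries, the median is the average of the 13th and 14th scores, which are both 93.2.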

Related Multimodal benchmarks