Multimodal

DocVQA

Models: 26

Top score: 96.4

Median: 93.2

DocVQA is a visual question answering dataset of 50,000 questions posed over more than 12,000 document images. The benchmark tests a model's ability to understand document structure and content: to answer a question, the model must interpret the page layout and retrieve the relevant information from the document image.
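This page does not say how the scores are computed. DocVQA results are commonly reported as ANLS (Average Normalized Levenshtein Similarity), so the sketch below shows that metric as an assumption, not as the method behind the numbers listed here. The function names `levenshtein` and `anls`, the lowercase-and-strip normalization, and the 0.5 threshold are illustrative choices that follow the usual description of the metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best per-answer similarity.

    predictions: one predicted string per question.
    references: one list of accepted answers per question.
    A similarity counts only if the normalized distance is below `threshold`.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for answer in answers:
            g = answer.strip().lower()
            longest = max(len(p), len(g))
            distance = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    # Hypothetical predictions and answers, for illustration only.
    print(anls(["Invoice 123", "abc"], [["invoice 123"], ["xyz"]]))
```

Under this definition, an answer with a small OCR-style slip still earns partial credit, while an answer that differs in half or more of its characters scores zero. A leaderboard figure such as 96.4 would then correspond to the average multiplied by 100.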

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Ranking

Rank	Model	Organization	Score
1	Qwen2.5 VL 72B Instruct	Alibaba	96.4
2	Qwen2.5 VL 7B Instruct	Alibaba	95.7
3	Qwen2.5-Omni-7B	Alibaba	95.2
4	Claude 3.5 Sonnet	Anthropic	95.2
5	Mistral Small 3.2 24B Instruct	Mistral AI	94.9
6	Qwen2.5 VL 32B Instruct	Alibaba	94.8
7	Llama 4 Scout	Meta	94.4
8	Llama 4 Maverick	Meta	94.4
9	Grok-2	xAI	93.6
10	Nova Pro	Amazon	93.5
11	DeepSeek VL2	DeepSeek	93.3
12	Pixtral Large	Mistral AI	93.3
13	Phi-4-multimodal-instruct	Microsoft	93.2
14	Grok-2 mini	xAI	93.2
15	GPT-4o	OpenAI	92.8
16	Nova Lite	Amazon	92.4
17	DeepSeek VL2 Small	DeepSeek	92.3
18	Pixtral-12B	Mistral AI	90.7
19	Llama 3.2 90B Instruct	Meta	90.1
20	DeepSeek VL2 Tiny	DeepSeek	88.9
21	Llama 3.2 11B Instruct	Meta	88.4
22	Gemma 3 12B	Google	87.1
23	Gemma 3 27B	Google	86.6
24	Grok-1.5V	xAI	85.6
25	Grok-1.5	xAI	85.6
26	Gemma 3 4B	Google	75.8
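
The summary figures near the top of the page (26 models, top score 96.4, median 93.2) can be checked against the ranking. The short sketch below does that with the 26 scores copied from the list above. The names `scores` and `summarize` are illustrative and not part of any leaderboard tooling.

```python
from statistics import median

# Scores copied from the ranking, in rank order.
scores = [
    96.4, 95.7, 95.2, 95.2, 94.9, 94.8, 94.4, 94.4, 93.6, 93.5,
    93.3, 93.3, 93.2, 93.2, 92.8, 92.4, 92.3, 90.7, 90.1, 88.9,
    88.4, 87.1, 86.6, 85.6, 85.6, 75.8,
]


def summarize(values):
    """Return the model count, the best score, and the median score."""
    return {
        "models": len(values),
        "top": max(values),
        "median": median(values),
    }


summary = summarize(scores)

if __name__ == "__main__":
    print(summary)
```

With an even number of entries, the median is the mean of the 13th and 14th scores. Both are 93.2, which matches the figure reported on the page.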

Related Multimodal benchmarks

MMMU (52), MathVista (34), ChartQA (24), AI2D (17), MMMU-Pro (13)