ChartQA

Models: 24
Top score: 90.8
Median: 85.5

ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to evaluate models' abilities in visual and logical reasoning over charts.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: score (y-axis, 60 to 100) against release date (x-axis, 2024 to 2025). Points in release order:

Grok-1.5V: 76.1 (2024-04-12)
GPT-4o: 85.7 (2024-05-13)
Claude 3.5 Sonnet: 90.8 (2024-06-20)
Phi-3.5-vision-instruct: 81.8 (2024-08-23)
Qwen2-VL-72B-Instruct: 88.3 (2024-08-29)
Pixtral-12B: 81.8 (2024-09-17)
Llama 3.2 90B Instruct: 85.5 (2024-09-25)
Llama 3.2 11B Instruct: 83.4 (2024-09-25)
Pixtral Large: 88.1 (2024-11-19)
Nova Pro: 89.2 (2024-11-20)
Nova Lite: 86.8 (2024-11-20)
DeepSeek VL2: 86 (2024-12-13)
DeepSeek VL2 Small: 84.5 (2024-12-13)
DeepSeek VL2 Tiny: 81 (2024-12-13)
Qwen2.5 VL 7B Instruct: 87.3 (2025-01-26)
Qwen2.5 VL 72B Instruct: 89.5 (2025-02-01)
Phi-4-multimodal-instruct: 81.4 (2025-02-01)
Gemma 3 27B: 78 (2025-03-12)
Gemma 3 12B: 75.7 (2025-03-13)
Gemma 3 4B: 68.8 (2025-03-13)
Qwen2.5-Omni-7B: 85.3 (2025-03-27)
Llama 4 Maverick: 90 (2025-04-05)
Llama 4 Scout: 88.8 (2025-04-05)
Mistral Small 3.2 24B Instruct: 87.4 (2025-06-20)
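
The "best score to date" line described above behaves like a running maximum over the points sorted by release date. The sketch below illustrates that idea with a handful of the points listed on this page. The function name `best_to_date` and the strict `>` comparison are assumptions made for illustration; how the site itself handles ties or same-day releases is not stated.

```python
from datetime import date

# A few of the points listed on this page: (model, score, release date).
POINTS = [
    ("Phi-3.5-vision-instruct", 81.8, date(2024, 8, 23)),
    ("Qwen2-VL-72B-Instruct", 88.3, date(2024, 8, 29)),
    ("Llama 4 Maverick", 90.0, date(2025, 4, 5)),
    ("Grok-1.5V", 76.1, date(2024, 4, 12)),
    ("GPT-4o", 85.7, date(2024, 5, 13)),
    ("Claude 3.5 Sonnet", 90.8, date(2024, 6, 20)),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order.

    Hypothetical helper: a point counts as a new best only if it strictly
    beats every earlier score (an assumption, not the site's documented rule).
    """
    records = []
    best = float("-inf")
    for model, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            records.append((model, score, released))
    return records


for model, score, released in best_to_date(POINTS):
    print(f"{released.isoformat()}  {model}: {score}")
```

With these six points, only Grok-1.5V, GPT-4o, and Claude 3.5 Sonnet set a new best. Every later release in the sample scores below 90.8, so the line stays flat from 2024-06-20 onward.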

Ranking

1. Claude 3.5 Sonnet: 90.8
2. Llama 4 Maverick: 90
3. Qwen2.5 VL 72B Instruct: 89.5
4. Nova Pro: 89.2
5. Llama 4 Scout: 88.8
6. Qwen2-VL-72B-Instruct: 88.3
7. Pixtral Large: 88.1
8. Mistral Small 3.2 24B Instruct: 87.4
9. Qwen2.5 VL 7B Instruct: 87.3
10. Nova Lite: 86.8
11. DeepSeek VL2: 86
12. GPT-4o: 85.7
13. Llama 3.2 90B Instruct: 85.5
14. Qwen2.5-Omni-7B: 85.3
15. DeepSeek VL2 Small: 84.5
16. Llama 3.2 11B Instruct: 83.4
17. Pixtral-12B: 81.8
18. Phi-3.5-vision-instruct: 81.8
19. Phi-4-multimodal-instruct: 81.4
20. DeepSeek VL2 Tiny: 81
21. Gemma 3 27B: 78
22. Grok-1.5V: 76.1
23. Gemma 3 12B: 75.7
24. Gemma 3 4B: 68.8

Related Multimodal benchmarks