ChartQA
Models: 24
Top score: 90.8
Median: 85.5
ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to evaluate models' abilities in visual and logical reasoning over charts.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
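The "best score to date" line described above is a running maximum over models ordered by release date. A minimal sketch of that idea follows. The function name `best_to_date` is hypothetical, and the dates and scores in `example_points` are placeholders for illustration only, not values from this leaderboard.

```python
from datetime import date


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in chronological order."""
    frontier = []
    best = float("-inf")
    # Sort by release date so the running maximum follows the timeline.
    for released, score in sorted(points, key=lambda p: p[0]):
        best = max(best, score)
        frontier.append((released, best))
    return frontier


# Placeholder points (NOT real release dates or scores).
example_points = [
    (date(2024, 3, 1), 70.0),
    (date(2024, 1, 1), 60.0),
    (date(2024, 6, 1), 65.0),
    (date(2024, 9, 1), 80.0),
]

frontier = best_to_date(example_points)
for released, best in frontier:
    print(released.isoformat(), best)
```

A model that scores below the current best (the third placeholder point here) still appears as a point, but the line stays flat until a later model exceeds it.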
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Claude 3.5 Sonnet | 90.8 |
| 2 | Llama 4 Maverick | 90 |
| 3 | Qwen2.5 VL 72B Instruct | 89.5 |
| 4 | Nova Pro | 89.2 |
| 5 | Llama 4 Scout | 88.8 |
| 6 | Qwen2-VL-72B-Instruct | 88.3 |
| 7 | Pixtral Large | 88.1 |
| 8 | Mistral Small 3.2 24B Instruct | 87.4 |
| 9 | Qwen2.5 VL 7B Instruct | 87.3 |
| 10 | Nova Lite | 86.8 |
| 11 | DeepSeek VL2 | 86 |
| 12 | GPT-4o | 85.7 |
| 13 | Llama 3.2 90B Instruct | 85.5 |
| 14 | Qwen2.5-Omni-7B | 85.3 |
| 15 | DeepSeek VL2 Small | 84.5 |
| 16 | Llama 3.2 11B Instruct | 83.4 |
| 17 | Pixtral-12B | 81.8 |
| 18 | Phi-3.5-vision-instruct | 81.8 |
| 19 | Phi-4-multimodal-instruct | 81.4 |
| 20 | DeepSeek VL2 Tiny | 81 |
| 21 | Gemma 3 27B | 78 |
| 22 | Grok-1.5V | 76.1 |
| 23 | Gemma 3 12B | 75.7 |
| 24 | Gemma 3 4B | 68.8 |
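
The summary figures at the top of the page (model count, top score, median) can be re-derived from the ranking table. The sketch below hard-codes the 24 scores listed above. Which median convention the page uses is not stated, so the use of `statistics.median_low` here is an assumption, shown alongside the conventional `statistics.median` for comparison.

```python
import statistics

# Scores copied from the ranking table, in rank order.
scores = [
    90.8, 90, 89.5, 89.2, 88.8, 88.3, 88.1, 87.4,
    87.3, 86.8, 86, 85.7, 85.5, 85.3, 84.5, 83.4,
    81.8, 81.8, 81.4, 81, 78, 76.1, 75.7, 68.8,
]

n_models = len(scores)
top_score = max(scores)

# With an even count there are two middle values (85.5 and 85.7).
# median_low returns the lower one; median averages the two.
median_low = statistics.median_low(scores)
median_mean = statistics.median(scores)

print(f"Models: {n_models}")
print(f"Top score: {top_score}")
print(f"Median (lower middle value): {median_low}")
print(f"Median (mean of middle values): {median_mean:.1f}")
```

With 24 models, the conventional median averages the two middle scores and comes to about 85.6. The reported 85.5 matches the lower middle value, which is consistent with, but does not confirm, that convention.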