MathVista

MathVista evaluates mathematical reasoning of foundation models in visual contexts.

Models: 34
Top score: 86.8
Median: 65.8

The benchmark consists of 6,141 examples drawn from 28 existing multimodal datasets and 3 newly created ones (IQTest, FunctionQA, and PaperQA). It combines diverse mathematical and visual tasks to assess how well models understand complex figures and reason rigorously about them.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: score (axis ticks 0, 23, 45, 68, 90) against release date (2023 to 2025). Points in date order; those labeled on the best-to-date line are marked.

GPT-3.5 Turbo: 0 (2023-03-01), best to date
Gemini 1.0 Pro: 46.6 (2024-02-15)
Gemini 1.5 Pro: 68.1 (2024-02-15), best to date
Gemini 1.5 Flash 8B: 54.7 (2024-03-15)
Grok-1.5: 52.8 (2024-03-28)
Grok-1.5V: 52.8 (2024-04-12)
Gemini 1.5 Flash: 65.8 (2024-05-01)
GPT-4o: 61.4 (2024-05-13)
Claude 3.5 Sonnet: 67.7 (2024-06-20)
GPT-4o-mini: 56.7 (2024-07-18)
Grok-2 mini: 68.1 (2024-08-13)
Grok-2: 69 (2024-08-13), best to date
Phi-3.5-vision-instruct: 43.9 (2024-08-23)
Pixtral-12B: 58 (2024-09-17)
Llama 3.2 90B Instruct: 57.3 (2024-09-25)
Llama 3.2 11B Instruct: 51.5 (2024-09-25)
Pixtral Large: 69.4 (2024-11-19), best to date
o1: 71.8 (2024-12-05), best to date
DeepSeek VL2: 62.8 (2024-12-13)
DeepSeek VL2 Small: 60.7 (2024-12-13)
DeepSeek VL2 Tiny: 53.6 (2024-12-13)
QvQ-72B-Preview: 71.4 (2024-12-25)
Kimi-k1.5: 74.9 (2025-01-20), best to date
Phi-4-multimodal-instruct: 62.4 (2025-02-01)
GPT-4.5: 72.3 (2025-02-27)
Qwen2.5-Omni-7B: 67.9 (2025-03-27)
Llama 4 Maverick: 73.7 (2025-04-05)
Llama 4 Scout: 70.7 (2025-04-05)
GPT-4.1 Mini: 73.1 (2025-04-14)
GPT-4.1: 72.2 (2025-04-14)
GPT-4.1 Nano: 56.2 (2025-04-14)
o4-mini: 84.3 (2025-04-16)
o3: 86.8 (2025-04-16), best to date
Mistral Small 3.2 24B Instruct: 67.1 (2025-06-20)
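
The "best score to date" line described above can be reproduced as a running maximum over the points ordered by release date. The following is a minimal sketch, not the page's actual method: the function name `best_to_date` is hypothetical, the data is a subset of the chart's points, and two choices are assumptions, namely that a model only joins the line when it strictly beats the previous best, and that among models released on the same day the higher score is considered first.

```python
from datetime import date

# (model, release date, score): a subset of the points from the chart.
POINTS = [
    ("GPT-3.5 Turbo", date(2023, 3, 1), 0.0),
    ("Gemini 1.0 Pro", date(2024, 2, 15), 46.6),
    ("Gemini 1.5 Pro", date(2024, 2, 15), 68.1),
    ("Claude 3.5 Sonnet", date(2024, 6, 20), 67.7),
    ("Grok-2 mini", date(2024, 8, 13), 68.1),
    ("Grok-2", date(2024, 8, 13), 69.0),
    ("Pixtral Large", date(2024, 11, 19), 69.4),
    ("o1", date(2024, 12, 5), 71.8),
    ("QvQ-72B-Preview", date(2024, 12, 25), 71.4),
    ("Kimi-k1.5", date(2025, 1, 20), 74.9),
    ("o4-mini", date(2025, 4, 16), 84.3),
    ("o3", date(2025, 4, 16), 86.8),
]


def best_to_date(points):
    """Return the points that set a new best score, in date order."""
    # Assumption: sort by date, then by descending score, so that the
    # strongest model released on a given day is considered first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best = float("-inf")
    for model, released, score in ordered:
        # Assumption: only a strict improvement extends the line.
        if score > best:
            best = score
            frontier.append((model, released, score))
    return frontier


for model, released, score in best_to_date(POINTS):
    print(f"{released.isoformat()}  {model}: {score}")
```

On this subset the sketch selects the same seven models that the chart labels on its line, from GPT-3.5 Turbo through o3.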

Ranking

1. o3: 86.8
2. o4-mini: 84.3
3. Kimi-k1.5: 74.9
4. Llama 4 Maverick: 73.7
5. GPT-4.1 Mini: 73.1
6. GPT-4.5: 72.3
7. GPT-4.1: 72.2
8. o1: 71.8
9. QvQ-72B-Preview: 71.4
10. Llama 4 Scout: 70.7
11. Pixtral Large: 69.4
12. Grok-2: 69
13. Grok-2 mini: 68.1
14. Gemini 1.5 Pro: 68.1
15. Qwen2.5-Omni-7B: 67.9
16. Claude 3.5 Sonnet: 67.7
17. Mistral Small 3.2 24B Instruct: 67.1
18. Gemini 1.5 Flash: 65.8
19. DeepSeek VL2: 62.8
20. Phi-4-multimodal-instruct: 62.4
21. GPT-4o: 61.4
22. DeepSeek VL2 Small: 60.7
23. Pixtral-12B: 58
24. Llama 3.2 90B Instruct: 57.3
25. GPT-4o-mini: 56.7
26. GPT-4.1 Nano: 56.2
27. Gemini 1.5 Flash 8B: 54.7
28. DeepSeek VL2 Tiny: 53.6
29. Grok-1.5V: 52.8
30. Grok-1.5: 52.8
31. Llama 3.2 11B Instruct: 51.5
32. Gemini 1.0 Pro: 46.6
33. Phi-3.5-vision-instruct: 43.9
34. GPT-3.5 Turbo: 0
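
The headline figures at the top of the page (34 models, top score 86.8, median 65.8) can be re-derived from the ranking above. The sketch below is illustrative only: the page does not say how its median is defined, and the variable names are hypothetical. With an even number of models there are two middle values, and the reported 65.8 coincides with the lower one, which Python's `statistics.median_low` returns, whereas the conventional median averages the two.

```python
import statistics

# Scores copied from the ranking, in rank order (34 entries).
SCORES = [
    86.8, 84.3, 74.9, 73.7, 73.1, 72.3, 72.2, 71.8, 71.4, 70.7,
    69.4, 69.0, 68.1, 68.1, 67.9, 67.7, 67.1, 65.8, 62.8, 62.4,
    61.4, 60.7, 58.0, 57.3, 56.7, 56.2, 54.7, 53.6, 52.8, 52.8,
    51.5, 46.6, 43.9, 0.0,
]

model_count = len(SCORES)
top_score = max(SCORES)

# Lower of the two middle values; this matches the figure shown on the page.
lower_median = statistics.median_low(SCORES)

# Conventional median: the mean of the two middle values (65.8 and 67.1).
mean_of_middle = statistics.median(SCORES)

print(model_count, top_score, lower_median, round(mean_of_middle, 2))
```

The difference between the two definitions is small here (65.8 versus roughly 66.45), but it is worth keeping in mind when comparing the median across benchmarks with different model counts.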
