MathVista
Models: 34
Top score: 86.8
Median: 65.8
MathVista evaluates mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples derived from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining challenges from diverse mathematical and visual tasks to assess models' ability to understand complex figures and perform rigorous reasoning.
State of the art over time
[Chart: each point is a model at its release date; the line traces the best score to date.]
Ranking
| Rank | Model | Score (%) |
| --- | --- | --- |
| 1 | o3 | 86.8 |
| 2 | o4-mini | 84.3 |
| 3 | Kimi-k1.5 | 74.9 |
| 4 | Llama 4 Maverick | 73.7 |
| 5 | GPT-4.1 Mini | 73.1 |
| 6 | GPT-4.5 | 72.3 |
| 7 | GPT-4.1 | 72.2 |
| 8 | o1 | 71.8 |
| 9 | QvQ-72B-Preview | 71.4 |
| 10 | Llama 4 Scout | 70.7 |
| 11 | Pixtral Large | 69.4 |
| 12 | Grok-2 | 69.0 |
| 13 | Grok-2 mini | 68.1 |
| 14 | Gemini 1.5 Pro | 68.1 |
| 15 | Qwen2.5-Omni-7B | 67.9 |
| 16 | Claude 3.5 Sonnet | 67.7 |
| 17 | Mistral Small 3.2 24B Instruct | 67.1 |
| 18 | Gemini 1.5 Flash | 65.8 |
| 19 | DeepSeek VL2 | 62.8 |
| 20 | Phi-4-multimodal-instruct | 62.4 |
| 21 | GPT-4o | 61.4 |
| 22 | DeepSeek VL2 Small | 60.7 |
| 23 | Pixtral-12B | 58.0 |
| 24 | Llama 3.2 90B Instruct | 57.3 |
| 25 | GPT-4o-mini | 56.7 |
| 26 | GPT-4.1 Nano | 56.2 |
| 27 | Gemini 1.5 Flash 8B | 54.7 |
| 28 | DeepSeek VL2 Tiny | 53.6 |
| 29 | Grok-1.5V | 52.8 |
| 30 | Grok-1.5 | 52.8 |
| 31 | Llama 3.2 11B Instruct | 51.5 |
| 32 | Gemini 1.0 Pro | 46.6 |
| 33 | Phi-3.5-vision-instruct | 43.9 |
| 34 | GPT-3.5 Turbo | 0.0 |
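
The summary figures at the top of the page (34 models, top score 86.8, median 65.8) can be rechecked from the ranking table. The sketch below is one way to do that in Python. The page does not say how its median is computed, so treating it as the lower median is an assumption, and the `summarize` helper is a hypothetical name introduced only for this illustration.

```python
import statistics

# Scores copied from the ranking table, in rank order (34 entries).
scores = [
    86.8, 84.3, 74.9, 73.7, 73.1, 72.3, 72.2, 71.8, 71.4, 70.7,
    69.4, 69.0, 68.1, 68.1, 67.9, 67.7, 67.1, 65.8, 62.8, 62.4,
    61.4, 60.7, 58.0, 57.3, 56.7, 56.2, 54.7, 53.6, 52.8, 52.8,
    51.5, 46.6, 43.9, 0.0,
]


def summarize(values):
    """Hypothetical helper: recompute the headline statistics."""
    return {
        "models": len(values),
        "top_score": max(values),
        # Assumption: the page reports the lower of the two middle values.
        "median_low": statistics.median_low(values),
        # Conventional median: mean of the two middle values for an even count.
        "median_mean": statistics.median(values),
    }


summary = summarize(scores)
print(summary)
```

With an even number of entries there are two middle values, 65.8 and 67.1. `statistics.median_low` returns 65.8, which matches the figure shown on the page, while the conventional median averages the two to about 66.45. The 0 entry for GPT-3.5 Turbo is included in both calculations.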