MathVista

Models: 34

Top score: 86.8

Median: 65.8

MathVista evaluates mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples derived from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining challenges from diverse mathematical and visual tasks to assess models' ability to understand complex figures and perform rigorous reasoning.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
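
The "best score to date" line described here is a running maximum over models ordered by release date. A minimal sketch of that computation follows; the `Release` type, the `best_to_date` helper, and the sample entries are all hypothetical placeholders for illustration, not the leaderboard's actual data or code.

```python
from datetime import date
from typing import NamedTuple


class Release(NamedTuple):
    # Hypothetical record type: one model, its release date, and its score.
    model: str
    released: date
    score: float


def best_to_date(releases):
    """Return (release date, best score so far) pairs in date order."""
    frontier = []
    best = float("-inf")
    for r in sorted(releases, key=lambda r: r.released):
        # The line only moves up when a new model beats the previous best.
        best = max(best, r.score)
        frontier.append((r.released, best))
    return frontier


# Placeholder entries with made-up names, dates, and scores.
releases = [
    Release("model-a", date(2024, 1, 10), 50.0),
    Release("model-c", date(2024, 6, 20), 58.0),
    Release("model-b", date(2024, 3, 5), 62.5),
    Release("model-d", date(2024, 9, 1), 70.0),
]

for day, best in best_to_date(releases):
    print(day.isoformat(), best)
```

In this sketch a model that scores below the current best (here `model-c`) still appears as a point, but the line stays flat at the earlier best.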

Ranking

Rank	Model	Organization	Score
1	o3	OpenAI	86.8
2	o4-mini	OpenAI	84.3
3	Kimi-k1.5	Moonshot AI	74.9
4	Llama 4 Maverick	Meta	73.7
5	GPT-4.1 Mini	OpenAI	73.1
6	GPT-4.5	OpenAI	72.3
7	GPT-4.1	OpenAI	72.2
8	o1	OpenAI	71.8
9	QvQ-72B-Preview	Alibaba	71.4
10	Llama 4 Scout	Meta	70.7
11	Pixtral Large	Mistral AI	69.4
12	Grok-2	xAI	69
13	Grok-2 mini	xAI	68.1
14	Gemini 1.5 Pro	Google	68.1
15	Qwen2.5-Omni-7B	Alibaba	67.9
16	Claude 3.5 Sonnet	Anthropic	67.7
17	Mistral Small 3.2 24B Instruct	Mistral AI	67.1
18	Gemini 1.5 Flash	Google	65.8
19	DeepSeek VL2	DeepSeek	62.8
20	Phi-4-multimodal-instruct	Microsoft	62.4
21	GPT-4o	OpenAI	61.4
22	DeepSeek VL2 Small	DeepSeek	60.7
23	Pixtral-12B	Mistral AI	58
24	Llama 3.2 90B Instruct	Meta	57.3
25	GPT-4o-mini	OpenAI	56.7
26	GPT-4.1 Nano	OpenAI	56.2
27	Gemini 1.5 Flash 8B	Google	54.7
28	DeepSeek VL2 Tiny	DeepSeek	53.6
29	Grok-1.5V	xAI	52.8
30	Grok-1.5	xAI	52.8
31	Llama 3.2 11B Instruct	Meta	51.5
32	Gemini 1.0 Pro	Google	46.6
33	Phi-3.5-vision-instruct	Microsoft	43.9
34	GPT-3.5 Turbo	OpenAI	0
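
The summary figures for this benchmark (34 models, top score 86.8, median 65.8) can be checked against the ranking above. The sketch below copies the scores in rank order and uses Python's standard `statistics` module. It rests on an assumption: the page does not say how its median is computed, but with an even number of entries the displayed 65.8 matches the lower of the two middle values (`statistics.median_low`), whereas the conventional median (the mean of the two middle values, 67.1 and 65.8) comes out near 66.45.

```python
import statistics

# Scores copied from the ranking above, in rank order (including the 0 entry).
scores = [
    86.8, 84.3, 74.9, 73.7, 73.1, 72.3, 72.2, 71.8, 71.4, 70.7,
    69.4, 69, 68.1, 68.1, 67.9, 67.7, 67.1, 65.8, 62.8, 62.4,
    61.4, 60.7, 58, 57.3, 56.7, 56.2, 54.7, 53.6, 52.8, 52.8,
    51.5, 46.6, 43.9, 0,
]

summary = {
    "models": len(scores),
    "top_score": max(scores),
    # Assumption: the page reports the lower middle value for an even count.
    "median_low": statistics.median_low(scores),
    # Conventional median: mean of the two middle values.
    "median": statistics.median(scores),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that the 0 recorded for the last entry is included in the count and in both median calculations; dropping it would change the result.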

Related Multimodal benchmarks

MMMU (52), DocVQA (26), ChartQA (24), AI2D (17), MMMU-Pro (13)