MATH

Competition mathematics problems requiring multi-step symbolic reasoning.

Models: 67
Top score: 97.9
Median: 70.6

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: MATH score (y-axis, 20 to 100) against model release date (x-axis, 2023 to 2025).

Models that set a new best score:
GPT-3.5 Turbo: 43.1 (2023-03-01)
GPT-4 Turbo: 72.6 (2023-11-06)
Gemini 1.5 Pro: 86.5 (2024-02-15)
o1: 96.4 (2024-12-05)
o3-mini: 97.9 (2025-01-31)

Other models:
GPT-4: 42 (2023-03-14)
Gemini 1.0 Pro: 32.6 (2024-02-15)
Claude 3 Sonnet: 43.1 (2024-02-29)
Claude 3 Opus: 60.1 (2024-03-04)
Claude 3 Haiku: 38.9 (2024-03-13)
Gemini 1.5 Flash 8B: 58.7 (2024-03-15)
Grok-1.5: 50.6 (2024-03-28)
Mixtral 8x22B: 41.8 (2024-04-10)
Gemini 1.5 Flash: 77.9 (2024-05-01)
DeepSeek-V2.5: 74.7 (2024-05-08)
Claude 3.5 Sonnet: 78.3 (2024-06-20)
Gemma 2 9B: 36.6 (2024-06-27)
Gemma 2 27B: 42.3 (2024-07-13)
GPT-4o-mini: 70.2 (2024-07-18)
Llama 3.1 405B Instruct: 73.8 (2024-07-23)
Qwen2 72B Instruct: 59.7 (2024-07-23)
Qwen2 7B Instruct: 49.6 (2024-07-23)
Grok-2: 76.1 (2024-08-13)
Grok-2 mini: 73 (2024-08-13)
Phi-3.5-MoE-instruct: 59.5 (2024-08-23)
Phi-3.5-mini-instruct: 48.5 (2024-08-23)
o1-preview: 85.5 (2024-09-12)
Pixtral-12B: 48.1 (2024-09-17)
Qwen2.5 32B Instruct: 83.1 (2024-09-19)
Qwen2.5 72B Instruct: 83.1 (2024-09-19)
Qwen2.5 14B Instruct: 80 (2024-09-19)
Qwen2.5-Coder 7B Instruct: 46.6 (2024-09-19)
Llama 3.2 90B Instruct: 68 (2024-09-25)
Llama 3.2 11B Instruct: 51.9 (2024-09-25)
Llama 3.2 3B Instruct: 48 (2024-09-25)
Qwen2.5 7B Instruct: 75.5 (2024-10-16)
Ministral 8B Instruct: 54.5 (2024-10-16)
Claude 3.5 Haiku: 69.4 (2024-11-04)
Qwen2.5 Coder 32B Instruct: 57.2 (2024-11-11)
Nova Pro: 76.6 (2024-11-20)
Nova Lite: 73.3 (2024-11-20)
Nova Micro: 69.3 (2024-11-20)
Llama 3.3 70B Instruct: 77 (2024-12-06)
Gemini 2.0 Flash: 89.7 (2024-12-11)
Phi 4: 80.4 (2025-01-10)
Mistral Small 3 24B Instruct: 70.6 (2025-01-30)
Mistral Small 3 24B Base: 46 (2025-01-30)
Phi 4 Mini: 64 (2025-02-01)
Claude 3.7 Sonnet: 82 (2025-02-24)
Gemini 2.0 Flash Lite: 86.8 (2025-02-25)
GPT-4.5: 85 (2025-02-27)
Qwen2.5 VL 32B Instruct: 82.2 (2025-02-28)
Gemma 3 27B: 89 (2025-03-12)
Gemma 3 1B: 48 (2025-03-12)
Gemma 3 12B: 83.8 (2025-03-13)
Gemma 3 4B: 75.6 (2025-03-13)
Mistral Small 3.1 24B Instruct: 69.3 (2025-03-17)
Gemini 2.5 Pro: 92 (2025-03-25)
Qwen2.5-Omni-7B: 71.5 (2025-03-27)
Llama 4 Maverick: 61.2 (2025-04-05)
Llama 4 Scout: 50.3 (2025-04-05)
GPT-4.1: 87 (2025-04-14)
Qwen3 235B A22B: 71.8 (2025-04-28)
Mistral Small 3.2 24B Instruct: 69.4 (2025-06-20)
Kimi K2 Base: 70.2 (2025-07-11)
GPT-5: 84.7 (2025-08-07)
Kimi K2 0905: 89.1 (2025-09-05)
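
The "best score to date" line can be read as a running maximum over models sorted by release date. The sketch below illustrates that reading on a handful of points taken from the chart data; the function name `best_to_date` is hypothetical, and treating only strict improvements as new records is an assumption, since the page does not say how ties are handled.

```python
from datetime import date

# (model, release date, score): a small subset of the chart's points
points = [
    ("GPT-3.5 Turbo", date(2023, 3, 1), 43.1),
    ("GPT-4", date(2023, 3, 14), 42.0),
    ("GPT-4 Turbo", date(2023, 11, 6), 72.6),
    ("Gemini 1.5 Pro", date(2024, 2, 15), 86.5),
    ("Claude 3 Opus", date(2024, 3, 4), 60.1),
    ("o1-preview", date(2024, 9, 12), 85.5),
    ("o1", date(2024, 12, 5), 96.4),
    ("Gemini 2.0 Flash", date(2024, 12, 11), 89.7),
    ("o3-mini", date(2025, 1, 31), 97.9),
]


def best_to_date(points):
    """Return the points that strictly beat every earlier score."""
    records = []
    best = float("-inf")
    # Walk the models in release order, keeping a running maximum.
    for model, released, score in sorted(points, key=lambda p: p[1]):
        if score > best:  # assumption: a tie does not count as a new record
            best = score
            records.append((model, released, score))
    return records


records = best_to_date(points)
for model, released, score in records:
    print(f"{released}  {model}: {score}")
```

On this subset, models such as GPT-4 or Gemini 2.0 Flash appear as points on the chart but do not move the line, because an earlier model already scored higher.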

Ranking

1. o3-mini: 97.9
2. o1: 96.4
3. Gemini 2.5 Pro: 92
4. Gemini 2.0 Flash: 89.7
5. Kimi K2 0905: 89.1
6. Gemma 3 27B: 89
7. GPT-4.1: 87
8. Gemini 2.0 Flash Lite: 86.8
9. Gemini 1.5 Pro: 86.5
10. o1-preview: 85.5
11. GPT-4.5: 85
12. GPT-5: 84.7
13. Gemma 3 12B: 83.8
14. Qwen2.5 32B Instruct: 83.1
15. Qwen2.5 72B Instruct: 83.1
16. Qwen2.5 VL 32B Instruct: 82.2
17. Claude 3.7 Sonnet: 82
18. Phi 4: 80.4
19. Qwen2.5 14B Instruct: 80
20. Claude 3.5 Sonnet: 78.3
21. Gemini 1.5 Flash: 77.9
22. Llama 3.3 70B Instruct: 77
23. Nova Pro: 76.6
24. Grok-2: 76.1
25. Gemma 3 4B: 75.6
26. Qwen2.5 7B Instruct: 75.5
27. DeepSeek-V2.5: 74.7
28. Llama 3.1 405B Instruct: 73.8
29. Nova Lite: 73.3
30. Grok-2 mini: 73
31. GPT-4 Turbo: 72.6
32. Qwen3 235B A22B: 71.8
33. Qwen2.5-Omni-7B: 71.5
34. Mistral Small 3 24B Instruct: 70.6
35. Kimi K2 Base: 70.2
36. GPT-4o-mini: 70.2
37. Mistral Small 3.2 24B Instruct: 69.4
38. Claude 3.5 Haiku: 69.4
39. Nova Micro: 69.3
40. Mistral Small 3.1 24B Instruct: 69.3
41. Llama 3.2 90B Instruct: 68
42. Phi 4 Mini: 64
43. Llama 4 Maverick: 61.2
44. Claude 3 Opus: 60.1
45. Qwen2 72B Instruct: 59.7
46. Phi-3.5-MoE-instruct: 59.5
47. Gemini 1.5 Flash 8B: 58.7
48. Qwen2.5 Coder 32B Instruct: 57.2
49. Ministral 8B Instruct: 54.5
50. Llama 3.2 11B Instruct: 51.9
51. Grok-1.5: 50.6
52. Llama 4 Scout: 50.3
53. Qwen2 7B Instruct: 49.6
54. Phi-3.5-mini-instruct: 48.5
55. Pixtral-12B: 48.1
56. Gemma 3 1B: 48
57. Llama 3.2 3B Instruct: 48
58. Qwen2.5-Coder 7B Instruct: 46.6
59. Mistral Small 3 24B Base: 46
60. Claude 3 Sonnet: 43.1
61. GPT-3.5 Turbo: 43.1
62. Gemma 2 27B: 42.3
63. GPT-4: 42
64. Mixtral 8x22B: 41.8
65. Claude 3 Haiku: 38.9
66. Gemma 2 9B: 36.6
67. Gemini 1.0 Pro: 32.6
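
The summary figures at the top of the page (67 models, top score 97.9, median 70.6) can be checked against the ranking. The snippet below is a minimal sketch that assumes the median is the plain statistical median of the listed scores; the page does not state how it is computed. The scores are copied from the ranking in order.

```python
import statistics

# Scores from the ranking, rank 1 through rank 67
scores = [
    97.9, 96.4, 92, 89.7, 89.1, 89, 87, 86.8, 86.5, 85.5,
    85, 84.7, 83.8, 83.1, 83.1, 82.2, 82, 80.4, 80, 78.3,
    77.9, 77, 76.6, 76.1, 75.6, 75.5, 74.7, 73.8, 73.3, 73,
    72.6, 71.8, 71.5, 70.6, 70.2, 70.2, 69.4, 69.4, 69.3, 69.3,
    68, 64, 61.2, 60.1, 59.7, 59.5, 58.7, 57.2, 54.5, 51.9,
    50.6, 50.3, 49.6, 48.5, 48.1, 48, 48, 46.6, 46, 43.1,
    43.1, 42.3, 42, 41.8, 38.9, 36.6, 32.6,
]

model_count = len(scores)
top_score = max(scores)
# With an odd number of entries, the median is the middle (34th) score.
median_score = statistics.median(scores)

print(model_count, top_score, median_score)
```

With 67 entries the median is simply the 34th-ranked score, which is why it matches a single model's result rather than an average of two.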

Related Math benchmarks