MATH-500

Models: 169
Top score: 99.4
Median: 83.9

MATH-500 is a subset of the MATH dataset containing 500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem comes with a full step-by-step solution. The problems span multiple difficulty levels and seven mathematical subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Figure: MATH-500 score (0 to 100) plotted against model release date (2023 to 2025), one point per model.]

Models that set a new best score, in release order:
GPT-3.5 Turbo: 44.1 (2023-03-01)
GPT-4 Turbo: 73.7 (2023-11-06)
Gemini 1.5 Pro: 87.6 (2024-02-15)
GPT-4o: 89.3 (2024-05-13)
o1-preview: 92.4 (2024-09-12)
o1: 97 (2024-12-05)
o3-mini: 98.5 (2025-01-31)
Grok 3 mini Reasoning: 99.2 (2025-02-19)
GPT-5: 99.4 (2025-08-07)
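
The best-to-date line described above can be read as a running maximum over models ordered by release date. The following is a minimal sketch of that idea, not the page's actual implementation: the function name `best_to_date`, the tuple layout, and the choice to ignore ties (a model that only equals the current best does not start a new step) are all assumptions made for illustration. The sample values are a handful of the scores and release dates reported on this page.

```python
from datetime import date

def best_to_date(points):
    """Return the (name, score, released) points that set a new best score.

    Hypothetical helper: assumes each point is a (name, score, release_date)
    tuple and that a tie with the current best does not count as a new record.
    """
    frontier = []
    best = float("-inf")
    # Sort by release date; within one date, try the highest score first
    # so that only the day's top model can set the record.
    for name, score, released in sorted(points, key=lambda p: (p[2], -p[1])):
        if score > best:
            best = score
            frontier.append((name, score, released))
    return frontier

# A few points taken from this page (score, release date).
sample = [
    ("GPT-3.5 Turbo", 44.1, date(2023, 3, 1)),
    ("Claude Instant", 26.4, date(2023, 3, 14)),
    ("Llama 2 Chat 13B", 32.9, date(2023, 7, 18)),
    ("GPT-4 Turbo", 73.7, date(2023, 11, 6)),
    ("Claude 2.1", 37.4, date(2023, 11, 21)),
    ("Gemini 1.5 Pro", 87.6, date(2024, 2, 15)),
]

for name, score, released in best_to_date(sample):
    print(f"{released}  {name}: {score}")
```

On this sample, only GPT-3.5 Turbo, GPT-4 Turbo, and Gemini 1.5 Pro are kept, because every other point scores below the best score already reached at its release date.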

Ranking

1. GPT-5: 99.4
2. Grok 3 mini Reasoning: 99.2
3. o3: 99.2
4. Claude Sonnet 4: 99.1
5. Grok 4: 99
6. o4-mini: 98.9
7. o3-mini: 98.5
8. Qwen3 235B A22B 2507: 98.4
9. Llama Nemotron Super 49B v1.5: 98.3
10. DeepSeek-R1-0528: 98.3
11. GLM-4.5: 98.2
12. Claude Opus 4: 98.2
13. GLM 4.5 Air: 98.1
14. Gemini 2.5 Flash: 98.1
15. MiniMax M1 80k: 98
16. Qwen3-235B-A22B-Instruct-2507: 98
17. Nemotron Nano 9B V2: 97.8
18. EXAONE 4.0 32B: 97.7
19. Qwen3 30B A3B 2507: 97.6
20. Qwen3 30B A3B 2507 Instruct: 97.5
21. Kimi K2-Instruct-0905: 97.4
22. Kimi K2 Instruct: 97.4
23. MiniMax M1 40k: 97.2
24. Kimi K2: 97.1
25. Llama 3.1 Nemotron Ultra 253B v1: 97
26. o1: 97
27. Gemini 2.5 Flash Lite: 96.9
28. Solar Pro 2: 96.7
29. Gemini 2.5 Pro: 96.7
30. Llama-3.3 Nemotron Super 49B v1: 96.6
31. DeepSeek-R1: 96.6
32. Magistral Small 1: 96.3
33. Kimi-k1.5: 96.2
34. Claude 3.7 Sonnet: 96.2
35. Qwen3 32B: 96.1
36. Qwen3 14B: 96.1
37. DeepSeek R1 Zero: 95.9
38. Qwen3 30B A3B: 95.9
39. Sonar Reasoning Pro: 95.7
40. R1 1776: 95.4
41. Llama 3.1 Nemotron Nano 8B V1: 95.4
42. Llama 3.1 Nemotron Nano 4B v1.1: 94.7
43. Phi 4 Mini Reasoning: 94.6
44. DeepSeek R1 Distill Llama 70B: 94.5
45. Gemini 2.0 Flash Thinking: 94.4
46. DeepSeek R1 Distill Qwen 32B: 94.3
47. Qwen3 Coder 480B A35B Instruct: 94.2
48. DeepSeek-V3 0324: 94
49. DeepSeek R1 Distill Qwen 14B: 93.9
50. Qwen3 4B: 93.3
51. DeepSeek R1 0528 Qwen3 8B: 93.2
52. ERNIE 4.5 300B A47B: 93.1
53. Qwen3 235B A22B: 93
54. Gemini 2.0 Flash: 93
55. DeepSeek R1 Distill Qwen 7B: 92.8
56. GPT-4.1 Mini: 92.5
57. o1-preview: 92.4
58. Gemini 2.0 Pro: 92.3
59. Sonar Reasoning: 92.1
60. Magistral Medium 1: 91.7
61. GPT-4.1: 91.3
62. Mistral Medium 3: 90.7
63. QwQ-32B-Preview: 90.6
64. QwQ-32B: 90.6
65. Qwen3 8B: 90.4
66. DeepSeek-V3: 90.2
67. o1-mini: 90
68. Qwen3 1.7B: 89.4
69. Reka Flash 3: 89.3
70. Qwen3 Coder 30B A3B Instruct: 89.3
71. GPT-4o: 89.3
72. DeepSeek R1 Distill Llama 8B: 89.1
73. Llama 4 Maverick: 88.9
74. Mistral Small 3.2: 88.3
75. Gemma 3 27B Instruct: 88.3
76. Gemini 1.5 Pro: 87.6
77. Gemini 2.0 Flash Lite: 87.3
78. Grok-3: 87
79. Qwen2.5 72B Instruct: 85.8
80. Gemma 3 12B Instruct: 85.3
81. GPT-4.1 Nano: 84.8
82. Sarvam M: 84.7
83. Llama 4 Scout: 84.4
84. Nova Premier: 83.9
85. DeepSeek R1 Distill Qwen 1.5B: 83.9
86. Qwen2.5 Max: 83.5
87. Gemini 1.5 Flash: 82.7
88. Command A: 81.9
89. Sonar: 81.7
90. Phi 4: 81
91. Qwen2.5 Turbo: 80.5
92. Qwen2.5 32B Instruct: 80.5
93. GPT-4o-mini: 78.9
94. Nova Pro: 78.6
95. Llama 3.1 Tulu3 405B: 77.8
96. Grok-2: 77.8
97. Llama 3.3 70B Instruct: 77.3
98. Gemma 3n E4B Instruct: 77.1
99. Claude 3.5 Sonnet: 77.1
100. Qwen2.5 Coder 32B Instruct: 76.7
101. Gemma 3 4B Instruct: 76.6
102. Nova Lite: 76.5
103. DeepSeek-V2.5: 76.3
104. Qwen3 0.6B: 75
105. Sonar Pro: 74.5
106. DeepSeek-Coder-V2: 74.3
107. Grok: 73.7
108. GPT-4 Turbo: 73.7
109. Mistral Large 2: 73.6
110. Llama 3.1 Nemotron 70B Instruct: 73.3
111. Claude 3.5 Haiku: 72.1
112. Mistral Small 3: 71.5
113. Pixtral Large: 71.4
114. Mistral Small 3.1: 70.7
115. Devstral Medium: 70.7
116. Nova Micro: 70.3
117. Llama 3.1 405B Instruct: 70.3
118. Qwen2 72B Instruct: 70.1
119. Phi 4 Mini Instruct: 69.6
120. Phi-4-multimodal-instruct: 69.3
121. Gemma 3n E2B Instruct: 69.1
122. Granite 3.3 8B Instruct: 69
123. Granite 3.3 8B Base: 69
124. Gemini 1.5 Flash 8B: 68.9
125. Devstral Small: 68.4
126. Mistral Saba: 67.7
127. Granite 3.3 8B: 66.5
128. Qwen2.5-Coder 7B Instruct: 66
129. Llama 3.1 70B Instruct: 64.9
130. Claude 3 Opus: 64.1
131. Llama 3.2 90B Instruct: 62.9
132. Jamba 1.5 Large: 60.6
133. Jamba Large 1.7: 60
134. DeepHermes 3 - Mistral 24B: 59.5
135. Jamba 1.6 Large: 58
136. Mistral Small: 56.3
137. Mixtral 8x22B Instruct: 54.5
138. Hermes 3 - Llama-3.1 70B: 53.8
139. Reka Flash: 52.9
140. Mistral Large: 52.7
141. Llama 3.1 8B Instruct: 51.9
142. Llama 3.2 11B Instruct: 51.6
143. Llama 3 8B Instruct: 49.9
144. Llama 3.2 3B Instruct: 48.9
145. Gemma 3 1B Instruct: 48.4
146. Llama 3 70B Instruct: 48.3
147. LFM 40B: 48
148. Phi-3 Mini Instruct 3.8B: 45.7
149. GPT-3.5 Turbo: 44.1
150. Claude 3 Sonnet: 41.4
151. Mistral Medium: 40.5
152. Gemini 1.0 Pro: 40.3
153. Claude 3 Haiku: 39.4
154. Claude 2.1: 37.4
155. Jamba 1.5 Mini: 35.7
156. Solar Mini: 33.1
157. Llama 2 Chat 13B: 32.9
158. Llama 2 Chat 70B: 32.3
159. OpenChat 3.5: 30.7
160. Mixtral 8x7B Instruct: 29.9
161. DBRX Instruct: 27.9
162. Command R+: 27.9
163. Claude Instant: 26.4
164. Jamba 1.7 Mini: 25.8
165. Jamba 1.6 Mini: 25.7
166. DeepHermes 3 - Llama-3.1 8B: 21.8
167. Llama 3.2 1B Instruct: 14
168. Mistral 7B Instruct: 12.1
169. Llama 2 Chat 7B: 5.9
