AIME 2024

Models: 46
Top score: 95.8
Median: 80

AIME 2024 is the problem set of the American Invitational Mathematics Examination 2024: 30 challenging mathematical reasoning problems drawn from the AIME I and AIME II competitions. Each problem requires an integer answer from 0 to 999 and tests advanced mathematical reasoning across algebra, geometry, combinatorics, and number theory. The set is used as a benchmark for evaluating the mathematical reasoning capabilities of large language models at Olympiad-level difficulty.
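To make the answer format concrete, here is a minimal scoring sketch. It assumes exact-match grading on the integer answer and reports the percentage of problems solved; the page does not state the grading protocol, and the function names and the placeholder answer key below are hypothetical. Many listed scores are not multiples of 1/30, so individual results were presumably averaged over several sampled attempts, which this sketch does not model.

```python
def is_valid_aime_answer(answer) -> bool:
    """Return True if `answer` is an integer in the AIME range 0..999."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(answer, int) and not isinstance(answer, bool) and 0 <= answer <= 999


def aime_accuracy(predictions, references) -> float:
    """Percentage of problems whose predicted integer equals the reference.

    Assumption: exact-match grading, one prediction per problem.
    """
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = sum(
        1
        for pred, ref in zip(predictions, references)
        if is_valid_aime_answer(pred) and pred == ref
    )
    return 100.0 * correct / len(references)


# Hypothetical answer key for 30 problems (placeholder values, not real AIME answers).
references = list(range(100, 130))
# A model that gets the first 26 right and the last 4 wrong.
predictions = references[:26] + [0, 0, 0, 0]
print(round(aime_accuracy(predictions, references), 1))
```

Under these assumptions, 26 correct answers out of 30 corresponds to a score of about 86.7, and 24 out of 30 to exactly 80.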

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: AIME 2024 score (0 to 100) by model release date, 2024 to 2025]

Best score to date:
GPT-4o: 13.1 (2024-05-13)
o1-preview: 42 (2024-09-12)
QwQ-32B-Preview: 50 (2024-11-28)
o1: 74.3 (2024-12-05)
DeepSeek R1 Zero: 86.7 (2025-01-20)
o3-mini: 87.3 (2025-01-31)
Grok-3 Mini: 95.8 (2025-02-17)

Other models:
DeepSeek-V3: 39.2 (2024-12-26)
DeepSeek R1 Distill Llama 70B: 86.7 (2025-01-20)
DeepSeek R1 Distill Qwen 7B: 83.3 (2025-01-20)
DeepSeek R1 Distill Qwen 32B: 83.3 (2025-01-20)
DeepSeek R1 Distill Qwen 14B: 80 (2025-01-20)
DeepSeek R1 Distill Llama 8B: 80 (2025-01-20)
Kimi-k1.5: 77.5 (2025-01-20)
DeepSeek R1 Distill Qwen 1.5B: 52.7 (2025-01-20)
Gemini 2.0 Flash Thinking: 73.3 (2025-01-21)
Grok-3: 93.3 (2025-02-17)
Claude 3.7 Sonnet: 80 (2025-02-24)
GPT-4.5: 36.7 (2025-02-27)
QwQ-32B: 79.5 (2025-03-05)
o1-pro: 86 (2025-03-19)
Gemini 2.5 Pro: 92 (2025-03-25)
DeepSeek-V3 0324: 59.4 (2025-03-25)
GPT-4.1 Mini: 49.6 (2025-04-14)
GPT-4.1: 48.1 (2025-04-14)
GPT-4.1 Nano: 29.4 (2025-04-14)
o4-mini: 93.4 (2025-04-16)
o3: 91.6 (2025-04-16)
Granite 3.3 8B Instruct: 81.2 (2025-04-16)
Granite 3.3 8B Base: 81.2 (2025-04-16)
Gemini 2.5 Flash: 88 (2025-04-17)
Qwen3 235B A22B: 85.7 (2025-04-28)
Qwen3 32B: 81.4 (2025-04-28)
Qwen3 30B A3B: 80.4 (2025-04-28)
Phi 4 Reasoning Plus: 81.3 (2025-04-30)
Phi 4 Reasoning: 75.3 (2025-04-30)
DeepSeek-R1-0528: 91.4 (2025-05-28)
Magistral Medium: 73.6 (2025-06-10)
Magistral Small 2506: 70.7 (2025-06-10)
Kimi K2 Instruct: 69.6 (2025-07-11)
Kimi K2: 69.6 (2025-07-11)
GLM 4.5 Air: 89.4 (2025-07-25)
GLM-4.5: 91 (2025-07-28)
DeepSeek-V3.1: 66.3 (2025-08-21)
Kimi K2 0905: 72 (2025-09-05)
Kimi K2-Instruct-0905: 69.6 (2025-09-05)
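The best-score-to-date line can be sketched as a running maximum over models ordered by release date. The snippet below is an illustration only: the function name and the tie-breaking rule (highest score first among same-day releases, and only a strictly higher score advances the line) are assumptions, and the sample uses a subset of the data points from the chart.

```python
from datetime import date


def best_to_date(points):
    """Return the (name, score, release_date) points that set a new best score.

    Assumption: same-day releases are considered highest score first, and a
    point joins the frontier only if it strictly beats the current best.
    """
    frontier = []
    best = float("-inf")
    for name, score, released in sorted(points, key=lambda p: (p[2], -p[1])):
        if score > best:
            best = score
            frontier.append((name, score, released))
    return frontier


# Subset of the chart's data points: (model, score, release date).
SAMPLE_POINTS = [
    ("GPT-4o", 13.1, date(2024, 5, 13)),
    ("o1-preview", 42, date(2024, 9, 12)),
    ("QwQ-32B-Preview", 50, date(2024, 11, 28)),
    ("o1", 74.3, date(2024, 12, 5)),
    ("DeepSeek-V3", 39.2, date(2024, 12, 26)),
    ("DeepSeek R1 Zero", 86.7, date(2025, 1, 20)),
    ("Kimi-k1.5", 77.5, date(2025, 1, 20)),
    ("o3-mini", 87.3, date(2025, 1, 31)),
    ("Grok-3 Mini", 95.8, date(2025, 2, 17)),
    ("Grok-3", 93.3, date(2025, 2, 17)),
    ("o4-mini", 93.4, date(2025, 4, 16)),
]

for name, score, released in best_to_date(SAMPLE_POINTS):
    print(f"{released.isoformat()}  {name}: {score}")
```

On this subset the frontier runs from GPT-4o through o1-preview, QwQ-32B-Preview, o1, DeepSeek R1 Zero, and o3-mini to Grok-3 Mini; later releases such as o4-mini do not move the line because none exceeds 95.8.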

Ranking

1. Grok-3 Mini: 95.8
2. o4-mini: 93.4
3. Grok-3: 93.3
4. Gemini 2.5 Pro: 92
5. o3: 91.6
6. DeepSeek-R1-0528: 91.4
7. GLM-4.5: 91
8. GLM 4.5 Air: 89.4
9. Gemini 2.5 Flash: 88
10. o3-mini: 87.3
11. DeepSeek R1 Zero: 86.7
12. DeepSeek R1 Distill Llama 70B: 86.7
13. o1-pro: 86
14. Qwen3 235B A22B: 85.7
15. DeepSeek R1 Distill Qwen 7B: 83.3
16. DeepSeek R1 Distill Qwen 32B: 83.3
17. Qwen3 32B: 81.4
18. Phi 4 Reasoning Plus: 81.3
19. Granite 3.3 8B Instruct: 81.2
20. Granite 3.3 8B Base: 81.2
21. Qwen3 30B A3B: 80.4
22. DeepSeek R1 Distill Qwen 14B: 80
23. DeepSeek R1 Distill Llama 8B: 80
24. Claude 3.7 Sonnet: 80
25. QwQ-32B: 79.5
26. Kimi-k1.5: 77.5
27. Phi 4 Reasoning: 75.3
28. o1: 74.3
29. Magistral Medium: 73.6
30. Gemini 2.0 Flash Thinking: 73.3
31. Kimi K2 0905: 72
32. Magistral Small 2506: 70.7
33. Kimi K2-Instruct-0905: 69.6
34. Kimi K2 Instruct: 69.6
35. Kimi K2: 69.6
36. DeepSeek-V3.1: 66.3
37. DeepSeek-V3 0324: 59.4
38. DeepSeek R1 Distill Qwen 1.5B: 52.7
39. QwQ-32B-Preview: 50
40. GPT-4.1 Mini: 49.6
41. GPT-4.1: 48.1
42. o1-preview: 42
43. DeepSeek-V3: 39.2
44. GPT-4.5: 36.7
45. GPT-4.1 Nano: 29.4
46. GPT-4o: 13.1
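
The summary figures at the top of the page (46 models, top score 95.8, median 80) can be reproduced from the ranking. The sketch below is one way to check them; the function name is hypothetical, and the scores are copied from the ranking in rank order.

```python
from statistics import median

# Scores from the ranking above, in rank order (46 models).
SCORES = [
    95.8, 93.4, 93.3, 92, 91.6, 91.4, 91, 89.4, 88, 87.3,
    86.7, 86.7, 86, 85.7, 83.3, 83.3, 81.4, 81.3, 81.2, 81.2,
    80.4, 80, 80, 80, 79.5, 77.5, 75.3, 74.3, 73.6, 73.3,
    72, 70.7, 69.6, 69.6, 69.6, 66.3, 59.4, 52.7, 50, 49.6,
    48.1, 42, 39.2, 36.7, 29.4, 13.1,
]


def summarize(scores):
    """Return the model count, top score, and median of a list of scores."""
    return {
        "models": len(scores),
        "top": max(scores),
        # With an even count, the median is the mean of the two middle values.
        "median": median(scores),
    }


print(summarize(SCORES))
```

With 46 entries the median is the mean of the 23rd and 24th scores, both of which are 80, which matches the median shown in the page summary.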
