AIME 2024
Models: 46
Top score: 95.8
Median: 80
The American Invitational Mathematics Examination (AIME) 2024 benchmark consists of 30 challenging mathematical reasoning problems drawn from the AIME I and AIME II competitions. Each problem has a single integer answer from 0 to 999 and tests advanced mathematical reasoning across algebra, geometry, combinatorics, and number theory. It is used as a benchmark for evaluating the mathematical reasoning capabilities of large language models at Olympiad-level difficulty.
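Because every answer is an integer from 0 to 999, grading can be reduced to an exact match on the parsed integer. The following is a minimal sketch of that idea, not the harness used to produce the scores on this page; the function names `parse_answer` and `accuracy_percent`, the input format, and the placeholder answer key are all assumptions made for illustration.

```python
import re


def parse_answer(text):
    """Parse a final answer string into an int in 0..999, or None if invalid.

    Hypothetical helper: accepts one to three ASCII digits, so "073" -> 73.
    """
    match = re.fullmatch(r"[0-9]{1,3}", text.strip())
    return int(match.group()) if match else None


def accuracy_percent(responses, answer_key):
    """Percentage of problems whose parsed response equals the key exactly."""
    if not answer_key or len(responses) != len(answer_key):
        raise ValueError("need exactly one response per problem")
    correct = sum(
        parse_answer(response) == gold
        for response, gold in zip(responses, answer_key)
    )
    return 100.0 * correct / len(answer_key)


# Placeholder key and responses, NOT real AIME 2024 answers.
placeholder_key = [73, 5, 999]
placeholder_responses = ["073", "6", "999"]  # 2 of 3 match
example_score = accuracy_percent(placeholder_responses, placeholder_key)
```

A single pass over 30 problems can only produce multiples of 100/30 (about 3.33), so scores such as 95.8 presumably come from averaging over several sampled runs or from a different protocol; the page does not say which.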
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
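The "best score to date" line described above is a running maximum over models ordered by release date. A small sketch follows, assuming each point is a `(release_date, score)` pair; the function name `best_to_date` and the sample points are hypothetical, since the page lists no release dates.

```python
from datetime import date


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in date order."""
    frontier = []
    best = float("-inf")
    for released, score in sorted(points):
        best = max(best, score)  # the line never decreases
        frontier.append((released, best))
    return frontier


# Illustrative points only; these dates and scores are not from the leaderboard.
sample_points = [
    (date(2024, 6, 1), 35.0),
    (date(2024, 1, 1), 40.0),
    (date(2024, 9, 1), 70.0),
]
sample_frontier = best_to_date(sample_points)
```

A model that scores below the current best still appears as a point on the chart, but it leaves the line flat.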
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Grok-3 Mini | 95.8 |
| 2 | o4-mini | 93.4 |
| 3 | Grok-3 | 93.3 |
| 4 | Gemini 2.5 Pro | 92 |
| 5 | o3 | 91.6 |
| 6 | DeepSeek-R1-0528 | 91.4 |
| 7 | GLM-4.5 | 91 |
| 8 | GLM 4.5 Air | 89.4 |
| 9 | Gemini 2.5 Flash | 88 |
| 10 | o3-mini | 87.3 |
| 11 | DeepSeek R1 Zero | 86.7 |
| 12 | DeepSeek R1 Distill Llama 70B | 86.7 |
| 13 | o1-pro | 86 |
| 14 | Qwen3 235B A22B | 85.7 |
| 15 | DeepSeek R1 Distill Qwen 7B | 83.3 |
| 16 | DeepSeek R1 Distill Qwen 32B | 83.3 |
| 17 | Qwen3 32B | 81.4 |
| 18 | Phi 4 Reasoning Plus | 81.3 |
| 19 | Granite 3.3 8B Instruct | 81.2 |
| 20 | Granite 3.3 8B Base | 81.2 |
| 21 | Qwen3 30B A3B | 80.4 |
| 22 | DeepSeek R1 Distill Qwen 14B | 80 |
| 23 | DeepSeek R1 Distill Llama 8B | 80 |
| 24 | Claude 3.7 Sonnet | 80 |
| 25 | QwQ-32B | 79.5 |
| 26 | Kimi-k1.5 | 77.5 |
| 27 | Phi 4 Reasoning | 75.3 |
| 28 | o1 | 74.3 |
| 29 | Magistral Medium | 73.6 |
| 30 | Gemini 2.0 Flash Thinking | 73.3 |
| 31 | Kimi K2 0905 | 72 |
| 32 | Magistral Small 2506 | 70.7 |
| 33 | Kimi K2-Instruct-0905 | 69.6 |
| 34 | Kimi K2 Instruct | 69.6 |
| 35 | Kimi K2 | 69.6 |
| 36 | DeepSeek-V3.1 | 66.3 |
| 37 | DeepSeek-V3 0324 | 59.4 |
| 38 | DeepSeek R1 Distill Qwen 1.5B | 52.7 |
| 39 | QwQ-32B-Preview | 50 |
| 40 | GPT-4.1 Mini | 49.6 |
| 41 | GPT-4.1 | 48.1 |
| 42 | o1-preview | 42 |
| 43 | DeepSeek-V3 | 39.2 |
| 44 | GPT-4.5 | 36.7 |
| 45 | GPT-4.1 Nano | 29.4 |
| 46 | GPT-4o | 13.1 |
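The summary figures at the top of the page (model count, top score, median) can be recomputed from the ranking table. The sketch below copies the 46 scores from the table in rank order; the variable names are illustrative.

```python
from statistics import median

# Scores copied from the ranking table above, in rank order.
scores = [
    95.8, 93.4, 93.3, 92, 91.6, 91.4, 91, 89.4, 88, 87.3,
    86.7, 86.7, 86, 85.7, 83.3, 83.3, 81.4, 81.3, 81.2, 81.2,
    80.4, 80, 80, 80, 79.5, 77.5, 75.3, 74.3, 73.6, 73.3,
    72, 70.7, 69.6, 69.6, 69.6, 66.3, 59.4, 52.7, 50, 49.6,
    48.1, 42, 39.2, 36.7, 29.4, 13.1,
]

model_count = len(scores)
top_score = max(scores)
# With an even number of entries, the median is the mean of the two middle
# values (ranks 23 and 24, both 80 here).
median_score = median(scores)

print(model_count, top_score, median_score)
```

With 46 entries, the median falls between ranks 23 and 24, which both score 80, matching the figure reported in the header.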