GSM8K
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic operations.
Models: 45
Top score: 97.3
Median: 91.1
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
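
The "best score to date" line described above is a running maximum over models ordered by release date. A minimal sketch of that computation follows; the function name `best_to_date` is hypothetical, and the dates and scores in the example are made-up placeholders, not values from this leaderboard (the page text does not include release dates).

```python
from datetime import date

def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in date order.

    `points` is an iterable of (release_date, score) tuples.
    """
    best = float("-inf")
    frontier = []
    for released, score in sorted(points):
        best = max(best, score)  # the line never goes down
        frontier.append((released, best))
    return frontier

# Placeholder data for illustration only (NOT real leaderboard entries).
example_points = [
    (date(2000, 6, 1), 55.0),
    (date(2000, 1, 1), 60.0),
    (date(2001, 1, 1), 80.0),
]
frontier = best_to_date(example_points)
```

In this sketch, a model that scores below the current best still gets a point on the chart, but the line stays flat at the earlier maximum.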
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Kimi K2 Instruct | 97.3 |
| 2 | o1 | 97.1 |
| 3 | GPT-4.5 | 97 |
| 4 | Llama 3.1 405B Instruct | 96.8 |
| 5 | Claude 3.5 Sonnet | 96.4 |
| 6 | Qwen2.5 32B Instruct | 95.9 |
| 7 | Gemma 3 27B | 95.9 |
| 8 | Qwen2.5 72B Instruct | 95.8 |
| 9 | DeepSeek-V2.5 | 95.1 |
| 10 | Claude 3 Opus | 95 |
| 11 | Qwen2.5 14B Instruct | 94.8 |
| 12 | Nova Pro | 94.8 |
| 13 | Nova Lite | 94.5 |
| 14 | Gemma 3 12B | 94.4 |
| 15 | Qwen3 235B A22B | 94.4 |
| 16 | Mistral Large 2 | 93 |
| 17 | Nova Micro | 92.3 |
| 18 | Claude 3 Sonnet | 92.3 |
| 19 | Kimi K2 Base | 92.1 |
| 20 | Qwen2.5 7B Instruct | 91.6 |
| 21 | Llama 3.1 Nemotron 70B Instruct | 91.4 |
| 22 | Qwen2 72B Instruct | 91.1 |
| 23 | Qwen2.5 Coder 32B Instruct | 91.1 |
| 24 | Gemini 1.5 Pro | 90.8 |
| 25 | Grok-1.5 | 90 |
| 26 | Gemma 3 4B | 89.2 |
| 27 | Claude 3 Haiku | 88.9 |
| 28 | Qwen2.5-Omni-7B | 88.7 |
| 29 | Phi-3.5-MoE-instruct | 88.7 |
| 30 | Phi 4 Mini | 88.6 |
| 31 | Jamba 1.5 Large | 87 |
| 32 | Phi-3.5-mini-instruct | 86.2 |
| 33 | Gemini 1.5 Flash | 86.2 |
| 34 | Qwen2.5-Coder 7B Instruct | 83.9 |
| 35 | Qwen2 7B Instruct | 82.3 |
| 36 | Granite 3.3 8B Instruct | 80.9 |
| 37 | Mistral Small 3 24B Base | 80.7 |
| 38 | Llama 3.2 3B Instruct | 77.7 |
| 39 | Jamba 1.5 Mini | 75.8 |
| 40 | Gemma 2 27B | 74 |
| 41 | Command R+ | 70.7 |
| 42 | IBM Granite 4.0 Tiny Preview | 70.1 |
| 43 | Gemma 2 9B | 68.6 |
| 44 | Gemma 3 1B | 62.8 |
| 45 | Granite 3.3 8B Base | 59 |
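
The summary figures at the top of the page (model count, top score, median) can be recomputed from the ranking. The sketch below assumes the scores are copied from the table above in rank order; the names `SCORES` and `summarize` are illustrative and not part of any published tooling.

```python
from statistics import median

# Scores transcribed from the ranking table, rank 1 through rank 45.
SCORES = [
    97.3, 97.1, 97, 96.8, 96.4, 95.9, 95.9, 95.8, 95.1, 95,
    94.8, 94.8, 94.5, 94.4, 94.4, 93, 92.3, 92.3, 92.1, 91.6,
    91.4, 91.1, 91.1, 90.8, 90, 89.2, 88.9, 88.7, 88.7, 88.6,
    87, 86.2, 86.2, 83.9, 82.3, 80.9, 80.7, 77.7, 75.8, 74,
    70.7, 70.1, 68.6, 62.8, 59,
]

def summarize(scores):
    """Compute the three headline statistics for a list of scores."""
    return {
        "models": len(scores),
        "top_score": max(scores),
        "median": median(scores),  # middle value; 23rd of 45 here
    }

summary = summarize(SCORES)
print(summary)
```

With an odd number of entries the median is simply the middle row of the ranking, which is why it matches the score at rank 23.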