GSM8K

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic operations.

Models: 45
Top score: 97.3
Median: 91.1

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: score (y-axis, ticks at 50, 63, 75, 88, 100) against release date (x-axis, 2024 to 2025).]

Points on the best-score-to-date line:
Gemini 1.5 Pro: 90.8 (2024-02-15)
Claude 3 Sonnet: 92.3 (2024-02-29)
Claude 3 Opus: 95 (2024-03-04)
DeepSeek-V2.5: 95.1 (2024-05-08)
Claude 3.5 Sonnet: 96.4 (2024-06-20)
Llama 3.1 405B Instruct: 96.8 (2024-07-23)
o1: 97.1 (2024-12-05)
Kimi K2 Instruct: 97.3 (2025-07-11)

Other points:
Claude 3 Haiku: 88.9 (2024-03-13)
Grok-1.5: 90 (2024-03-28)
Command R+: 70.7 (2024-04-04)
Gemini 1.5 Flash: 86.2 (2024-05-01)
Gemma 2 9B: 68.6 (2024-06-27)
Gemma 2 27B: 74 (2024-07-13)
Qwen2 72B Instruct: 91.1 (2024-07-23)
Qwen2 7B Instruct: 82.3 (2024-07-23)
Mistral Large 2: 93 (2024-07-24)
Jamba 1.5 Large: 87 (2024-08-22)
Jamba 1.5 Mini: 75.8 (2024-08-22)
Phi-3.5-MoE-instruct: 88.7 (2024-08-23)
Phi-3.5-mini-instruct: 86.2 (2024-08-23)
Qwen2.5 32B Instruct: 95.9 (2024-09-19)
Qwen2.5 72B Instruct: 95.8 (2024-09-19)
Qwen2.5 14B Instruct: 94.8 (2024-09-19)
Qwen2.5-Coder 7B Instruct: 83.9 (2024-09-19)
Llama 3.2 3B Instruct: 77.7 (2024-09-25)
Llama 3.1 Nemotron 70B Instruct: 91.4 (2024-10-01)
Qwen2.5 7B Instruct: 91.6 (2024-10-16)
Qwen2.5 Coder 32B Instruct: 91.1 (2024-11-11)
Nova Pro: 94.8 (2024-11-20)
Nova Lite: 94.5 (2024-11-20)
Nova Micro: 92.3 (2024-11-20)
Mistral Small 3 24B Base: 80.7 (2025-01-30)
Phi 4 Mini: 88.6 (2025-02-01)
GPT-4.5: 97 (2025-02-27)
Gemma 3 27B: 95.9 (2025-03-12)
Gemma 3 1B: 62.8 (2025-03-12)
Gemma 3 12B: 94.4 (2025-03-13)
Gemma 3 4B: 89.2 (2025-03-13)
Qwen2.5-Omni-7B: 88.7 (2025-03-27)
Granite 3.3 8B Instruct: 80.9 (2025-04-16)
Granite 3.3 8B Base: 59 (2025-04-16)
Qwen3 235B A22B: 94.4 (2025-04-28)
IBM Granite 4.0 Tiny Preview: 70.1 (2025-05-02)
Kimi K2 Base: 92.1 (2025-07-11)
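
The "best score to date" line is a running maximum over models ordered by release date. The sketch below shows one way to compute it; it is an illustration only, not the page's actual plotting code. The function name `best_to_date` is hypothetical, the sample uses only a subset of the chart's points, and treating ties as not setting a new record is an assumption.

```python
from datetime import date

# A subset of (release date, model, score) points copied from the chart data.
points = [
    (date(2024, 2, 15), "Gemini 1.5 Pro", 90.8),
    (date(2024, 2, 29), "Claude 3 Sonnet", 92.3),
    (date(2024, 3, 4), "Claude 3 Opus", 95.0),
    (date(2024, 3, 13), "Claude 3 Haiku", 88.9),
    (date(2024, 3, 28), "Grok-1.5", 90.0),
    (date(2024, 5, 8), "DeepSeek-V2.5", 95.1),
    (date(2024, 6, 20), "Claude 3.5 Sonnet", 96.4),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order.

    Hypothetical helper: a tie with the current best is assumed
    not to count as a new record.
    """
    frontier = []
    best = float("-inf")
    # Tuples sort by their first field, so this walks the points by date.
    for released, model, score in sorted(points):
        if score > best:
            best = score
            frontier.append((released, model, score))
    return frontier


for released, model, score in best_to_date(points):
    print(f"{released.isoformat()}  {model}: {score}")
```

In this sample, Claude 3 Haiku and Grok-1.5 are skipped because Claude 3 Opus had already reached 95 by the time they were released.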

Ranking

1. Kimi K2 Instruct: 97.3
2. o1: 97.1
3. GPT-4.5: 97
4. Llama 3.1 405B Instruct: 96.8
5. Claude 3.5 Sonnet: 96.4
6. Qwen2.5 32B Instruct: 95.9
7. Gemma 3 27B: 95.9
8. Qwen2.5 72B Instruct: 95.8
9. DeepSeek-V2.5: 95.1
10. Claude 3 Opus: 95
11. Qwen2.5 14B Instruct: 94.8
12. Nova Pro: 94.8
13. Nova Lite: 94.5
14. Gemma 3 12B: 94.4
15. Qwen3 235B A22B: 94.4
16. Mistral Large 2: 93
17. Nova Micro: 92.3
18. Claude 3 Sonnet: 92.3
19. Kimi K2 Base: 92.1
20. Qwen2.5 7B Instruct: 91.6
21. Llama 3.1 Nemotron 70B Instruct: 91.4
22. Qwen2 72B Instruct: 91.1
23. Qwen2.5 Coder 32B Instruct: 91.1
24. Gemini 1.5 Pro: 90.8
25. Grok-1.5: 90
26. Gemma 3 4B: 89.2
27. Claude 3 Haiku: 88.9
28. Qwen2.5-Omni-7B: 88.7
29. Phi-3.5-MoE-instruct: 88.7
30. Phi 4 Mini: 88.6
31. Jamba 1.5 Large: 87
32. Phi-3.5-mini-instruct: 86.2
33. Gemini 1.5 Flash: 86.2
34. Qwen2.5-Coder 7B Instruct: 83.9
35. Qwen2 7B Instruct: 82.3
36. Granite 3.3 8B Instruct: 80.9
37. Mistral Small 3 24B Base: 80.7
38. Llama 3.2 3B Instruct: 77.7
39. Jamba 1.5 Mini: 75.8
40. Gemma 2 27B: 74
41. Command R+: 70.7
42. IBM Granite 4.0 Tiny Preview: 70.1
43. Gemma 2 9B: 68.6
44. Gemma 3 1B: 62.8
45. Granite 3.3 8B Base: 59
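
The summary figures at the top of the page (45 models, top score 97.3, median 91.1) can be recomputed from the ranking. The snippet below is a small sanity check using Python's standard `statistics` module; the variable names are illustrative, and the scores are copied from the ranking in order.

```python
from statistics import median

# Scores copied from the ranking, rank 1 through rank 45.
scores = [
    97.3, 97.1, 97, 96.8, 96.4, 95.9, 95.9, 95.8, 95.1, 95,
    94.8, 94.8, 94.5, 94.4, 94.4, 93, 92.3, 92.3, 92.1, 91.6,
    91.4, 91.1, 91.1, 90.8, 90, 89.2, 88.9, 88.7, 88.7, 88.6,
    87, 86.2, 86.2, 83.9, 82.3, 80.9, 80.7, 77.7, 75.8, 74,
    70.7, 70.1, 68.6, 62.8, 59,
]

n_models = len(scores)
top_score = max(scores)
# With an odd number of entries, the median is the middle (23rd) score.
median_score = median(scores)

print(f"Models: {n_models}, top score: {top_score}, median: {median_score}")
```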

Related Math benchmarks