Math

GSM8K

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic operations.


Models: 45

Top score: 97.3

Median: 91.1


State of the art over time

Each point is a model at its release date; the line traces the best score to date.
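
The "best score to date" line described above is a running maximum over models ordered by release date. A minimal sketch of that computation follows; the function name `running_best` and the sample points are hypothetical placeholders for illustration, not values taken from this leaderboard.

```python
from datetime import date


def running_best(points):
    """Return (release_date, best_score_so_far) pairs in chronological order."""
    best = float("-inf")
    trace = []
    # Sort by release date so the maximum only ever looks backwards in time.
    for released, score in sorted(points):
        best = max(best, score)
        trace.append((released, best))
    return trace


# Hypothetical placeholder points (NOT real leaderboard entries).
points = [
    (date(2024, 6, 1), 90.0),
    (date(2024, 1, 1), 80.0),
    (date(2024, 3, 1), 75.0),
]
trace = running_best(points)
for released, best in trace:
    print(released.isoformat(), best)
```

A model that scores below the current best leaves the line flat at its release date, which is why the trace never decreases.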

Ranking

Rank	Model	Organization	Score
1	Kimi K2 Instruct	Moonshot AI	97.3
2	o1	OpenAI	97.1
3	GPT-4.5	OpenAI	97
4	Llama 3.1 405B Instruct	Meta	96.8
5	Claude 3.5 Sonnet	Anthropic	96.4
6	Qwen2.5 32B Instruct	Alibaba	95.9
7	Gemma 3 27B	Google	95.9
8	Qwen2.5 72B Instruct	Alibaba	95.8
9	DeepSeek-V2.5	DeepSeek	95.1
10	Claude 3 Opus	Anthropic	95
11	Qwen2.5 14B Instruct	Alibaba	94.8
12	Nova Pro	Amazon	94.8
13	Nova Lite	Amazon	94.5
14	Gemma 3 12B	Google	94.4
15	Qwen3 235B A22B	Alibaba	94.4
16	Mistral Large 2	Mistral AI	93
17	Nova Micro	Amazon	92.3
18	Claude 3 Sonnet	Anthropic	92.3
19	Kimi K2 Base	Moonshot AI	92.1
20	Qwen2.5 7B Instruct	Alibaba	91.6
21	Llama 3.1 Nemotron 70B Instruct	NVIDIA	91.4
22	Qwen2 72B Instruct	Alibaba	91.1
23	Qwen2.5 Coder 32B Instruct	Alibaba	91.1
24	Gemini 1.5 Pro	Google	90.8
25	Grok-1.5	xAI	90
26	Gemma 3 4B	Google	89.2
27	Claude 3 Haiku	Anthropic	88.9
28	Qwen2.5-Omni-7B	Alibaba	88.7
29	Phi-3.5-MoE-instruct	Microsoft	88.7
30	Phi 4 Mini	Microsoft	88.6
31	Jamba 1.5 Large	AI21 Labs	87
32	Phi-3.5-mini-instruct	Microsoft	86.2
33	Gemini 1.5 Flash	Google	86.2
34	Qwen2.5-Coder 7B Instruct	Alibaba	83.9
35	Qwen2 7B Instruct	Alibaba	82.3
36	Granite 3.3 8B Instruct	IBM	80.9
37	Mistral Small 3 24B Base	Mistral AI	80.7
38	Llama 3.2 3B Instruct	Meta	77.7
39	Jamba 1.5 Mini	AI21 Labs	75.8
40	Gemma 2 27B	Google	74
41	Command R+	Cohere	70.7
42	IBM Granite 4.0 Tiny Preview	IBM	70.1
43	Gemma 2 9B	Google	68.6
44	Gemma 3 1B	Google	62.8
45	Granite 3.3 8B Base	IBM	59
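
The summary figures near the top of the page (45 models, top score 97.3, median 91.1) can be cross-checked against the ranking. The sketch below copies the 45 scores from the ranking in rank order and recomputes the three values; the helper name `summarize` is an illustrative assumption, not part of the leaderboard.

```python
from statistics import median

# Scores copied from the ranking above, in rank order (ranks 1 to 45).
scores = [
    97.3, 97.1, 97, 96.8, 96.4, 95.9, 95.9, 95.8, 95.1, 95,
    94.8, 94.8, 94.5, 94.4, 94.4, 93, 92.3, 92.3, 92.1, 91.6,
    91.4, 91.1, 91.1, 90.8, 90, 89.2, 88.9, 88.7, 88.7, 88.6,
    87, 86.2, 86.2, 83.9, 82.3, 80.9, 80.7, 77.7, 75.8, 74,
    70.7, 70.1, 68.6, 62.8, 59,
]


def summarize(values):
    """Return the model count, top score, and median score."""
    return {
        "models": len(values),
        "top_score": max(values),
        "median": median(values),
    }


summary = summarize(scores)
print(summary)
```

With an odd number of entries the median is simply the middle row of the ranking, here rank 23.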

Related Math benchmarks

AIME 2025 (221), MATH-500 (169), MATH (67), AIME 2024 (46), MGSM (29), HMMT 2025 (11)