MGSM

MGSM (Multilingual Grade School Math) is a benchmark of grade-school math problems.

Models: 29
Top score: 92.3
Median: 83.5

MGSM contains 250 grade-school math problems from the GSM8K dataset, each manually translated into ten typologically diverse languages: Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, and Telugu. It evaluates multilingual mathematical reasoning.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: MGSM score (y-axis, 40 to 100) by model release date (x-axis, 2023 to 2025). Points marked * set a new best score at release.]

2023-03-01  GPT-3.5 Turbo: 56.3 *
2023-03-14  GPT-4: 74.5 *
2023-11-06  GPT-4 Turbo: 88.5 *
2024-02-15  Gemini 1.5 Pro: 87.5
2024-02-29  Claude 3 Sonnet: 83.5
2024-03-04  Claude 3 Opus: 90.7 *
2024-03-13  Claude 3 Haiku: 75.1
2024-05-01  Gemini 1.5 Flash: 82.6
2024-06-20  Claude 3.5 Sonnet: 91.6 *
2024-07-18  GPT-4o-mini: 87
2024-08-23  Phi-3.5-MoE-instruct: 58.7
2024-08-23  Phi-3.5-mini-instruct: 47.9
2024-09-12  o1-preview: 90.8
2024-09-25  Llama 3.2 90B Instruct: 86.9
2024-09-25  Llama 3.2 11B Instruct: 68.9
2024-09-25  Llama 3.2 3B Instruct: 58.2
2024-11-04  Claude 3.5 Haiku: 85.6
2024-12-05  o1: 89.3
2024-12-06  Llama 3.3 70B Instruct: 91.1
2025-01-10  Phi 4: 80.6
2025-01-31  o3-mini: 92 *
2025-02-01  Phi 4 Mini: 63.9
2025-04-05  Llama 4 Scout: 90.6
2025-04-05  Llama 4 Maverick: 92.3 *
2025-04-28  Qwen3 235B A22B: 83.5
2025-05-20  Gemma 3n E4B Instructed LiteRT Preview: 60.7
2025-05-20  Gemma 3n E2B Instructed LiteRT (Preview): 53.1
2025-06-26  Gemma 3n E4B Instructed: 67
2025-06-26  Gemma 3n E2B Instructed: 53.1
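The "best score to date" line can be read as a running maximum over release dates. The sketch below is one possible way to reproduce it, not the site's actual code: the function name `best_to_date` is hypothetical, only a subset of the chart's points is included, and it assumes a model joins the line only when it strictly beats the previous best.

```python
from datetime import date

# A subset of (model, score, release date) points copied from the chart data.
points = [
    ("GPT-3.5 Turbo", 56.3, date(2023, 3, 1)),
    ("GPT-4", 74.5, date(2023, 3, 14)),
    ("GPT-4 Turbo", 88.5, date(2023, 11, 6)),
    ("Gemini 1.5 Pro", 87.5, date(2024, 2, 15)),
    ("Claude 3 Opus", 90.7, date(2024, 3, 4)),
    ("Claude 3.5 Sonnet", 91.6, date(2024, 6, 20)),
    ("Llama 3.3 70B Instruct", 91.1, date(2024, 12, 6)),
    ("o3-mini", 92.0, date(2025, 1, 31)),
    ("Llama 4 Maverick", 92.3, date(2025, 4, 5)),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order.

    Assumption: a point joins the line only if it strictly exceeds
    the best score seen so far.
    """
    frontier = []
    best = float("-inf")
    for model, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            frontier.append((model, score, released))
    return frontier


frontier = best_to_date(points)
for model, score, released in frontier:
    print(f"{released.isoformat()}  {score:5.1f}  {model}")
```

On this subset, Gemini 1.5 Pro and Llama 3.3 70B Instruct are skipped because each was released after a higher-scoring model.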

Ranking

1. Llama 4 Maverick: 92.3
2. o3-mini: 92
3. Claude 3.5 Sonnet: 91.6
4. Llama 3.3 70B Instruct: 91.1
5. o1-preview: 90.8
6. Claude 3 Opus: 90.7
7. Llama 4 Scout: 90.6
8. o1: 89.3
9. GPT-4 Turbo: 88.5
10. Gemini 1.5 Pro: 87.5
11. GPT-4o-mini: 87
12. Llama 3.2 90B Instruct: 86.9
13. Claude 3.5 Haiku: 85.6
14. Claude 3 Sonnet: 83.5
15. Qwen3 235B A22B: 83.5
16. Gemini 1.5 Flash: 82.6
17. Phi 4: 80.6
18. Claude 3 Haiku: 75.1
19. GPT-4: 74.5
20. Llama 3.2 11B Instruct: 68.9
21. Gemma 3n E4B Instructed: 67
22. Phi 4 Mini: 63.9
23. Gemma 3n E4B Instructed LiteRT Preview: 60.7
24. Phi-3.5-MoE-instruct: 58.7
25. Llama 3.2 3B Instruct: 58.2
26. GPT-3.5 Turbo: 56.3
27. Gemma 3n E2B Instructed LiteRT (Preview): 53.1
28. Gemma 3n E2B Instructed: 53.1
29. Phi-3.5-mini-instruct: 47.9
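The headline figures (29 models, top score 92.3, median 83.5) can be checked against the ranking. This is a minimal sketch that assumes the median is the plain middle value of the 29 listed scores; the page does not say how it computes its summary.

```python
from statistics import median

# Scores copied from the ranking, in rank order.
scores = [
    92.3, 92, 91.6, 91.1, 90.8, 90.7, 90.6, 89.3, 88.5, 87.5,
    87, 86.9, 85.6, 83.5, 83.5, 82.6, 80.6, 75.1, 74.5, 68.9,
    67, 63.9, 60.7, 58.7, 58.2, 56.3, 53.1, 53.1, 47.9,
]

model_count = len(scores)      # number of ranked models
top_score = max(scores)        # best score on the leaderboard
median_score = median(scores)  # middle value (15th of 29 scores)

print(f"Models: {model_count}")
print(f"Top score: {top_score}")
print(f"Median: {median_score}")
```

With an odd number of entries the median is a single listed score, here the 83.5 shared by ranks 14 and 15.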
