MGSM
29 models
Top score: 92.3
Median: 83.5
MGSM (Multilingual Grade School Math) is a benchmark of grade-school math problems. It contains 250 problems from the GSM8K dataset, manually translated into ten typologically diverse languages: Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, and Telugu. It evaluates multilingual mathematical reasoning.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
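The "best score to date" line is a running maximum over models ordered by release date. A minimal sketch of that computation follows, assuming each entry is a (model, release date, score) tuple. The entries, the `best_to_date` helper, and the tuple layout are hypothetical placeholders for illustration only; they are not data from this leaderboard.

```python
from datetime import date
from itertools import accumulate

# Hypothetical placeholder entries (model, release date, score).
# These are NOT real leaderboard data; they only illustrate the shape.
releases = [
    ("model-c", date(2024, 2, 1), 85.0),
    ("model-a", date(2023, 3, 1), 70.0),
    ("model-d", date(2024, 8, 1), 80.0),
    ("model-b", date(2023, 9, 1), 65.0),
]


def best_to_date(entries):
    """Return (release date, best score so far) pairs in date order."""
    ordered = sorted(entries, key=lambda entry: entry[1])
    # accumulate with max yields the running maximum of the scores.
    running = accumulate((score for _, _, score in ordered), max)
    return [(released, best) for (_, released, _), best in zip(ordered, running)]


frontier = best_to_date(releases)
for released, best in frontier:
    print(released.isoformat(), best)
```

A model that scores below the current best still appears as a point, but the line stays flat until a later release exceeds it.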
Ranking
| Rank | Model | Score |
|---|---|---|
| 1 | Llama 4 Maverick | 92.3 |
| 2 | o3-mini | 92 |
| 3 | Claude 3.5 Sonnet | 91.6 |
| 4 | Llama 3.3 70B Instruct | 91.1 |
| 5 | o1-preview | 90.8 |
| 6 | Claude 3 Opus | 90.7 |
| 7 | Llama 4 Scout | 90.6 |
| 8 | o1 | 89.3 |
| 9 | GPT-4 Turbo | 88.5 |
| 10 | Gemini 1.5 Pro | 87.5 |
| 11 | GPT-4o-mini | 87 |
| 12 | Llama 3.2 90B Instruct | 86.9 |
| 13 | Claude 3.5 Haiku | 85.6 |
| 14 | Claude 3 Sonnet | 83.5 |
| 15 | Qwen3 235B A22B | 83.5 |
| 16 | Gemini 1.5 Flash | 82.6 |
| 17 | Phi 4 | 80.6 |
| 18 | Claude 3 Haiku | 75.1 |
| 19 | GPT-4 | 74.5 |
| 20 | Llama 3.2 11B Instruct | 68.9 |
| 21 | Gemma 3n E4B Instructed | 67 |
| 22 | Phi 4 Mini | 63.9 |
| 23 | Gemma 3n E4B Instructed LiteRT Preview | 60.7 |
| 24 | Phi-3.5-MoE-instruct | 58.7 |
| 25 | Llama 3.2 3B Instruct | 58.2 |
| 26 | GPT-3.5 Turbo | 56.3 |
| 27 | Gemma 3n E2B Instructed LiteRT (Preview) | 53.1 |
| 28 | Gemma 3n E2B Instructed | 53.1 |
| 29 | Phi-3.5-mini-instruct | 47.9 |
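
The summary figures at the top of the page (model count, top score, median) can be checked against the ranking table. The sketch below copies the scores from the table in rank order and recomputes them with the Python standard library; the `scores` and `summary` names are chosen here for illustration.

```python
from statistics import median

# Scores copied from the ranking table, in rank order.
scores = [
    92.3, 92, 91.6, 91.1, 90.8, 90.7, 90.6, 89.3, 88.5, 87.5,
    87, 86.9, 85.6, 83.5, 83.5, 82.6, 80.6, 75.1, 74.5, 68.9,
    67, 63.9, 60.7, 58.7, 58.2, 56.3, 53.1, 53.1, 47.9,
]

summary = {
    "models": len(scores),
    "top_score": max(scores),
    # With an odd number of entries, the median is the middle score.
    "median": median(scores),
}
print(summary)
```

With 29 entries, the median is the 15th score in the ranking, which is one of the two models tied at 83.5.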