MultiPL-E
Models: 12
Top score: 87.9
Median: 72.8
MultiPL-E is a scalable and extensible system for translating unit test-driven code generation benchmarks to multiple programming languages. It extends the HumanEval and MBPP Python benchmarks to 18 additional programming languages, so that neural code generation models can be evaluated across diverse programming paradigms and language features.
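To make "translating unit test-driven benchmarks" concrete, here is a minimal sketch of how one Python assertion might be rewritten as a JavaScript test line. It is not MultiPL-E's actual translator: the function names `translate_assert` and `py_literal_to_js` are hypothetical, the target is assumed to be Node.js with its built-in `assert.deepEqual`, and only a small subset of literals (numbers, strings, booleans, `None`, lists) is handled.

```python
import ast
import json


def py_literal_to_js(node: ast.expr) -> str:
    """Render a small subset of Python literals as JavaScript source (hypothetical helper)."""
    if isinstance(node, ast.Constant):
        value = node.value
        if value is None:
            return "null"
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, (int, float)):
            return repr(value)
        if isinstance(value, str):
            return json.dumps(value)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return "-" + py_literal_to_js(node.operand)
    if isinstance(node, ast.List):
        return "[" + ", ".join(py_literal_to_js(e) for e in node.elts) + "]"
    raise ValueError(f"unsupported literal: {ast.dump(node)}")


def translate_assert(py_assert: str) -> str:
    """Translate `assert f(args...) == expected` into one JavaScript test line."""
    stmt = ast.parse(py_assert).body[0]
    if not isinstance(stmt, ast.Assert):
        raise ValueError("expected an assert statement")
    test = stmt.test
    if not (
        isinstance(test, ast.Compare)
        and len(test.ops) == 1
        and isinstance(test.ops[0], ast.Eq)
    ):
        raise ValueError("expected a single == comparison")
    call = test.left
    if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
        raise ValueError("expected a plain function call on the left-hand side")
    args = ", ".join(py_literal_to_js(a) for a in call.args)
    expected = py_literal_to_js(test.comparators[0])
    return f"assert.deepEqual({call.func.id}({args}), {expected});"


print(translate_assert('assert candidate([1, 2, 3], "a") == True'))
```

Working on the parsed syntax tree rather than on raw strings keeps the value mapping explicit (for example `True` to `true`, `None` to `null`); a real multi-language translator would need one such mapping per target language, plus handling for types this sketch rejects.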
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
Ranking
| Rank | Model | Score |
|---|---|---|
| 1 | Qwen3-235B-A22B-Instruct-2507 | 87.9 |
| 2 | Qwen3 Next 80B A3B Instruct | 87.8 |
| 3 | Kimi K2-Instruct-0905 | 85.7 |
| 4 | Kimi K2 Instruct | 85.7 |
| 5 | Qwen2.5 32B Instruct | 75.4 |
| 6 | Qwen2.5 72B Instruct | 75.1 |
| 7 | Qwen2.5 14B Instruct | 72.8 |
| 8 | Qwen2.5 7B Instruct | 70.4 |
| 9 | Qwen2 72B Instruct | 69.2 |
| 10 | Qwen3 235B A22B | 65.9 |
| 11 | Qwen2.5-Omni-7B | 65.8 |
| 12 | Qwen2 7B Instruct | 59.1 |
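The summary figures at the top of the page can be recomputed from the ranking. The sketch below copies the twelve scores from the table and uses Python's standard `statistics` module. The page does not say how its median is calculated; the assumption here is that it takes the lower of the two middle values (`statistics.median_low`), since that reproduces the reported 72.8, whereas averaging the two middle values gives roughly 73.95.

```python
import statistics

# Scores copied from the ranking table, in rank order.
ranking = [
    ("Qwen3-235B-A22B-Instruct-2507", 87.9),
    ("Qwen3 Next 80B A3B Instruct", 87.8),
    ("Kimi K2-Instruct-0905", 85.7),
    ("Kimi K2 Instruct", 85.7),
    ("Qwen2.5 32B Instruct", 75.4),
    ("Qwen2.5 72B Instruct", 75.1),
    ("Qwen2.5 14B Instruct", 72.8),
    ("Qwen2.5 7B Instruct", 70.4),
    ("Qwen2 72B Instruct", 69.2),
    ("Qwen3 235B A22B", 65.9),
    ("Qwen2.5-Omni-7B", 65.8),
    ("Qwen2 7B Instruct", 59.1),
]

values = [score for _, score in ranking]

model_count = len(values)
top_score = max(values)
# Assumption: the page reports the lower of the two middle values.
median_low = statistics.median_low(values)
# Conventional median for an even count: mean of the two middle values.
median_mean = statistics.median(values)

print(model_count, top_score, median_low)  # 12 87.9 72.8
print(median_mean)
```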