MultiPL-E

Models: 12
Top score: 87.9
Median: 72.8

MultiPL-E is a scalable and extensible system for translating unit test-driven code generation benchmarks to multiple programming languages. It extends the HumanEval and MBPP Python benchmarks to 18 additional programming languages, which makes it possible to evaluate neural code generation models across diverse programming paradigms and language features.
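To make the idea of translating a unit test concrete, here is a toy sketch. It is not MultiPL-E's actual translator: the function name `translate_assert_to_js` is hypothetical, JavaScript is chosen only as an example target, and the sketch handles nothing beyond a single `assert <expr> == <expr>` whose operands happen to be valid in both languages (function calls, numbers, strings, lists).

```python
import ast


def translate_assert_to_js(py_assert: str) -> str:
    """Turn `assert <left> == <right>` into a Node.js assert call (toy example)."""
    node = ast.parse(py_assert).body[0]
    is_simple_equality = (
        isinstance(node, ast.Assert)
        and isinstance(node.test, ast.Compare)
        and len(node.test.ops) == 1
        and isinstance(node.test.ops[0], ast.Eq)
    )
    if not is_simple_equality:
        raise ValueError("only `assert <expr> == <expr>` is supported in this sketch")

    # ast.unparse (Python 3.9+) renders each operand back to source text.
    # Assumption: the operands use syntax shared by Python and JavaScript;
    # Python-only values such as True, None, or tuples are NOT converted here.
    left = ast.unparse(node.test.left)
    right = ast.unparse(node.test.comparators[0])
    return f"assert.deepStrictEqual({left}, {right});"


print(translate_assert_to_js("assert add(1, 2) == 3"))
```

A real translator also has to map language-specific values, types, and function signatures, which this sketch leaves out on purpose.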

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart data (score axis 50 to 90, time axis 2024 to 2025), in release order. Entries marked * are the labeled points on the best-score line.

2024-07-23  Qwen2 7B Instruct  59.1
2024-07-23  Qwen2 72B Instruct  69.2 *
2024-09-19  Qwen2.5 72B Instruct  75.1
2024-09-19  Qwen2.5 14B Instruct  72.8
2024-09-19  Qwen2.5 32B Instruct  75.4 *
2024-10-16  Qwen2.5 7B Instruct  70.4
2025-03-27  Qwen2.5-Omni-7B  65.8
2025-04-28  Qwen3 235B A22B  65.9
2025-07-11  Kimi K2 Instruct  85.7 *
2025-07-22  Qwen3-235B-A22B-Instruct-2507  87.9 *
2025-09-05  Kimi K2-Instruct-0905  85.7
2025-09-11  Qwen3 Next 80B A3B Instruct  87.8
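The "best score to date" line can be recomputed from the chart's data points as a running maximum over release dates. The sketch below assumes that models released on the same day are compared by score, with the highest one taking the record; the names `points` and `best_to_date` are illustrative and not part of the page.

```python
from datetime import date

# (model, score, release date) as listed in the chart
points = [
    ("Qwen2 7B Instruct", 59.1, date(2024, 7, 23)),
    ("Qwen2 72B Instruct", 69.2, date(2024, 7, 23)),
    ("Qwen2.5 72B Instruct", 75.1, date(2024, 9, 19)),
    ("Qwen2.5 14B Instruct", 72.8, date(2024, 9, 19)),
    ("Qwen2.5 32B Instruct", 75.4, date(2024, 9, 19)),
    ("Qwen2.5 7B Instruct", 70.4, date(2024, 10, 16)),
    ("Qwen2.5-Omni-7B", 65.8, date(2025, 3, 27)),
    ("Qwen3 235B A22B", 65.9, date(2025, 4, 28)),
    ("Kimi K2 Instruct", 85.7, date(2025, 7, 11)),
    ("Qwen3-235B-A22B-Instruct-2507", 87.9, date(2025, 7, 22)),
    ("Kimi K2-Instruct-0905", 85.7, date(2025, 9, 5)),
    ("Qwen3 Next 80B A3B Instruct", 87.8, date(2025, 9, 11)),
]


def best_to_date(entries):
    """Return the entries that strictly improved on every earlier score."""
    best = float("-inf")
    records = []
    # Sort by date, then by descending score so same-day ties go to the top scorer.
    for name, score, released in sorted(entries, key=lambda e: (e[2], -e[1])):
        if score > best:
            best = score
            records.append((name, score, released))
    return records


for name, score, released in best_to_date(points):
    print(released.isoformat(), name, score)
```

Under that assumption, the four record-setting models are the same four that carry labels in the chart.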

Ranking
