MBPP
31 Models
91.3 Top score
74.7 Median
MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, a reference code solution, and three automated test cases. The problems cover programming fundamentals and standard library functionality.
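To make the problem format concrete, here is a minimal sketch of how a candidate solution might be checked against a problem's three test cases. The record below is hypothetical (it is not an actual MBPP item), and the field names `text`, `code`, and `test_list` as well as the helper `passes_all_tests` are illustrative assumptions, not a description of any particular evaluation harness.

```python
# Hypothetical MBPP-style record; field names and contents are illustrative.
problem = {
    "text": "Write a function to find the maximum of two numbers.",
    "code": "def max_of_two(a, b):\n    return a if a >= b else b\n",
    "test_list": [
        "assert max_of_two(1, 2) == 2",
        "assert max_of_two(5, 3) == 5",
        "assert max_of_two(-1, -1) == -1",
    ],
}


def passes_all_tests(candidate_code: str, tests: list[str]) -> bool:
    """Return True if the candidate code satisfies every assert statement."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        for test in tests:
            exec(test, namespace)  # each test is a bare assert statement
    except Exception:
        # Any failed assert, syntax error, or runtime error counts as a failure.
        return False
    return True


# The reference solution should pass; a deliberately wrong candidate should not.
reference_ok = passes_all_tests(problem["code"], problem["test_list"])
wrong_ok = passes_all_tests(
    "def max_of_two(a, b):\n    return a\n", problem["test_list"]
)
```

A real harness would typically run model-generated code in a sandbox with a timeout rather than calling `exec` in the current process.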
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
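The "best score to date" line described above amounts to a running maximum over scores ordered by release date. A small sketch, using made-up dates and scores (release dates are not listed on this page, so none of these points correspond to real models):

```python
from datetime import date
from itertools import accumulate

# Hypothetical (release_date, score) points; not taken from the leaderboard.
points = [
    (date(2024, 6, 1), 70.0),
    (date(2024, 1, 1), 60.0),
    (date(2024, 9, 1), 65.0),
    (date(2025, 3, 1), 80.0),
]

# Order the points chronologically, then carry the maximum forward.
points.sort(key=lambda p: p[0])
scores = [score for _, score in points]
best_to_date = list(accumulate(scores, max))
```

With these sample points the running best is 60.0, 70.0, 70.0, 80.0: the third model scores below the earlier best, so the line stays flat at that date.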
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Llama-3.3 Nemotron Super 49B v1 | 91.3 |
| 2 | Qwen2.5 Coder 32B Instruct | 90.2 |
| 3 | Qwen2.5 72B Instruct | 88.2 |
| 4 | Llama 3.1 Nemotron Nano 8B V1 | 84.6 |
| 5 | Qwen2.5 VL 32B Instruct | 84 |
| 6 | Qwen2.5 32B Instruct | 84 |
| 7 | Qwen2.5-Coder 7B Instruct | 83.5 |
| 8 | Qwen2.5 14B Instruct | 82 |
| 9 | Qwen3 235B A22B | 81.4 |
| 10 | Phi-3.5-MoE-instruct | 80.8 |
| 11 | Qwen2 72B Instruct | 80.2 |
| 12 | Qwen2.5 7B Instruct | 79.2 |
| 13 | Codestral-22B | 78.2 |
| 14 | Llama 4 Maverick | 77.6 |
| 15 | Gemini Diffusion | 76 |
| 16 | Mistral Small 3.1 24B Instruct | 74.7 |
| 17 | Gemma 3 27B | 74.4 |
| 18 | Qwen2.5-Omni-7B | 73.2 |
| 19 | Gemma 3 12B | 73 |
| 20 | Phi-3.5-mini-instruct | 69.6 |
| 21 | Mistral Small 3 24B Base | 69.6 |
| 22 | Llama 4 Scout | 67.8 |
| 23 | Qwen2 7B Instruct | 67.2 |
| 24 | Gemma 3n E4B Instructed LiteRT Preview | 63.6 |
| 25 | Gemma 3n E4B Instructed | 63.6 |
| 26 | Gemma 3 4B | 63.2 |
| 27 | Gemma 2 27B | 62.6 |
| 28 | Gemma 3n E2B Instructed LiteRT (Preview) | 56.6 |
| 29 | Gemma 3n E2B Instructed | 56.6 |
| 30 | Gemma 2 9B | 52.4 |
| 31 | Gemma 3 1B | 35.2 |
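The summary figures at the top of the page can be reproduced from the ranking. The sketch below copies the 31 scores from the table in rank order and recomputes the model count, top score, and median; the variable names are illustrative.

```python
from statistics import median

# Scores copied from the ranking table, in rank order.
scores = [
    91.3, 90.2, 88.2, 84.6, 84, 84, 83.5, 82, 81.4, 80.8,
    80.2, 79.2, 78.2, 77.6, 76, 74.7, 74.4, 73.2, 73, 69.6,
    69.6, 67.8, 67.2, 63.6, 63.6, 63.2, 62.6, 56.6, 56.6, 52.4,
    35.2,
]

model_count = len(scores)      # number of ranked models
top_score = max(scores)        # best score on the leaderboard
median_score = median(scores)  # middle value (16th of 31 entries)

print(model_count, top_score, median_score)
```

With an odd number of entries the median is simply the 16th-ranked score, which is why it matches the Mistral Small 3.1 24B Instruct row.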