MBPP

Models: 31
Top score: 91.3
Median: 74.7

MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, a code solution, and three automated test cases. The problems cover programming fundamentals and standard library functionality.
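To make the problem structure concrete, here is a minimal sketch of how one such problem could be represented and checked. The record, its field names (`description`, `solution`, `tests`), and the helper `passes_all_tests` are illustrative assumptions, not the benchmark's actual schema or evaluation harness.

```python
# Illustrative problem record. The task and the field names are made up
# for this sketch; they are not taken from the MBPP dataset.
problem = {
    "description": "Write a function to return the sum of all even numbers in a list.",
    "solution": "def sum_even(nums):\n    return sum(n for n in nums if n % 2 == 0)\n",
    "tests": [
        "assert sum_even([1, 2, 3, 4]) == 6",
        "assert sum_even([]) == 0",
        "assert sum_even([1, 3, 5]) == 0",
    ],
}


def passes_all_tests(candidate_code, tests):
    """Return True if the candidate defines code that satisfies every test."""
    namespace = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        for test in tests:
            exec(test, namespace)         # each test is a bare assert statement
    except Exception:
        # Covers AssertionError as well as syntax or runtime errors.
        return False
    return True


print(passes_all_tests(problem["solution"], problem["tests"]))
```

A candidate counts as solved only if all three tests pass. Running model-generated code with `exec` is unsafe outside a sandbox, so a real harness would isolate execution and enforce a timeout.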

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: MBPP score (axis from 30 to 100) against release date (2024 to 2025).

Models that set a new best score:
Codestral-22B: 78.2 (2024-05-29)
Qwen2 72B Instruct: 80.2 (2024-07-23)
Phi-3.5-MoE-instruct: 80.8 (2024-08-23)
Qwen2.5 72B Instruct: 88.2 (2024-09-19)
Qwen2.5 Coder 32B Instruct: 90.2 (2024-11-11)
Llama-3.3 Nemotron Super 49B v1: 91.3 (2025-03-18)

Other models:
Gemma 2 9B: 52.4 (2024-06-27)
Gemma 2 27B: 62.6 (2024-07-13)
Qwen2 7B Instruct: 67.2 (2024-07-23)
Phi-3.5-mini-instruct: 69.6 (2024-08-23)
Qwen2.5 32B Instruct: 84 (2024-09-19)
Qwen2.5-Coder 7B Instruct: 83.5 (2024-09-19)
Qwen2.5 14B Instruct: 82 (2024-09-19)
Qwen2.5 7B Instruct: 79.2 (2024-10-16)
Mistral Small 3 24B Base: 69.6 (2025-01-30)
Qwen2.5 VL 32B Instruct: 84 (2025-02-28)
Gemma 3 27B: 74.4 (2025-03-12)
Gemma 3 1B: 35.2 (2025-03-12)
Gemma 3 12B: 73 (2025-03-13)
Gemma 3 4B: 63.2 (2025-03-13)
Mistral Small 3.1 24B Instruct: 74.7 (2025-03-17)
Llama 3.1 Nemotron Nano 8B V1: 84.6 (2025-03-18)
Qwen2.5-Omni-7B: 73.2 (2025-03-27)
Llama 4 Maverick: 77.6 (2025-04-05)
Llama 4 Scout: 67.8 (2025-04-05)
Qwen3 235B A22B: 81.4 (2025-04-28)
Gemini Diffusion: 76 (2025-05-20)
Gemma 3n E4B Instructed LiteRT Preview: 63.6 (2025-05-20)
Gemma 3n E2B Instructed LiteRT (Preview): 56.6 (2025-05-20)
Gemma 3n E4B Instructed: 63.6 (2025-06-26)
Gemma 3n E2B Instructed: 56.6 (2025-06-26)
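The best-score-to-date line can be reproduced as a running maximum over models sorted by release date. The sketch below uses a subset of the scores listed on this page; the function name `best_to_date` and the choice to keep only strictly higher scores are assumptions, not the site's actual plotting code.

```python
from datetime import date

# A subset of the data points from this page: (model, release date, score).
releases = [
    ("Codestral-22B", date(2024, 5, 29), 78.2),
    ("Gemma 2 9B", date(2024, 6, 27), 52.4),
    ("Qwen2 72B Instruct", date(2024, 7, 23), 80.2),
    ("Phi-3.5-MoE-instruct", date(2024, 8, 23), 80.8),
    ("Qwen2.5 72B Instruct", date(2024, 9, 19), 88.2),
    ("Qwen2.5 Coder 32B Instruct", date(2024, 11, 11), 90.2),
    ("Gemma 3 27B", date(2025, 3, 12), 74.4),
    ("Llama-3.3 Nemotron Super 49B v1", date(2025, 3, 18), 91.3),
]


def best_to_date(points):
    """Return the points that raised the best score, in release order."""
    frontier = []
    best = float("-inf")
    for name, released, score in sorted(points, key=lambda p: p[1]):
        if score > best:  # assumption: only a strictly higher score counts
            best = score
            frontier.append((name, released, score))
    return frontier


for name, released, score in best_to_date(releases):
    print(f"{released.isoformat()}  {score:5.1f}  {name}")
```

Models that score below the current best, such as Gemma 2 9B and Gemma 3 27B in this subset, appear as points on the chart but do not move the line.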
