Coding

MBPP

MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers.


Models: 31

Top score: 91.3

Median: 74.7

Each MBPP problem consists of a task description, a reference code solution, and 3 automated test cases. The problems cover programming fundamentals and standard library functionality.
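The problem structure described above (task description, reference solution, three test cases) can be sketched as follows. This is a minimal illustration, not the benchmark's official harness: the record, its field names (`text`, `code`, `test_list`), and the helper `passes_all_tests` are assumptions made up for this example.

```python
# Hypothetical MBPP-style record. The task, the field names, and the tests
# are illustrative assumptions, not an entry copied from the benchmark.
problem = {
    "text": "Write a function to return the sum of the squares of a list of numbers.",
    "code": "def sum_of_squares(nums):\n    return sum(n * n for n in nums)\n",
    "test_list": [
        "assert sum_of_squares([1, 2, 3]) == 14",
        "assert sum_of_squares([]) == 0",
        "assert sum_of_squares([-2, 2]) == 8",
    ],
}


def passes_all_tests(candidate_source: str, tests: list[str]) -> bool:
    """Return True only if the candidate runs and every assert-style test passes."""
    namespace: dict = {}
    try:
        # Define the candidate function in a fresh namespace.
        exec(candidate_source, namespace)
        # Run each test against that namespace; a failed assert raises.
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True


# The reference solution should pass its own tests.
print(passes_all_tests(problem["code"], problem["test_list"]))
```

Because `exec` runs arbitrary code, a real evaluation of model-generated solutions would normally happen in a sandboxed process with a timeout; that part is omitted here.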

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
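The "best score to date" line is a running maximum over scores ordered by release date. A minimal sketch, assuming made-up scores (the values below are placeholders, not taken from the ranking, since release dates are not listed on this page):

```python
from itertools import accumulate

# Hypothetical scores, already sorted by release date (oldest first).
# These numbers are placeholders for illustration only.
scores_by_release = [60.0, 72.5, 70.1, 80.3, 78.9]

# Running maximum: at each release, the best score seen so far.
best_to_date = list(accumulate(scores_by_release, max))

print(best_to_date)  # [60.0, 72.5, 72.5, 80.3, 80.3]
```

The line only moves up when a new release beats every earlier one; otherwise it stays flat.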

Ranking

Rank	Model	Organization	Score
1	Llama-3.3 Nemotron Super 49B v1	NVIDIA	91.3
2	Qwen2.5 Coder 32B Instruct	Alibaba	90.2
3	Qwen2.5 72B Instruct	Alibaba	88.2
4	Llama 3.1 Nemotron Nano 8B V1	NVIDIA	84.6
5	Qwen2.5 VL 32B Instruct	Alibaba	84
6	Qwen2.5 32B Instruct	Alibaba	84
7	Qwen2.5-Coder 7B Instruct	Alibaba	83.5
8	Qwen2.5 14B Instruct	Alibaba	82
9	Qwen3 235B A22B	Alibaba	81.4
10	Phi-3.5-MoE-instruct	Microsoft	80.8
11	Qwen2 72B Instruct	Alibaba	80.2
12	Qwen2.5 7B Instruct	Alibaba	79.2
13	Codestral-22B	Mistral AI	78.2
14	Llama 4 Maverick	Meta	77.6
15	Gemini Diffusion	Google	76
16	Mistral Small 3.1 24B Instruct	Mistral AI	74.7
17	Gemma 3 27B	Google	74.4
18	Qwen2.5-Omni-7B	Alibaba	73.2
19	Gemma 3 12B	Google	73
20	Phi-3.5-mini-instruct	Microsoft	69.6
21	Mistral Small 3 24B Base	Mistral AI	69.6
22	Llama 4 Scout	Meta	67.8
23	Qwen2 7B Instruct	Alibaba	67.2
24	Gemma 3n E4B Instructed LiteRT Preview	Google	63.6
25	Gemma 3n E4B Instructed	Google	63.6
26	Gemma 3 4B	Google	63.2
27	Gemma 2 27B	Google	62.6
28	Gemma 3n E2B Instructed LiteRT (Preview)	Google	56.6
29	Gemma 3n E2B Instructed	Google	56.6
30	Gemma 2 9B	Google	52.4
31	Gemma 3 1B	Google	35.2
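The summary figures near the top of the page (31 models, top score 91.3, median 74.7) can be checked against the ranking. A short sketch using only the scores listed above; the variable names are arbitrary:

```python
from statistics import median

# Scores copied from the ranking, in rank order.
scores = [
    91.3, 90.2, 88.2, 84.6, 84, 84, 83.5, 82, 81.4, 80.8,
    80.2, 79.2, 78.2, 77.6, 76, 74.7, 74.4, 73.2, 73, 69.6,
    69.6, 67.8, 67.2, 63.6, 63.6, 63.2, 62.6, 56.6, 56.6, 52.4,
    35.2,
]

model_count = len(scores)      # number of ranked models
top_score = max(scores)        # best score in the ranking
median_score = median(scores)  # middle value of the sorted scores

print(model_count, top_score, median_score)  # 31 91.3 74.7
```

With 31 entries the median is the 16th score, which is the 74.7 of Mistral Small 3.1 24B Instruct.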

Related Coding benchmarks

LiveCodeBench (282), HumanEval (68), SWE-bench Verified (51), Aider Polyglot (21), Terminal-Bench (15), MultiPL-E (12)