HumanEval

Functional correctness on 164 hand-written Python programming problems.
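
Functional correctness means a generated solution counts only if it passes the problem's unit tests. The following is a minimal sketch of that check, not the benchmark's actual harness: the two problems, the `passes_tests` helper, and the assumption that the score is the percentage of problems solved are all hypothetical, and plain `exec` is used here without the sandboxing and timeouts a real evaluation would need.

```python
def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate defines entry_point and every assertion passes."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # defines the generated function
        exec(test_src, namespace)        # defines check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any failure (assertion, syntax error, missing function) counts as a miss.
        return False


# Two hypothetical problems: (generated solution, unit tests, function name).
PROBLEMS = [
    (
        "def add(a, b):\n    return a + b\n",
        "def check(candidate):\n    assert candidate(2, 3) == 5\n",
        "add",
    ),
    (
        "def is_even(n):\n    return n % 2 == 1\n",  # deliberately wrong
        "def check(candidate):\n    assert candidate(4) is True\n",
        "is_even",
    ),
]

results = [passes_tests(src, tests, name) for src, tests, name in PROBLEMS]
score = 100 * sum(results) / len(results)  # percentage of problems solved
```

Here the first solution passes and the second fails, so this toy run would score 50.0 on a two-problem set.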

Models: 68
Top score: 95
Median: 85.4
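
The three summary figures can be recomputed from the scores in the ranking on this page. A short sketch using Python's standard `statistics` module, with the 68 scores copied from the ranking (the variable names are illustrative):

```python
from statistics import median

# Scores copied from the ranking, highest to lowest.
scores = [
    95, 94.5, 94, 93.7, 93.4, 93.3, 92.9, 92.7, 92.4, 92.3,
    92, 91.5, 90.2, 89.7, 89.7, 89.6, 89, 89, 89, 88.4,
    88.4, 88.4, 88.4, 88.4, 88.1, 88.1, 88, 87.8, 87.2, 87.1,
    86.6, 86, 85.7, 85.4, 85.4, 84.9, 84.8, 84.8, 84.1, 83.5,
    82.8, 82.4, 81.1, 81.1, 80.5, 79.9, 78.7, 75.9, 75, 75,
    74.3, 74.1, 73, 72.6, 72, 71.3, 71.2, 70.7, 68, 67,
    66.5, 66.5, 62.8, 51.8, 45, 41.5, 40.2, 34.8,
]

model_count = len(scores)
top_score = max(scores)
# With an even number of entries, median() averages the two middle values
# (ranks 34 and 35, both 85.4).
median_score = median(scores)
```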

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: HumanEval score (y-axis, ticks at 100, 83, 65, 48, 30) against release date (x-axis, 2023 to 2026).]

Labelled points (each set a new best score at release):

GPT-3.5 Turbo: 68 (2023-03-01)
Claude 2: 71.2 (2023-07-11)
GPT-4 Turbo: 87.1 (2023-11-06)
DeepSeek-V2.5: 89 (2024-05-08)
GPT-4o: 90.2 (2024-05-13)
Claude 3.5 Sonnet: 93.7 (2024-06-20)
GPT-4.1: 94 (2025-04-14)
Kimi K2 0905: 94.5 (2025-09-05)
Claude Opus 4.6: 95 (2026-02-05)

Other points:

GPT-4: 67 (2023-03-14)
Gemini 1.5 Pro: 84.1 (2024-02-15)
Claude 3 Sonnet: 73 (2024-02-29)
Claude 3 Opus: 84.9 (2024-03-04)
Claude 3 Haiku: 75.9 (2024-03-13)
Grok-1.5: 74.1 (2024-03-28)
Mixtral 8x22B: 45 (2024-04-10)
Gemini 1.5 Flash: 74.3 (2024-05-01)
Codestral-22B: 81.1 (2024-05-29)
Gemma 2 9B: 40.2 (2024-06-27)
Gemma 2 27B: 51.8 (2024-07-13)
GPT-4o-mini: 87.2 (2024-07-18)
Llama 3.1 405B Instruct: 89 (2024-07-23)
Qwen2 72B Instruct: 86 (2024-07-23)
Llama 3.1 70B Instruct: 80.5 (2024-07-23)
Qwen2 7B Instruct: 79.9 (2024-07-23)
Llama 3.1 8B Instruct: 72.6 (2024-07-23)
Mistral Large 2: 92 (2024-07-24)
Grok-2: 88.4 (2024-08-13)
Grok-2 mini: 85.7 (2024-08-13)
Phi-3.5-MoE-instruct: 70.7 (2024-08-23)
Phi-3.5-mini-instruct: 62.8 (2024-08-23)
o1-mini: 92.4 (2024-09-12)
Pixtral-12B: 72 (2024-09-17)
Qwen2.5-Coder 7B Instruct: 88.4 (2024-09-19)
Qwen2.5 32B Instruct: 88.4 (2024-09-19)
Qwen2.5 72B Instruct: 86.6 (2024-09-19)
Qwen2.5 14B Instruct: 83.5 (2024-09-19)
Qwen2.5 7B Instruct: 84.8 (2024-10-16)
Ministral 8B Instruct: 34.8 (2024-10-16)
Claude 3.5 Haiku: 88.1 (2024-11-04)
Qwen2.5 Coder 32B Instruct: 92.7 (2024-11-11)
Nova Pro: 89 (2024-11-20)
Nova Lite: 85.4 (2024-11-20)
Nova Micro: 81.1 (2024-11-20)
o1: 88.1 (2024-12-05)
Llama 3.3 70B Instruct: 88.4 (2024-12-06)
Phi 4: 82.8 (2025-01-10)
Mistral Small 3 24B Instruct: 84.8 (2025-01-30)
GPT-4.5: 88 (2025-02-27)
Qwen2.5 VL 32B Instruct: 91.5 (2025-02-28)
Gemma 3 27B: 87.8 (2025-03-12)
Gemma 3 1B: 41.5 (2025-03-12)
Gemma 3 12B: 85.4 (2025-03-13)
Gemma 3 4B: 71.3 (2025-03-13)
Mistral Small 3.1 24B Instruct: 88.4 (2025-03-17)
Qwen2.5-Omni-7B: 78.7 (2025-03-27)
Granite 3.3 8B Instruct: 89.7 (2025-04-16)
Granite 3.3 8B Base: 89.7 (2025-04-16)
Phi 4 Reasoning: 92.9 (2025-04-30)
Phi 4 Reasoning Plus: 92.3 (2025-04-30)
IBM Granite 4.0 Tiny Preview: 82.4 (2025-05-02)
Gemini Diffusion: 89.6 (2025-05-20)
Gemma 3n E4B Instructed LiteRT Preview: 75 (2025-05-20)
Gemma 3n E2B Instructed LiteRT (Preview): 66.5 (2025-05-20)
Gemma 3n E4B Instructed: 75 (2025-06-26)
Gemma 3n E2B Instructed: 66.5 (2025-06-26)
Kimi K2 Instruct: 93.3 (2025-07-11)
GPT-5: 93.4 (2025-08-07)
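
The "best score to date" line in the chart is a running maximum over models ordered by release date. A minimal sketch using five of the points from the chart; the `best_to_date` helper is illustrative, and the choice that a tie does not count as a new best is an assumption, not something the page states.

```python
from datetime import date

# (model, release date, score): a subset of the chart's points, deliberately unsorted.
points = [
    ("GPT-4", date(2023, 3, 14), 67.0),
    ("GPT-4 Turbo", date(2023, 11, 6), 87.1),
    ("GPT-3.5 Turbo", date(2023, 3, 1), 68.0),
    ("Gemini 1.5 Pro", date(2024, 2, 15), 84.1),
    ("Claude 2", date(2023, 7, 11), 71.2),
]


def best_to_date(points):
    """Return the points that raised the best score, in release order."""
    frontier = []
    best = float("-inf")
    for name, released, score in sorted(points, key=lambda p: p[1]):
        if score > best:  # assumption: a tie does not count as a new best
            best = score
            frontier.append((name, released, score))
    return frontier


frontier_names = [name for name, _, _ in best_to_date(points)]
```

On this subset the frontier is GPT-3.5 Turbo, Claude 2, and GPT-4 Turbo; GPT-4 and Gemini 1.5 Pro fall below the best score that existed when they were released.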

Ranking

1. Claude Opus 4.6: 95
2. Kimi K2 0905: 94.5
3. GPT-4.1: 94
4. Claude 3.5 Sonnet: 93.7
5. GPT-5: 93.4
6. Kimi K2 Instruct: 93.3
7. Phi 4 Reasoning: 92.9
8. Qwen2.5 Coder 32B Instruct: 92.7
9. o1-mini: 92.4
10. Phi 4 Reasoning Plus: 92.3
11. Mistral Large 2: 92
12. Qwen2.5 VL 32B Instruct: 91.5
13. GPT-4o: 90.2
14. Granite 3.3 8B Instruct: 89.7
15. Granite 3.3 8B Base: 89.7
16. Gemini Diffusion: 89.6
17. Nova Pro: 89
18. Llama 3.1 405B Instruct: 89
19. DeepSeek-V2.5: 89
20. Qwen2.5-Coder 7B Instruct: 88.4
21. Qwen2.5 32B Instruct: 88.4
22. Mistral Small 3.1 24B Instruct: 88.4
23. Llama 3.3 70B Instruct: 88.4
24. Grok-2: 88.4
25. Claude 3.5 Haiku: 88.1
26. o1: 88.1
27. GPT-4.5: 88
28. Gemma 3 27B: 87.8
29. GPT-4o-mini: 87.2
30. GPT-4 Turbo: 87.1
31. Qwen2.5 72B Instruct: 86.6
32. Qwen2 72B Instruct: 86
33. Grok-2 mini: 85.7
34. Nova Lite: 85.4
35. Gemma 3 12B: 85.4
36. Claude 3 Opus: 84.9
37. Mistral Small 3 24B Instruct: 84.8
38. Qwen2.5 7B Instruct: 84.8
39. Gemini 1.5 Pro: 84.1
40. Qwen2.5 14B Instruct: 83.5
41. Phi 4: 82.8
42. IBM Granite 4.0 Tiny Preview: 82.4
43. Nova Micro: 81.1
44. Codestral-22B: 81.1
45. Llama 3.1 70B Instruct: 80.5
46. Qwen2 7B Instruct: 79.9
47. Qwen2.5-Omni-7B: 78.7
48. Claude 3 Haiku: 75.9
49. Gemma 3n E4B Instructed LiteRT Preview: 75
50. Gemma 3n E4B Instructed: 75
51. Gemini 1.5 Flash: 74.3
52. Grok-1.5: 74.1
53. Claude 3 Sonnet: 73
54. Llama 3.1 8B Instruct: 72.6
55. Pixtral-12B: 72
56. Gemma 3 4B: 71.3
57. Claude 2: 71.2
58. Phi-3.5-MoE-instruct: 70.7
59. GPT-3.5 Turbo: 68
60. GPT-4: 67
61. Gemma 3n E2B Instructed LiteRT (Preview): 66.5
62. Gemma 3n E2B Instructed: 66.5
63. Phi-3.5-mini-instruct: 62.8
64. Gemma 2 27B: 51.8
65. Mixtral 8x22B: 45
66. Gemma 3 1B: 41.5
67. Gemma 2 9B: 40.2
68. Ministral 8B Instruct: 34.8

Related Coding benchmarks