Coding

HumanEval

Functional correctness on 164 hand-written Python programming problems.
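
As a rough illustration of what "functional correctness" means here, the sketch below runs a candidate solution against a problem's unit tests and counts the problem as solved only if every assertion passes. It assumes the score is the percentage of problems solved. The helper names (`passes_tests`, `pass_rate`) and the toy `add` problem are hypothetical, not part of the benchmark's own harness, and a real evaluation should run untrusted model output in a sandbox with timeouts rather than calling `exec` directly.

```python
def passes_tests(candidate_source: str, test_source: str, entry_point: str) -> bool:
    """Return True if the candidate runs and every assertion in the tests passes."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # defines the candidate function
        exec(test_source, namespace)       # defines check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as unsolved.
        return False
    return True


def pass_rate(results: list[bool]) -> float:
    """Percentage of problems solved."""
    return 100.0 * sum(results) / len(results)


# Toy problem (hypothetical, for illustration only).
tests = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(0, 0) == 0\n"
)
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

results = [passes_tests(good, tests, "add"), passes_tests(bad, tests, "add")]
print(pass_rate(results))
```

Under this reading, one solved problem out of 164 moves the score by about 0.6 points, which may explain why many entries in the ranking tie at the same value.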


Models: 68

Top score: 95

Median: 85.4

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
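
The "best score to date" line is a running maximum over release dates. A minimal sketch of that computation follows; the function name `best_to_date` is an assumption, and the dates and scores are made-up placeholders, not values from this page.

```python
from datetime import date


def best_to_date(points: list[tuple[date, float]]) -> list[tuple[date, float]]:
    """For each release date, in order, return the highest score seen so far."""
    frontier = []
    best = float("-inf")
    for released, score in sorted(points, key=lambda p: p[0]):
        best = max(best, score)
        frontier.append((released, best))
    return frontier


# Placeholder data (hypothetical dates and scores, for illustration only).
points = [
    (date(2024, 1, 1), 80.0),
    (date(2023, 1, 1), 60.0),
    (date(2024, 3, 1), 75.0),
    (date(2023, 6, 1), 55.0),
]

frontier = best_to_date(points)
for released, best in frontier:
    print(released.isoformat(), best)
```

A model that scores below the current best still appears as a point but leaves the line flat.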

Ranking

Rank	Model	Organization	Score
1	Claude Opus 4.6	Anthropic	95
2	Kimi K2 0905	Moonshot AI	94.5
3	GPT-4.1	OpenAI	94
4	Claude 3.5 Sonnet	Anthropic	93.7
5	GPT-5	OpenAI	93.4
6	Kimi K2 Instruct	Moonshot AI	93.3
7	Phi 4 Reasoning	Microsoft	92.9
8	Qwen2.5 Coder 32B Instruct	Alibaba	92.7
9	o1-mini	OpenAI	92.4
10	Phi 4 Reasoning Plus	Microsoft	92.3
11	Mistral Large 2	Mistral AI	92
12	Qwen2.5 VL 32B Instruct	Alibaba	91.5
13	GPT-4o	OpenAI	90.2
14	Granite 3.3 8B Instruct	IBM	89.7
15	Granite 3.3 8B Base	IBM	89.7
16	Gemini Diffusion	Google	89.6
17	Nova Pro	Amazon	89
18	Llama 3.1 405B Instruct	Meta	89
19	DeepSeek-V2.5	DeepSeek	89
20	Qwen2.5-Coder 7B Instruct	Alibaba	88.4
21	Qwen2.5 32B Instruct	Alibaba	88.4
22	Mistral Small 3.1 24B Instruct	Mistral AI	88.4
23	Llama 3.3 70B Instruct	Meta	88.4
24	Grok-2	xAI	88.4
25	Claude 3.5 Haiku	Anthropic	88.1
26	o1	OpenAI	88.1
27	GPT-4.5	OpenAI	88
28	Gemma 3 27B	Google	87.8
29	GPT-4o-mini	OpenAI	87.2
30	GPT-4 Turbo	OpenAI	87.1
31	Qwen2.5 72B Instruct	Alibaba	86.6
32	Qwen2 72B Instruct	Alibaba	86
33	Grok-2 mini	xAI	85.7
34	Nova Lite	Amazon	85.4
35	Gemma 3 12B	Google	85.4
36	Claude 3 Opus	Anthropic	84.9
37	Mistral Small 3 24B Instruct	Mistral AI	84.8
38	Qwen2.5 7B Instruct	Alibaba	84.8
39	Gemini 1.5 Pro	Google	84.1
40	Qwen2.5 14B Instruct	Alibaba	83.5
41	Phi 4	Microsoft	82.8
42	IBM Granite 4.0 Tiny Preview	IBM	82.4
43	Nova Micro	Amazon	81.1
44	Codestral-22B	Mistral AI	81.1
45	Llama 3.1 70B Instruct	Meta	80.5
46	Qwen2 7B Instruct	Alibaba	79.9
47	Qwen2.5-Omni-7B	Alibaba	78.7
48	Claude 3 Haiku	Anthropic	75.9
49	Gemma 3n E4B Instructed LiteRT Preview	Google	75
50	Gemma 3n E4B Instructed	Google	75
51	Gemini 1.5 Flash	Google	74.3
52	Grok-1.5	xAI	74.1
53	Claude 3 Sonnet	Anthropic	73
54	Llama 3.1 8B Instruct	Meta	72.6
55	Pixtral-12B	Mistral AI	72
56	Gemma 3 4B	Google	71.3
57	Claude 2	Anthropic	71.2
58	Phi-3.5-MoE-instruct	Microsoft	70.7
59	GPT-3.5 Turbo	OpenAI	68
60	GPT-4	OpenAI	67
61	Gemma 3n E2B Instructed LiteRT (Preview)	Google	66.5
62	Gemma 3n E2B Instructed	Google	66.5
63	Phi-3.5-mini-instruct	Microsoft	62.8
64	Gemma 2 27B	Google	51.8
65	Mixtral 8x22B	Mistral AI	45
66	Gemma 3 1B	Google	41.5
67	Gemma 2 9B	Google	40.2
68	Ministral 8B Instruct	Mistral AI	34.8

Related Coding benchmarks

LiveCodeBench (282), SWE-bench Verified (51), MBPP (31), Aider Polyglot (21), Terminal-Bench (15), MultiPL-E (12)