General

IFEval



Models: 41

Top score: 93.9

Median: 87.2

The Instruction-Following Evaluation (IFEval) benchmark measures how well large language models follow verifiable instructions. It covers 25 instruction types across around 500 prompts, each containing one or more verifiable constraints.
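A "verifiable" constraint is one that a program can check without human judgment. The sketch below illustrates the idea with two made-up checks (a minimum word count and an all-lowercase rule) and a simple score that counts a prompt as passed only when every constraint holds. The function names, the specific checks, and the scoring rule are illustrative assumptions, not IFEval's actual implementation.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def min_words(n: int) -> Check:
    """Hypothetical constraint: the response has at least n words."""
    return lambda response: len(response.split()) >= n


def all_lowercase(response: str) -> bool:
    """Hypothetical constraint: the response contains no uppercase letters."""
    return response == response.lower()


def follows_all(response: str, checks: List[Check]) -> bool:
    """True only if the response satisfies every constraint of its prompt."""
    return all(check(response) for check in checks)


def prompt_level_score(responses: List[str], checks_per_prompt: List[List[Check]]) -> float:
    """Assumed scoring rule: percentage of prompts with all constraints satisfied."""
    passed = sum(
        follows_all(response, checks)
        for response, checks in zip(responses, checks_per_prompt)
    )
    return 100.0 * passed / len(responses)


if __name__ == "__main__":
    responses = ["a short lowercase answer", "Too Short"]
    checks = [[min_words(3), all_lowercase], [min_words(3)]]
    print(prompt_level_score(responses, checks))
```

Because each check is a plain function from response text to a boolean, a prompt can carry one or several constraints without changing the scoring code.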

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
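The "best score to date" line is a running maximum over models ordered by release date. A minimal sketch, assuming each point is a (release date, score) pair; the dates and scores in the usage example are placeholders, not values from this leaderboard.

```python
from datetime import date
from typing import List, Tuple

Point = Tuple[date, float]


def best_to_date(points: List[Point]) -> List[Point]:
    """Return, for each release date in order, the best score seen so far."""
    best = float("-inf")
    frontier = []
    for released, score in sorted(points):
        best = max(best, score)
        frontier.append((released, best))
    return frontier


if __name__ == "__main__":
    # Placeholder data for illustration only.
    example = [
        (date(2024, 6, 1), 90.0),
        (date(2024, 1, 1), 80.0),
        (date(2024, 3, 1), 75.0),
    ]
    for released, best in best_to_date(example):
        print(released.isoformat(), best)
```

A model that scores below the current best still appears as a point, but the line stays flat at that date.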

Ranking

Rank	Model	Organization	Score
1	o3-mini	OpenAI	93.9
2	Claude 3.7 Sonnet	Anthropic	93.2
3	Nova Pro	Amazon	92.1
4	Llama 3.3 70B Instruct	Meta	92.1
5	Gemma 3 27B	Google	90.4
6	Nemotron Nano 9B V2	NVIDIA	90.3
7	Gemma 3 4B	Google	90.2
8	Kimi K2-Instruct-0905	Moonshot AI	89.8
9	Kimi K2 Instruct	Moonshot AI	89.8
10	Nova Lite	Amazon	89.7
11	Llama 3.1 Nemotron Ultra 253B v1	NVIDIA	89.5
12	Gemma 3 12B	Google	88.9
13	Qwen3 Next 80B A3B Thinking	Alibaba	88.9
14	Qwen3-235B-A22B-Instruct-2507	Alibaba	88.7
15	Llama 3.1 405B Instruct	Meta	88.6
16	GPT-4.5	OpenAI	88.2
17	Qwen3-235B-A22B-Thinking-2507	Alibaba	87.8
18	Qwen3 Next 80B A3B Instruct	Alibaba	87.6
19	Llama 3.1 70B Instruct	Meta	87.5
20	GPT-4.1	OpenAI	87.4
21	Nova Micro	Amazon	87.2
22	Kimi-k1.5	Moonshot AI	87.2
23	DeepSeek-V3	DeepSeek	86.1
24	Phi 4 Reasoning Plus	Microsoft	84.9
25	Qwen2.5 72B Instruct	Alibaba	84.1
26	GPT-4.1 Mini	OpenAI	84.1
27	QwQ-32B	Alibaba	83.9
28	Phi 4 Reasoning	Microsoft	83.4
29	Mistral Small 3 24B Instruct	Mistral AI	82.9
30	GPT-4o	OpenAI	81
31	Llama 3.1 8B Instruct	Meta	80.4
32	Gemma 3 1B	Google	80.2
33	Llama 3.1 Nemotron Nano 8B V1	NVIDIA	79.3
34	Llama 3.2 3B Instruct	Meta	77.4
35	Granite 3.3 8B Instruct	IBM	74.8
36	Granite 3.3 8B Base	IBM	74.8
37	GPT-4.1 Nano	OpenAI	74.5
38	Qwen2.5 7B Instruct	Alibaba	71.2
39	IBM Granite 4.0 Tiny Preview	IBM	63
40	Phi 4	Microsoft	63
41	Pixtral-12B	Mistral AI	61.3
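The summary figures at the top of the page (41 models, top score 93.9, median 87.2) can be cross-checked against the ranking. A small sketch using only the scores listed above; the variable names are arbitrary.

```python
import statistics

# Scores copied from the ranking, in rank order.
scores = [
    93.9, 93.2, 92.1, 92.1, 90.4, 90.3, 90.2, 89.8, 89.8, 89.7,
    89.5, 88.9, 88.9, 88.7, 88.6, 88.2, 87.8, 87.6, 87.5, 87.4,
    87.2, 87.2, 86.1, 84.9, 84.1, 84.1, 83.9, 83.4, 82.9, 81,
    80.4, 80.2, 79.3, 77.4, 74.8, 74.8, 74.5, 71.2, 63, 63,
    61.3,
]

model_count = len(scores)
top_score = max(scores)
# With an odd number of entries, the median is the middle (21st) score.
median_score = statistics.median(scores)

if __name__ == "__main__":
    print(model_count, top_score, median_score)
```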

Related General benchmarks

Humanity’s Last Exam	360
MMLU-Pro	292
MMLU	92
SimpleQA	26
Arena Hard	21
LiveBench	13