
IFEval


Models: 41
Top score: 93.9
Median: 87.2

IFEval (Instruction-Following Evaluation) is a benchmark for large language models that focuses on verifiable instructions. It covers 25 instruction types across roughly 500 prompts, each containing one or more verifiable constraints.
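To make "verifiable" concrete: a constraint counts as verifiable when a small deterministic function can decide whether a response satisfies it, with no human or model judge involved. The sketch below is illustrative only. The two constraint types (minimum word count, minimum keyword frequency) and every function name are assumptions chosen for the example, not taken from the IFEval code.

```python
import re
from typing import Callable, Iterable


def check_min_words(response: str, min_words: int) -> bool:
    """Hypothetical check: the response has at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words


def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """Hypothetical check: `keyword` appears as a whole word at least `min_count` times."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """A prompt with several constraints is satisfied only if every check passes."""
    return all(check(response) for check in checks)


response = "AI systems follow rules. AI is tested. AI improves."
checks = [
    lambda r: check_min_words(r, 5),
    lambda r: check_keyword_frequency(r, "AI", 3),
]
print(follows_all(response, checks))
```

Because each check is a pure function of the response text, the same response always receives the same verdict, which is what makes this kind of score reproducible across runs.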

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: IFEval score (y-axis, 50 to 100) against model release date (x-axis, 2024 to 2025). Labeled points: GPT-4o, Llama 3.1 405B Instruct, Nova Pro, o3-mini.]

GPT-4o: 81 (2024-05-13)
Llama 3.1 405B Instruct: 88.6 (2024-07-23)
Llama 3.1 70B Instruct: 87.5 (2024-07-23)
Llama 3.1 8B Instruct: 80.4 (2024-07-23)
Pixtral-12B: 61.3 (2024-09-17)
Qwen2.5 72B Instruct: 84.1 (2024-09-19)
Llama 3.2 3B Instruct: 77.4 (2024-09-25)
Qwen2.5 7B Instruct: 71.2 (2024-10-16)
Nova Pro: 92.1 (2024-11-20)
Nova Lite: 89.7 (2024-11-20)
Nova Micro: 87.2 (2024-11-20)
Llama 3.3 70B Instruct: 92.1 (2024-12-06)
DeepSeek-V3: 86.1 (2024-12-26)
Phi 4: 63 (2025-01-10)
Kimi-k1.5: 87.2 (2025-01-20)
Mistral Small 3 24B Instruct: 82.9 (2025-01-30)
o3-mini: 93.9 (2025-01-31)
Claude 3.7 Sonnet: 93.2 (2025-02-24)
GPT-4.5: 88.2 (2025-02-27)
QwQ-32B: 83.9 (2025-03-05)
Gemma 3 27B: 90.4 (2025-03-12)
Gemma 3 1B: 80.2 (2025-03-12)
Gemma 3 4B: 90.2 (2025-03-13)
Gemma 3 12B: 88.9 (2025-03-13)
Llama 3.1 Nemotron Nano 8B V1: 79.3 (2025-03-18)
Llama 3.1 Nemotron Ultra 253B v1: 89.5 (2025-04-07)
GPT-4.1: 87.4 (2025-04-14)
GPT-4.1 Mini: 84.1 (2025-04-14)
GPT-4.1 Nano: 74.5 (2025-04-14)
Granite 3.3 8B Instruct: 74.8 (2025-04-16)
Granite 3.3 8B Base: 74.8 (2025-04-16)
Phi 4 Reasoning Plus: 84.9 (2025-04-30)
Phi 4 Reasoning: 83.4 (2025-04-30)
IBM Granite 4.0 Tiny Preview: 63 (2025-05-02)
Kimi K2 Instruct: 89.8 (2025-07-11)
Qwen3-235B-A22B-Instruct-2507: 88.7 (2025-07-22)
Qwen3-235B-A22B-Thinking-2507: 87.8 (2025-07-25)
Nemotron Nano 9B V2: 90.3 (2025-09-05)
Kimi K2-Instruct-0905: 89.8 (2025-09-05)
Qwen3 Next 80B A3B Thinking: 88.9 (2025-09-11)
Qwen3 Next 80B A3B Instruct: 87.6 (2025-09-11)
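The "best score to date" line can be read as a running maximum over release dates. The sketch below shows one way to compute it from a handful of the data points on this page. The function name `best_to_date`, the tuple layout, and the tie-breaking rule (a later model must strictly exceed the current best to count) are assumptions for illustration, not the page's actual implementation.

```python
from datetime import date


def best_to_date(points):
    """Return the points that set a new best score, in release order.

    `points` is an iterable of (name, score, release_date) tuples.
    Models released on the same day are visited highest score first,
    so only the day's top scorer can set a new best.
    """
    best = float("-inf")
    frontier = []
    for name, score, released in sorted(points, key=lambda p: (p[2], -p[1])):
        if score > best:  # strict: ties with the current best do not count
            best = score
            frontier.append((name, score, released))
    return frontier


# A subset of the scores and release dates listed on this page.
points = [
    ("GPT-4o", 81.0, date(2024, 5, 13)),
    ("Llama 3.1 70B Instruct", 87.5, date(2024, 7, 23)),
    ("Llama 3.1 405B Instruct", 88.6, date(2024, 7, 23)),
    ("Nova Pro", 92.1, date(2024, 11, 20)),
    ("Llama 3.3 70B Instruct", 92.1, date(2024, 12, 6)),
    ("o3-mini", 93.9, date(2025, 1, 31)),
    ("Claude 3.7 Sonnet", 93.2, date(2025, 2, 24)),
]

for name, score, released in best_to_date(points):
    print(f"{released}  {score:5.1f}  {name}")
```

Under these assumptions the frontier for this subset is GPT-4o, Llama 3.1 405B Instruct, Nova Pro, and o3-mini. Llama 3.3 70B Instruct only ties the earlier 92.1, and Claude 3.7 Sonnet arrives after o3-mini with a lower score, so neither extends the line.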

Ranking

1. o3-mini: 93.9
2. Claude 3.7 Sonnet: 93.2
3. Nova Pro: 92.1
4. Llama 3.3 70B Instruct: 92.1
5. Gemma 3 27B: 90.4
6. Nemotron Nano 9B V2: 90.3
7. Gemma 3 4B: 90.2
8. Kimi K2-Instruct-0905: 89.8
9. Kimi K2 Instruct: 89.8
10. Nova Lite: 89.7
11. Llama 3.1 Nemotron Ultra 253B v1: 89.5
12. Gemma 3 12B: 88.9
13. Qwen3 Next 80B A3B Thinking: 88.9
14. Qwen3-235B-A22B-Instruct-2507: 88.7
15. Llama 3.1 405B Instruct: 88.6
16. GPT-4.5: 88.2
17. Qwen3-235B-A22B-Thinking-2507: 87.8
18. Qwen3 Next 80B A3B Instruct: 87.6
19. Llama 3.1 70B Instruct: 87.5
20. GPT-4.1: 87.4
21. Nova Micro: 87.2
22. Kimi-k1.5: 87.2
23. DeepSeek-V3: 86.1
24. Phi 4 Reasoning Plus: 84.9
25. Qwen2.5 72B Instruct: 84.1
26. GPT-4.1 Mini: 84.1
27. QwQ-32B: 83.9
28. Phi 4 Reasoning: 83.4
29. Mistral Small 3 24B Instruct: 82.9
30. GPT-4o: 81
31. Llama 3.1 8B Instruct: 80.4
32. Gemma 3 1B: 80.2
33. Llama 3.1 Nemotron Nano 8B V1: 79.3
34. Llama 3.2 3B Instruct: 77.4
35. Granite 3.3 8B Instruct: 74.8
36. Granite 3.3 8B Base: 74.8
37. GPT-4.1 Nano: 74.5
38. Qwen2.5 7B Instruct: 71.2
39. IBM Granite 4.0 Tiny Preview: 63
40. Phi 4: 63
41. Pixtral-12B: 61.3
