IFEval
Models: 41
Top score: 93.9
Median: 87.2
Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions, meaning constraints whose satisfaction can be checked programmatically. It covers 25 instruction types across around 500 prompts, each containing one or more verifiable constraints.
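To make "verifiable constraint" concrete, here is a minimal sketch of how such checks might look in code. It is not the benchmark's actual implementation: the function names (`check_min_words`, `check_keyword_frequency`, `check_all_lowercase`, `follows_all`) and the sample response are hypothetical, chosen only to illustrate constraints of the kind "write at least N words" or "mention a keyword at least K times".

```python
import re
from typing import Callable, Iterable


def check_min_words(response: str, min_words: int) -> bool:
    """Return True if the response has at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words


def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """Return True if `keyword` occurs at least `min_count` times as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def check_all_lowercase(response: str) -> bool:
    """Return True if the response contains no uppercase letters."""
    return response == response.lower()


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """A prompt with several constraints counts as followed only if every check passes."""
    return all(check(response) for check in checks)


# Hypothetical response to a prompt carrying three constraints.
response = "ai tools help writers. ai also helps editors."
checks = [
    lambda r: check_min_words(r, 5),
    lambda r: check_keyword_frequency(r, "AI", 2),
    check_all_lowercase,
]
print(follows_all(response, checks))
```

Because each check is a deterministic function of the response text, scoring needs no human or model judge, which is the property the benchmark description emphasizes.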
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | o3-mini | 93.9 |
| 2 | Claude 3.7 Sonnet | 93.2 |
| 3 | Nova Pro | 92.1 |
| 4 | Llama 3.3 70B Instruct | 92.1 |
| 5 | Gemma 3 27B | 90.4 |
| 6 | Nemotron Nano 9B V2 | 90.3 |
| 7 | Gemma 3 4B | 90.2 |
| 8 | Kimi K2-Instruct-0905 | 89.8 |
| 9 | Kimi K2 Instruct | 89.8 |
| 10 | Nova Lite | 89.7 |
| 11 | Llama 3.1 Nemotron Ultra 253B v1 | 89.5 |
| 12 | Gemma 3 12B | 88.9 |
| 13 | Qwen3 Next 80B A3B Thinking | 88.9 |
| 14 | Qwen3-235B-A22B-Instruct-2507 | 88.7 |
| 15 | Llama 3.1 405B Instruct | 88.6 |
| 16 | GPT-4.5 | 88.2 |
| 17 | Qwen3-235B-A22B-Thinking-2507 | 87.8 |
| 18 | Qwen3 Next 80B A3B Instruct | 87.6 |
| 19 | Llama 3.1 70B Instruct | 87.5 |
| 20 | GPT-4.1 | 87.4 |
| 21 | Nova Micro | 87.2 |
| 22 | Kimi-k1.5 | 87.2 |
| 23 | DeepSeek-V3 | 86.1 |
| 24 | Phi 4 Reasoning Plus | 84.9 |
| 25 | Qwen2.5 72B Instruct | 84.1 |
| 26 | GPT-4.1 Mini | 84.1 |
| 27 | QwQ-32B | 83.9 |
| 28 | Phi 4 Reasoning | 83.4 |
| 29 | Mistral Small 3 24B Instruct | 82.9 |
| 30 | GPT-4o | 81.0 |
| 31 | Llama 3.1 8B Instruct | 80.4 |
| 32 | Gemma 3 1B | 80.2 |
| 33 | Llama 3.1 Nemotron Nano 8B V1 | 79.3 |
| 34 | Llama 3.2 3B Instruct | 77.4 |
| 35 | Granite 3.3 8B Instruct | 74.8 |
| 36 | Granite 3.3 8B Base | 74.8 |
| 37 | GPT-4.1 Nano | 74.5 |
| 38 | Qwen2.5 7B Instruct | 71.2 |
| 39 | IBM Granite 4.0 Tiny Preview | 63.0 |
| 40 | Phi 4 | 63.0 |
| 41 | Pixtral-12B | 61.3 |
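The summary figures at the top of the page (41 models, top score 93.9, median 87.2) can be re-derived from the ranking table. The short sketch below does so with Python's standard library; the `scores` list is simply the score column transcribed in rank order, and the variable names are illustrative.

```python
from statistics import median

# Score column of the ranking table, in rank order (41 entries).
scores = [
    93.9, 93.2, 92.1, 92.1, 90.4, 90.3, 90.2, 89.8, 89.8, 89.7,
    89.5, 88.9, 88.9, 88.7, 88.6, 88.2, 87.8, 87.6, 87.5, 87.4,
    87.2, 87.2, 86.1, 84.9, 84.1, 84.1, 83.9, 83.4, 82.9, 81.0,
    80.4, 80.2, 79.3, 77.4, 74.8, 74.8, 74.5, 71.2, 63.0, 63.0,
    61.3,
]

model_count = len(scores)
top_score = max(scores)
# With an odd number of entries, the median is the middle value (the 21st of 41).
median_score = median(scores)

print(f"Models: {model_count}")
print(f"Top score: {top_score}")
print(f"Median: {median_score}")
```

With 41 entries the median is the 21st-ranked score, which is why it coincides with an actual row in the table rather than an average of two rows.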