Multi-IF
Multi-IF benchmarks LLMs on multi-turn and multilingual instruction following.
11 Models
80.6 Top score
72.2 Median
Multi-IF expands upon IFEval by incorporating multi-turn sequences and translating the English prompts into 7 other languages, resulting in 4,501 multilingual conversations with three turns each. Results on the benchmark show that current leading LLMs struggle to maintain accuracy across multi-turn instructions, and that error rates are higher for languages with non-Latin scripts.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
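The "best score to date" line described above is a running maximum over models ordered by release date. A minimal sketch follows, assuming each point is a (release date, score) pair. The dates and scores in it are hypothetical placeholders, not values from this leaderboard, and `best_to_date` is an illustrative helper name.

```python
from datetime import date
from itertools import accumulate

# Hypothetical (release_date, score) pairs for illustration only;
# these are NOT the leaderboard's actual release dates or scores.
points = [
    (date(2024, 1, 10), 55.0),
    (date(2024, 9, 15), 60.0),
    (date(2024, 6, 1), 62.5),
    (date(2025, 2, 20), 71.0),
]


def best_to_date(points):
    """Return (date, running maximum score) pairs in chronological order."""
    ordered = sorted(points)  # tuples sort by release date first
    dates = [d for d, _ in ordered]
    scores = [s for _, s in ordered]
    # accumulate with max yields the best score seen up to each date
    return list(zip(dates, accumulate(scores, max)))


frontier = best_to_date(points)
for d, best in frontier:
    print(d.isoformat(), best)
```

A model that scores below the current best (the 60.0 entry here) still appears as a point, but the line stays flat at the earlier maximum.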
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Qwen3-235B-A22B-Thinking-2507 | 80.6 |
| 2 | o3-mini | 79.5 |
| 3 | Qwen3 Next 80B A3B Thinking | 77.8 |
| 4 | Qwen3-235B-A22B-Instruct-2507 | 77.5 |
| 5 | Qwen3 Next 80B A3B Instruct | 75.8 |
| 6 | Qwen3 30B A3B | 72.2 |
| 7 | GPT-4.1 | 70.8 |
| 8 | GPT-4.5 | 70.8 |
| 9 | GPT-4.1 Mini | 67 |
| 10 | GPT-4o | 60.9 |
| 11 | GPT-4.1 Nano | 57.2 |
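The summary figures at the top of the page (11 models, top score 80.6, median 72.2) can be rechecked from the ranking itself. The sketch below copies the scores from the table; the variable names are illustrative, and the page's own computation method is not documented here, so this is only a consistency check.

```python
from statistics import median

# Scores copied from the ranking table above.
scores = {
    "Qwen3-235B-A22B-Thinking-2507": 80.6,
    "o3-mini": 79.5,
    "Qwen3 Next 80B A3B Thinking": 77.8,
    "Qwen3-235B-A22B-Instruct-2507": 77.5,
    "Qwen3 Next 80B A3B Instruct": 75.8,
    "Qwen3 30B A3B": 72.2,
    "GPT-4.1": 70.8,
    "GPT-4.5": 70.8,
    "GPT-4.1 Mini": 67,
    "GPT-4o": 60.9,
    "GPT-4.1 Nano": 57.2,
}

model_count = len(scores)
top_model = max(scores, key=scores.get)  # model with the highest score
top_score = scores[top_model]
median_score = median(scores.values())   # middle value of the 11 scores

print(model_count, top_model, top_score, median_score)
```

With an odd number of entries, the median is simply the sixth-ranked score, which is why it coincides with the Qwen3 30B A3B row.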