Multi-IF

Multi-IF benchmarks LLMs on multi-turn and multilingual instruction following.

Models: 11
Top score: 80.6
Median: 72.2

Multi-IF extends IFEval by adding multi-turn sequences and translating the English prompts into 7 other languages, yielding 4,501 multilingual conversations of three turns each. Results on the benchmark show that current leading LLMs struggle to maintain accuracy across multi-turn instructions and have higher error rates in languages with non-Latin scripts.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: score (axis from 50 to 90) against release date (2024 to 2025)]

GPT-4o: 60.9 (2024-05-13)
o3-mini: 79.5 (2025-01-31)
GPT-4.5: 70.8 (2025-02-27)
GPT-4.1: 70.8 (2025-04-14)
GPT-4.1 Mini: 67 (2025-04-14)
GPT-4.1 Nano: 57.2 (2025-04-14)
Qwen3 30B A3B: 72.2 (2025-04-28)
Qwen3-235B-A22B-Instruct-2507: 77.5 (2025-07-22)
Qwen3-235B-A22B-Thinking-2507: 80.6 (2025-07-25)
Qwen3 Next 80B A3B Thinking: 77.8 (2025-09-11)
Qwen3 Next 80B A3B Instruct: 75.8 (2025-09-11)

Labeled points on the best-to-date line: GPT-4o, o3-mini, Qwen3-235B-A22B-Thinking-2507.
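The "best score to date" line can be read as a running maximum over models sorted by release date. The following is a minimal sketch of that idea, not the page's actual plotting code; the `points` list and the `best_to_date` helper are names introduced here for illustration, and the data is copied from the chart above.

```python
from datetime import date

# (model, score, release date), copied from the chart on this page
points = [
    ("GPT-4o", 60.9, date(2024, 5, 13)),
    ("o3-mini", 79.5, date(2025, 1, 31)),
    ("GPT-4.5", 70.8, date(2025, 2, 27)),
    ("GPT-4.1", 70.8, date(2025, 4, 14)),
    ("GPT-4.1 Mini", 67.0, date(2025, 4, 14)),
    ("GPT-4.1 Nano", 57.2, date(2025, 4, 14)),
    ("Qwen3 30B A3B", 72.2, date(2025, 4, 28)),
    ("Qwen3-235B-A22B-Instruct-2507", 77.5, date(2025, 7, 22)),
    ("Qwen3-235B-A22B-Thinking-2507", 80.6, date(2025, 7, 25)),
    ("Qwen3 Next 80B A3B Thinking", 77.8, date(2025, 9, 11)),
    ("Qwen3 Next 80B A3B Instruct", 75.8, date(2025, 9, 11)),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order.

    Hypothetical helper: a model joins the line only if its score is
    strictly higher than every score released before it.
    """
    frontier = []
    best = float("-inf")
    for model, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            frontier.append((model, score, released))
    return frontier


for model, score, released in best_to_date(points):
    print(f"{released.isoformat()}  {model}: {score}")
```

Under these assumptions the line passes through three models only: GPT-4o, o3-mini, and Qwen3-235B-A22B-Thinking-2507.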

Ranking

1. Qwen3-235B-A22B-Thinking-2507: 80.6
2. o3-mini: 79.5
3. Qwen3 Next 80B A3B Thinking: 77.8
4. Qwen3-235B-A22B-Instruct-2507: 77.5
5. Qwen3 Next 80B A3B Instruct: 75.8
6. Qwen3 30B A3B: 72.2
7. GPT-4.1: 70.8
8. GPT-4.5: 70.8
9. GPT-4.1 Mini: 67
10. GPT-4o: 60.9
11. GPT-4.1 Nano: 57.2
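
The summary figures near the top of the page (11 models, top score 80.6, median 72.2) follow directly from the ranking. A short sketch of the arithmetic, assuming the median is the plain middle value of the 11 scores; the `scores` dictionary is an illustrative name, with values copied from the ranking:

```python
from statistics import median

# Scores copied from the ranking on this page
scores = {
    "Qwen3-235B-A22B-Thinking-2507": 80.6,
    "o3-mini": 79.5,
    "Qwen3 Next 80B A3B Thinking": 77.8,
    "Qwen3-235B-A22B-Instruct-2507": 77.5,
    "Qwen3 Next 80B A3B Instruct": 75.8,
    "Qwen3 30B A3B": 72.2,
    "GPT-4.1": 70.8,
    "GPT-4.5": 70.8,
    "GPT-4.1 Mini": 67.0,
    "GPT-4o": 60.9,
    "GPT-4.1 Nano": 57.2,
}

top_model = max(scores, key=scores.get)  # model with the highest score
mid_score = median(scores.values())      # 6th of 11 sorted values

print(f"Models: {len(scores)}")
print(f"Top score: {scores[top_model]} ({top_model})")
print(f"Median: {mid_score}")
```

With an odd number of entries the median is the sixth-ranked score, which here belongs to Qwen3 30B A3B.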
