Multi-IF

Models: 11

Top score: 80.6

Median: 72.2

Multi-IF benchmarks LLMs on multi-turn and multilingual instruction following. It extends IFEval by adding multi-turn sequences and translating the English prompts into 7 other languages, which yields 4,501 multilingual conversations of three turns each. Results on the benchmark show that current leading LLMs struggle to maintain accuracy across multi-turn instructions and have higher error rates in languages with non-Latin scripts.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
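The "best score to date" line described above is a running maximum over models ordered by release date. A minimal sketch of that computation follows; the function name `best_to_date` and the sample points are hypothetical placeholders, not real release dates or scores from this leaderboard.

```python
from datetime import date


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in chronological order."""
    frontier = []
    best = float("-inf")
    # Sort by release date so the running maximum moves forward in time.
    for released, score in sorted(points):
        best = max(best, score)
        frontier.append((released, best))
    return frontier


# Hypothetical (release_date, score) points, for illustration only.
points = [
    (date(2024, 6, 1), 58.0),
    (date(2024, 1, 1), 60.0),
    (date(2025, 1, 1), 75.0),
]

for released, best in best_to_date(points):
    print(released.isoformat(), best)
```

A model that scores below the current best still appears as a point on the chart, but it leaves the line flat, as the second placeholder point does here.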

Ranking

Rank	Model	Organization	Score
1	Qwen3-235B-A22B-Thinking-2507	Alibaba	80.6
2	o3-mini	OpenAI	79.5
3	Qwen3 Next 80B A3B Thinking	Alibaba	77.8
4	Qwen3-235B-A22B-Instruct-2507	Alibaba	77.5
5	Qwen3 Next 80B A3B Instruct	Alibaba	75.8
6	Qwen3 30B A3B	Alibaba	72.2
7	GPT-4.1	OpenAI	70.8
8	GPT-4.5	OpenAI	70.8
9	GPT-4.1 Mini	OpenAI	67
10	GPT-4o	OpenAI	60.9
11	GPT-4.1 Nano	OpenAI	57.2
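The summary figures at the top of the page (model count, top score, median) can be recomputed from the ranking. The sketch below does so with Python's standard library; the scores are copied from the ranking, while the variable names `RANKING` and `summary` are illustrative.

```python
from statistics import median

# (model, organization, score) rows copied from the ranking above.
RANKING = [
    ("Qwen3-235B-A22B-Thinking-2507", "Alibaba", 80.6),
    ("o3-mini", "OpenAI", 79.5),
    ("Qwen3 Next 80B A3B Thinking", "Alibaba", 77.8),
    ("Qwen3-235B-A22B-Instruct-2507", "Alibaba", 77.5),
    ("Qwen3 Next 80B A3B Instruct", "Alibaba", 75.8),
    ("Qwen3 30B A3B", "Alibaba", 72.2),
    ("GPT-4.1", "OpenAI", 70.8),
    ("GPT-4.5", "OpenAI", 70.8),
    ("GPT-4.1 Mini", "OpenAI", 67.0),
    ("GPT-4o", "OpenAI", 60.9),
    ("GPT-4.1 Nano", "OpenAI", 57.2),
]

scores = [score for _, _, score in RANKING]
summary = {
    "models": len(scores),
    "top_score": max(scores),
    "median": median(scores),  # middle value of the 11 sorted scores
}
print(summary)  # {'models': 11, 'top_score': 80.6, 'median': 72.2}
```

With 11 entries the median is the sixth-ranked score, which is why it coincides with the Qwen3 30B A3B result.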

Related General benchmarks

Humanity’s Last Exam (360), MMLU-Pro (292), MMLU (92), IFEval (41), SimpleQA (26), Arena Hard (21)