General

Arena Hard

Models: 21

Top score: 95.6

Median: 70.9

Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs, consisting of 500 challenging real-world prompts curated by BenchBuilder. The prompts include open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses an LLM-as-a-judge methodology, with GPT-4.1 and Gemini-2.5 serving as automatic judges to approximate human preference. Arena-Hard reports a 98.6% correlation with human preference rankings and three times the separation of model performance compared to MT-Bench, which makes it effective at distinguishing between models of similar quality.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
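
The "best score to date" line described above is a running maximum over models ordered by release date. A minimal sketch of that computation follows; the dates and scores in `points` are hypothetical placeholders chosen for illustration, not values from this leaderboard, and the function name `best_to_date` is likewise an assumption.

```python
from datetime import date
from itertools import accumulate

# Hypothetical (release date, score) pairs for illustration only.
# These are NOT taken from the leaderboard.
points = [
    (date(2024, 3, 1), 40.0),
    (date(2024, 1, 15), 35.5),
    (date(2024, 6, 10), 38.2),
    (date(2024, 9, 5), 61.7),
]


def best_to_date(points):
    """Return (date, best score so far) pairs in chronological order."""
    ordered = sorted(points, key=lambda p: p[0])  # oldest release first
    dates = [d for d, _ in ordered]
    scores = [s for _, s in ordered]
    # accumulate with max yields the running maximum of the scores
    running = list(accumulate(scores, max))
    return list(zip(dates, running))


frontier = best_to_date(points)
for day, best in frontier:
    print(day.isoformat(), best)
```

A model that scores below the current best (the third point here) leaves the line flat rather than pulling it down.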

Ranking

Rank	Model	Organization	Score
1	Qwen3 235B A22B	Alibaba	95.6
2	Qwen3 32B	Alibaba	93.8
3	Qwen3 30B A3B	Alibaba	91
4	Llama-3.3 Nemotron Super 49B v1	NVIDIA	88.3
5	Mistral Small 3 24B Instruct	Mistral AI	87.6
6	Qwen2.5 72B Instruct	Alibaba	81.2
7	Phi 4 Reasoning Plus	Microsoft	79
8	DeepSeek-V2.5	DeepSeek	76.2
9	Phi 4	Microsoft	75.4
10	Phi 4 Reasoning	Microsoft	73.3
11	Ministral 8B Instruct	Mistral AI	70.9
12	Jamba 1.5 Large	AI21 Labs	65.4
13	Granite 3.3 8B Instruct	IBM	57.6
14	Granite 3.3 8B Base	IBM	57.6
15	Qwen2.5 7B Instruct	Alibaba	52
16	Jamba 1.5 Mini	AI21 Labs	46.1
17	Mistral Small 3.2 24B Instruct	Mistral AI	43.1
18	Phi-3.5-MoE-instruct	Microsoft	37.9
19	Phi-3.5-mini-instruct	Microsoft	37
20	Phi 4 Mini	Microsoft	32.8
21	IBM Granite 4.0 Tiny Preview	IBM	26.7
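
The summary figures for this benchmark (21 models, top score 95.6, median 70.9) can be checked against the ranking. The sketch below copies the scores from the rows above in rank order and uses Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Scores copied from the ranking above, in rank order.
scores = [
    95.6, 93.8, 91, 88.3, 87.6, 81.2, 79, 76.2, 75.4, 73.3, 70.9,
    65.4, 57.6, 57.6, 52, 46.1, 43.1, 37.9, 37, 32.8, 26.7,
]

model_count = len(scores)
top_score = max(scores)
# With an odd number of entries, the median is the middle value
# of the sorted list (here, the 11th of 21).
median_score = statistics.median(scores)

print(model_count, top_score, median_score)
```

Because the list has 21 entries, the median is simply the score of the model ranked 11th.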

Related General benchmarks

Humanity’s Last Exam: 360
MMLU-Pro: 292
MMLU: 92
IFEval: 41
SimpleQA: 26
LiveBench: 13