Arena Hard
Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs, consisting of 500 challenging real-world prompts curated by BenchBuilder. The prompts cover open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark follows the LLM-as-a-Judge methodology, using GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard-Auto achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances than MT-Bench, which makes it effective for distinguishing between models of similar quality.
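To make the "correlation with human preference rankings" figure more concrete, here is a minimal sketch of how agreement between two rankings could be measured. It assumes a rank correlation such as Spearman's is meant, which the description above does not specify, and the function names (`ranks`, `pearson`, `spearman`) are illustrative rather than part of any Arena-Hard-Auto tooling.

```python
def ranks(values):
    """Return 1-based ranks (highest value gets rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(benchmark_scores, human_scores):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(benchmark_scores), ranks(human_scores))
```

Called with one list of benchmark scores and one list of human preference scores for the same models in the same order, `spearman` returns a value near 1 when the two orderings agree and near -1 when they are reversed.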
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
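The "best score to date" line described above is a running maximum over models ordered by release date. The sketch below shows one way to compute it; the function name `best_to_date` and the sample dates and scores are placeholders for illustration, not values taken from this leaderboard.

```python
from datetime import date


def best_to_date(points):
    """Given (release_date, score) pairs, return (release_date, best_score_so_far)
    pairs in chronological order."""
    best = float("-inf")
    line = []
    for released, score in sorted(points):
        best = max(best, score)  # the line never decreases
        line.append((released, best))
    return line


# Placeholder data: three hypothetical models, listed out of date order.
example_points = [
    (date(2024, 1, 1), 50.0),
    (date(2024, 3, 1), 40.0),
    (date(2024, 2, 1), 60.0),
]
example_line = best_to_date(example_points)
```

A model that scores below the current best still appears as a point on the chart, but the line stays flat at the earlier best until a later model exceeds it.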
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Qwen3 235B A22B | 95.6 |
| 2 | Qwen3 32B | 93.8 |
| 3 | Qwen3 30B A3B | 91 |
| 4 | Llama-3.3 Nemotron Super 49B v1 | 88.3 |
| 5 | Mistral Small 3 24B Instruct | 87.6 |
| 6 | Qwen2.5 72B Instruct | 81.2 |
| 7 | Phi 4 Reasoning Plus | 79 |
| 8 | DeepSeek-V2.5 | 76.2 |
| 9 | Phi 4 | 75.4 |
| 10 | Phi 4 Reasoning | 73.3 |
| 11 | Ministral 8B Instruct | 70.9 |
| 12 | Jamba 1.5 Large | 65.4 |
| 13 | Granite 3.3 8B Instruct | 57.6 |
| 14 | Granite 3.3 8B Base | 57.6 |
| 15 | Qwen2.5 7B Instruct | 52 |
| 16 | Jamba 1.5 Mini | 46.1 |
| 17 | Mistral Small 3.2 24B Instruct | 43.1 |
| 18 | Phi-3.5-MoE-instruct | 37.9 |
| 19 | Phi-3.5-mini-instruct | 37 |
| 20 | Phi 4 Mini | 32.8 |
| 21 | IBM Granite 4.0 Tiny Preview | 26.7 |