Arena Hard

Models: 21
Top score: 95.6
Median: 70.9

Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuilder. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses LLM-as-a-Judge methodology, with GPT-4.1 and Gemini-2.5 as automatic judges, to approximate human preference. Arena-Hard-Auto achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances compared to MT-Bench, which makes it effective at distinguishing between models of similar quality.
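To make the LLM-as-a-Judge idea concrete, here is a minimal sketch of one common way to turn per-prompt judge verdicts into a 0 to 100 score: a win rate against a fixed baseline, with ties counted as half a win. This is an assumption about the general approach, not a description of the exact scoring behind this leaderboard. The `win_rate` helper and the sample verdicts are hypothetical.

```python
from collections import Counter

ALLOWED = {"win", "tie", "loss"}


def win_rate(verdicts):
    """Score in percent from pairwise verdicts: win = 1, tie = 0.5, loss = 0.

    Assumption: each verdict is the judge's decision for one prompt,
    comparing the candidate model's answer with a baseline answer.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)


# Hypothetical verdicts for 10 prompts (illustrative data, not benchmark results).
sample_verdicts = ["win"] * 6 + ["tie"] * 2 + ["loss"] * 2
print(win_rate(sample_verdicts))  # 70.0
```

Counting a tie as half a win keeps the score symmetric: swapping candidate and baseline turns a score of s into 100 - s.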

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: Arena Hard score (y-axis, 20 to 100) by model release date (x-axis, 2024 to 2025)]

Best score to date:
DeepSeek-V2.5: 76.2 (2024-05-08)
Qwen2.5 72B Instruct: 81.2 (2024-09-19)
Mistral Small 3 24B Instruct: 87.6 (2025-01-30)
Llama-3.3 Nemotron Super 49B v1: 88.3 (2025-03-18)
Qwen3 235B A22B: 95.6 (2025-04-28)

Other models:
Jamba 1.5 Large: 65.4 (2024-08-22)
Jamba 1.5 Mini: 46.1 (2024-08-22)
Phi-3.5-MoE-instruct: 37.9 (2024-08-23)
Phi-3.5-mini-instruct: 37 (2024-08-23)
Ministral 8B Instruct: 70.9 (2024-10-16)
Qwen2.5 7B Instruct: 52 (2024-10-16)
Phi 4: 75.4 (2025-01-10)
Phi 4 Mini: 32.8 (2025-02-01)
Granite 3.3 8B Instruct: 57.6 (2025-04-16)
Granite 3.3 8B Base: 57.6 (2025-04-16)
Qwen3 32B: 93.8 (2025-04-28)
Qwen3 30B A3B: 91 (2025-04-28)
Phi 4 Reasoning Plus: 79 (2025-04-30)
Phi 4 Reasoning: 73.3 (2025-04-30)
IBM Granite 4.0 Tiny Preview: 26.7 (2025-05-02)
Mistral Small 3.2 24B Instruct: 43.1 (2025-06-20)
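The "best score to date" line can be reproduced from the chart's data points with a running maximum over release dates. The sketch below is illustrative: the function name `best_to_date` is hypothetical, and the rule that only the top scorer among same-day releases counts is an assumption, chosen because it matches the five models labeled on the line.

```python
from datetime import date

# (model, score, release date) as listed for the chart.
points = [
    ("Jamba 1.5 Large", 65.4, date(2024, 8, 22)),
    ("Jamba 1.5 Mini", 46.1, date(2024, 8, 22)),
    ("Phi-3.5-MoE-instruct", 37.9, date(2024, 8, 23)),
    ("Phi-3.5-mini-instruct", 37.0, date(2024, 8, 23)),
    ("Ministral 8B Instruct", 70.9, date(2024, 10, 16)),
    ("Qwen2.5 7B Instruct", 52.0, date(2024, 10, 16)),
    ("Phi 4", 75.4, date(2025, 1, 10)),
    ("Phi 4 Mini", 32.8, date(2025, 2, 1)),
    ("Granite 3.3 8B Instruct", 57.6, date(2025, 4, 16)),
    ("Granite 3.3 8B Base", 57.6, date(2025, 4, 16)),
    ("Qwen3 32B", 93.8, date(2025, 4, 28)),
    ("Qwen3 30B A3B", 91.0, date(2025, 4, 28)),
    ("Phi 4 Reasoning Plus", 79.0, date(2025, 4, 30)),
    ("Phi 4 Reasoning", 73.3, date(2025, 4, 30)),
    ("IBM Granite 4.0 Tiny Preview", 26.7, date(2025, 5, 2)),
    ("Mistral Small 3.2 24B Instruct", 43.1, date(2025, 6, 20)),
    ("DeepSeek-V2.5", 76.2, date(2024, 5, 8)),
    ("Qwen2.5 72B Instruct", 81.2, date(2024, 9, 19)),
    ("Mistral Small 3 24B Instruct", 87.6, date(2025, 1, 30)),
    ("Llama-3.3 Nemotron Super 49B v1", 88.3, date(2025, 3, 18)),
    ("Qwen3 235B A22B", 95.6, date(2025, 4, 28)),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order."""
    frontier = []
    best = float("-inf")
    # Sort by date, then by descending score, so that among same-day
    # releases only the highest scorer can set a new best (assumption).
    for model, score, released in sorted(points, key=lambda p: (p[2], -p[1])):
        if score > best:
            best = score
            frontier.append((model, score, released))
    return frontier


for model, score, released in best_to_date(points):
    print(f"{released}  {score:5.1f}  {model}")
```

Without the secondary sort on score, Qwen3 32B (93.8) could be processed before Qwen3 235B A22B (95.6) on 2025-04-28 and would appear as an extra step on the line.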

Ranking

1. Qwen3 235B A22B: 95.6
2. Qwen3 32B: 93.8
3. Qwen3 30B A3B: 91
4. Llama-3.3 Nemotron Super 49B v1: 88.3
5. Mistral Small 3 24B Instruct: 87.6
6. Qwen2.5 72B Instruct: 81.2
7. Phi 4 Reasoning Plus: 79
8. DeepSeek-V2.5: 76.2
9. Phi 4: 75.4
10. Phi 4 Reasoning: 73.3
11. Ministral 8B Instruct: 70.9
12. Jamba 1.5 Large: 65.4
13. Granite 3.3 8B Instruct: 57.6
14. Granite 3.3 8B Base: 57.6
15. Qwen2.5 7B Instruct: 52
16. Jamba 1.5 Mini: 46.1
17. Mistral Small 3.2 24B Instruct: 43.1
18. Phi-3.5-MoE-instruct: 37.9
19. Phi-3.5-mini-instruct: 37
20. Phi 4 Mini: 32.8
21. IBM Granite 4.0 Tiny Preview: 26.7
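
The headline figures (21 models, top score 95.6, median 70.9) can be checked directly against the ranking. The short sketch below does so with Python's standard library. The `scores` dictionary is simply the ranking transcribed by hand. With 21 entries, the median is the 11th-ranked score.

```python
from statistics import median

# Scores transcribed from the ranking.
scores = {
    "Qwen3 235B A22B": 95.6,
    "Qwen3 32B": 93.8,
    "Qwen3 30B A3B": 91.0,
    "Llama-3.3 Nemotron Super 49B v1": 88.3,
    "Mistral Small 3 24B Instruct": 87.6,
    "Qwen2.5 72B Instruct": 81.2,
    "Phi 4 Reasoning Plus": 79.0,
    "DeepSeek-V2.5": 76.2,
    "Phi 4": 75.4,
    "Phi 4 Reasoning": 73.3,
    "Ministral 8B Instruct": 70.9,
    "Jamba 1.5 Large": 65.4,
    "Granite 3.3 8B Instruct": 57.6,
    "Granite 3.3 8B Base": 57.6,
    "Qwen2.5 7B Instruct": 52.0,
    "Jamba 1.5 Mini": 46.1,
    "Mistral Small 3.2 24B Instruct": 43.1,
    "Phi-3.5-MoE-instruct": 37.9,
    "Phi-3.5-mini-instruct": 37.0,
    "Phi 4 Mini": 32.8,
    "IBM Granite 4.0 Tiny Preview": 26.7,
}

model_count = len(scores)
top_model = max(scores, key=scores.get)
top_score = scores[top_model]
median_score = median(scores.values())

print(f"Models: {model_count}")                # Models: 21
print(f"Top score: {top_score} ({top_model})")  # Top score: 95.6 (Qwen3 235B A22B)
print(f"Median: {median_score}")               # Median: 70.9
```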
