Benchmarks
The evaluations behind the rankings: what each one measures and which models lead. Scores feed the per-category indices on the leaderboard.
42 benchmarks · 440 models scored · 2668 data points · 7 categories
Most measured: GPQA Diamond (405 models)

Most saturated: AIME 2025 (100% of ceiling)

Hardest: FrontierMath (top score 26%)

Widest spread: AIME 2025 (100 pt range)
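
The page does not define the "Most saturated", "Hardest", and "Widest spread" figures, so the following is only a minimal sketch of one plausible reading. It assumes saturation is the top score as a percentage of the maximum possible score, "hardest" is the benchmark with the lowest top score, and spread is the gap between the highest and lowest model scores. The `BenchmarkRow` type and the function names are hypothetical. The sample rows are the leader scores from the General table on this page, and the ceiling of 100 is taken from the "/100" in that table.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class BenchmarkRow:
    """One benchmark with its leading score (hypothetical structure)."""
    name: str
    top_score: float
    ceiling: float = 100.0  # assumption: every score is out of 100


def saturation(row: BenchmarkRow) -> float:
    """Top score as a percentage of the ceiling (assumed definition)."""
    return 100.0 * row.top_score / row.ceiling


def most_saturated(rows: Iterable[BenchmarkRow]) -> BenchmarkRow:
    """Benchmark whose leader is closest to the ceiling."""
    return max(rows, key=saturation)


def hardest(rows: Iterable[BenchmarkRow]) -> BenchmarkRow:
    """Benchmark whose leader is furthest from the ceiling."""
    return min(rows, key=saturation)


def spread(scores: Sequence[float]) -> float:
    """Range in points between the best and worst model on one benchmark."""
    return max(scores) - min(scores)


# Leader scores copied from the General table on this page.
ROWS = [
    BenchmarkRow("Arena Hard", 95.6),
    BenchmarkRow("Humanity's Last Exam", 50.7),
    BenchmarkRow("IFEval", 93.9),
    BenchmarkRow("LiveBench", 84.6),
    BenchmarkRow("MMLU", 92.5),
    BenchmarkRow("MMLU-Pro", 89.8),
    BenchmarkRow("Multi-IF", 80.6),
    BenchmarkRow("SimpleQA", 97.1),
]

if __name__ == "__main__":
    print("Most saturated:", most_saturated(ROWS).name)
    print("Hardest:", hardest(ROWS).name)
```

Within the General category alone, this reading would pick SimpleQA as most saturated and Humanity's Last Exam as hardest. The page-wide highlights cover all 42 benchmarks, so they name different ones. Computing spread needs per-model scores, which this page does not list.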
8 benchmarks
| Benchmark | Category | Models | Leader | Score |
|---|---|---|---|---|
| Arena Hard | General | 21 | Qwen3 235B A22B | 95.6/100 |
| Humanity’s Last Exam | General | 360 | Grok-4 Heavy | 50.7/100 |
| IFEval | General | 41 | o3-mini | 93.9/100 |
| LiveBench | General | 13 | o3-mini | 84.6/100 |
| MMLU | General | 92 | GPT-5 | 92.5/100 |
| MMLU-Pro | General | 292 | Gemini 3 Pro | 89.8/100 |
| Multi-IF | General | 11 | Qwen3-235B-A22B-Thinking-2507 | 80.6/100 |
| SimpleQA | General | 26 | DeepSeek V3.2 Exp | 97.1/100 |