Benchmarks
The evaluations behind the rankings: what each one measures and which models lead. Scores feed the per-category indices on the leaderboard.
Benchmarks: 42
Models scored: 440
Data points: 2,668
Categories: 7
Most measured: GPQA Diamond (405 models)
Most saturated: AIME 2025 (100% of ceiling)
Hardest: FrontierMath (top score 26%)
Widest spread: AIME 2025 (100 pt range)
42 benchmarks
| Benchmark | Category | Models scored | Leader | Top score |
|---|---|---|---|---|
| AI2D | Multimodal | 17 | Claude 3.5 Sonnet | 94.7/100 |
| Aider Polyglot | Coding | 21 | GPT-5 | 88/100 |
| Aider Polyglot Edit | Coding | 10 | DeepSeek-V3 | 79.7/100 |
| AIME 2024 | Math | 46 | Grok-3 Mini | 95.8/100 |
| AIME 2025 | Math | 221 | Grok-4 Heavy | 100/100 |
| ARC-AGI-2 | Reasoning | 10 | Gemini 3.1 Pro | 77.1/100 |
| Arena Hard | General | 21 | Qwen3 235B A22B | 95.6/100 |
| BFCL | Agents | 10 | Llama 3.1 405B Instruct | 88.5/100 |
| BIG-Bench Hard | Reasoning | 28 | Claude 3.5 Sonnet | 93.1/100 |
| BrowseComp | Agents | 10 | GPT-5 | 54.9/100 |
| ChartQA | Multimodal | 24 | Claude 3.5 Sonnet | 90.8/100 |
| DocVQA | Multimodal | 26 | Qwen2.5 VL 72B Instruct | 96.4/100 |
| DROP | Reasoning | 25 | DeepSeek-V3 | 91.6/100 |
| FrontierMath | Math | 6 | GPT-5 | 26.3/100 |
| GPQA Diamond | Reasoning | 405 | Gemini 3.1 Pro | 94.3/100 |
| GSM8K | Math | 45 | Kimi K2 Instruct | 97.3/100 |
| HMMT 2025 | Math | 11 | Grok 4 Fast | 93.3/100 |
| HumanEval | Coding | 68 | Claude Opus 4.6 | 95/100 |
| Humanity’s Last Exam | General | 360 | Grok-4 Heavy | 50.7/100 |
| IFEval | General | 41 | o3-mini | 93.9/100 |
| LiveBench | General | 13 | o3-mini | 84.6/100 |
| LiveCodeBench | Coding | 282 | DeepSeek-V4-Pro | 93.5/100 |
| LongBench-v2 | Long Context | 2 | MiniMax-M1 | 61.5/100 |
| MATH | Math | 67 | o3-mini | 97.9/100 |
| MATH-500 | Math | 169 | GPT-5 | 99.4/100 |
| MathVista | Multimodal | 34 | o3 | 86.8/100 |
| MBPP | Coding | 31 | Llama-3.3 Nemotron Super 49B v1 | 91.3/100 |
| MGSM | Math | 29 | Llama 4 Maverick | 92.3/100 |
| MMLU | General | 92 | GPT-5 | 92.5/100 |
| MMLU-Pro | General | 292 | Gemini 3 Pro | 89.8/100 |
| MMMU | Multimodal | 52 | GPT-5 | 84.2/100 |
| MMMU-Pro | Multimodal | 13 | GPT-5 | 78.4/100 |
| Multi-IF | General | 11 | Qwen3-235B-A22B-Thinking-2507 | 80.6/100 |
| MultiPL-E | Coding | 12 | Qwen3-235B-A22B-Instruct-2507 | 87.9/100 |
| SimpleQA | General | 26 | DeepSeek V3.2 Exp | 97.1/100 |
| SWE-bench Pro | Coding | 4 | Kimi K2.6 | 58.6/100 |
| SWE-bench Verified | Coding | 51 | Claude Opus 4.7 | 87.6/100 |
| TAU-bench Airline | Agents | 20 | Claude Sonnet 4.5 | 70/100 |
| TAU-bench Retail | Agents | 22 | Claude Sonnet 4.5 | 86.2/100 |
| Terminal-Bench | Coding | 15 | Claude Sonnet 4.5 | 50/100 |
| τ²-bench Airline | Agents | 10 | o3 | 64.8/100 |
| τ²-bench Retail | Agents | 10 | Claude Haiku 4.5 | 83.2/100 |
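
The highlight figures near the top of the page can be cross-checked against the table. The sketch below does this for three of them, using a handful of rows copied from the table. The page does not spell out how the highlights are defined, so the definitions here are assumptions: "most saturated" is taken as the highest top score against the 100-point ceiling, and "hardest" as the lowest top score. The `Row` type and the function names are hypothetical, introduced only for illustration. "Widest spread" is omitted because it needs per-model scores, which the table does not show.

```python
from typing import NamedTuple


class Row(NamedTuple):
    benchmark: str
    category: str
    models_scored: int
    leader: str
    top_score: float  # out of 100


# A handful of rows copied from the table above (not the full 42).
ROWS = [
    Row("AIME 2025", "Math", 221, "Grok-4 Heavy", 100.0),
    Row("FrontierMath", "Math", 6, "GPT-5", 26.3),
    Row("GPQA Diamond", "Reasoning", 405, "Gemini 3.1 Pro", 94.3),
    Row("LongBench-v2", "Long Context", 2, "MiniMax-M1", 61.5),
    Row("SWE-bench Verified", "Coding", 51, "Claude Opus 4.7", 87.6),
]


def most_measured(rows):
    """Benchmark with the largest number of scored models."""
    return max(rows, key=lambda r: r.models_scored)


def most_saturated(rows):
    """Assumed definition: highest top score relative to the 100-point ceiling."""
    return max(rows, key=lambda r: r.top_score)


def hardest(rows):
    """Assumed definition: lowest top score."""
    return min(rows, key=lambda r: r.top_score)


if __name__ == "__main__":
    for label, pick in [
        ("Most measured", most_measured),
        ("Most saturated", most_saturated),
        ("Hardest", hardest),
    ]:
        row = pick(ROWS)
        print(f"{label}: {row.benchmark} ({row.models_scored} models, top {row.top_score}/100)")
```

On these sample rows the three functions pick GPQA Diamond, AIME 2025, and FrontierMath, which agrees with the highlights shown on the page.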