Benchmarks
The evaluations behind the rankings — what each one measures, and which models lead. Scores feed the per-category indices on the leaderboard.
42 benchmarks, 440 models scored, 2668 data points, 7 categories
Most measured: GPQA Diamond (405 models)
Most saturated: AIME 2025 (100% of ceiling)
Hardest: FrontierMath (top score 26%)
Widest spread: AIME 2025 (100 pt range)
9 benchmarks
| Benchmark | Category | Models scored | Leader | Top score |
|---|---|---|---|---|
| Aider Polyglot | Coding | 21 | GPT-5 | 88/100 |
| Aider Polyglot Edit | Coding | 10 | DeepSeek-V3 | 79.7/100 |
| HumanEval | Coding | 68 | Claude Opus 4.6 | 95/100 |
| LiveCodeBench | Coding | 282 | DeepSeek-V4-Pro | 93.5/100 |
| MBPP | Coding | 31 | Llama-3.3 Nemotron Super 49B v1 | 91.3/100 |
| MultiPL-E | Coding | 12 | Qwen3-235B-A22B-Instruct-2507 | 87.9/100 |
| SWE-bench Pro | Coding | 4 | Kimi K2.6 | 58.6/100 |
| SWE-bench Verified | Coding | 51 | Claude Opus 4.7 | 87.6/100 |
| Terminal-Bench | Coding | 15 | Claude Sonnet 4.5 | 50/100 |
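
The summary labels used on this page (most measured, most saturated, hardest) can be illustrated with the Coding rows above. The sketch below is a guess at the arithmetic, not the site's actual method: it assumes the third column is the number of models scored, that "saturation" means the top score as a percentage of the 100-point ceiling, and that "hardest" means the lowest top score. The names `CODING`, `saturation`, and `summarize` are hypothetical.

```python
# Rows copied from the Coding table: (benchmark, models scored, top score out of 100).
# Reading the third table column as "models scored" is an assumption.
CODING = [
    ("Aider Polyglot", 21, 88.0),
    ("Aider Polyglot Edit", 10, 79.7),
    ("HumanEval", 68, 95.0),
    ("LiveCodeBench", 282, 93.5),
    ("MBPP", 31, 91.3),
    ("MultiPL-E", 12, 87.9),
    ("SWE-bench Pro", 4, 58.6),
    ("SWE-bench Verified", 51, 87.6),
    ("Terminal-Bench", 15, 50.0),
]

CEILING = 100.0  # every score in the table is reported out of 100


def saturation(top_score, ceiling=CEILING):
    """Top score as a percentage of the ceiling (assumed definition)."""
    return 100.0 * top_score / ceiling


def summarize(rows):
    """Pick out the summary benchmarks under the assumed definitions."""
    return {
        # Most measured: the benchmark with the most models scored.
        "most_measured": max(rows, key=lambda r: r[1])[0],
        # Most saturated: the benchmark whose top score is closest to the ceiling.
        "most_saturated": max(rows, key=lambda r: r[2])[0],
        # Hardest: the benchmark with the lowest top score.
        "hardest": min(rows, key=lambda r: r[2])[0],
    }


if __name__ == "__main__":
    for label, name in summarize(CODING).items():
        print(f"{label}: {name}")
```

Restricted to these nine Coding rows, the picks differ from the site-wide highlights, which are computed over all 42 benchmarks. The "widest spread" figure cannot be reproduced from this table because it needs each benchmark's lowest score as well as its highest.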