Benchmarks
The evaluations behind the rankings — what each one measures, and which models lead. Scores feed the per-category indices on the leaderboard.
Benchmarks: 42
Models scored: 440
Data points: 2668
Categories: 7
Most measured: GPQA Diamond (405 models)
Most saturated: AIME 2025 (100% of ceiling)
Hardest: FrontierMath (top score 26%)
Widest spread: AIME 2025 (100 pt range)
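The "Most saturated", "Hardest", and "Widest spread" figures can be read as simple summaries of per-model scores. The page does not state the formulas, so the sketch below rests on assumptions: saturation is taken as the top score divided by a ceiling of 100, "hardest" as the benchmark with the lowest top score, and spread as the gap between the best and worst score. The `BenchmarkScores` container, the function names, and the sample scores are hypothetical and not leaderboard data.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkScores:
    """Per-model scores for one benchmark (hypothetical container)."""
    name: str
    scores: list[float]      # one entry per model scored
    ceiling: float = 100.0   # assumed maximum attainable score


def saturation_pct(b: BenchmarkScores) -> float:
    """Top score as a percentage of the ceiling (assumed definition)."""
    return 100.0 * max(b.scores) / b.ceiling


def spread_pts(b: BenchmarkScores) -> float:
    """Gap in points between the best and worst scoring models."""
    return max(b.scores) - min(b.scores)


# Made-up scores for illustration only; not taken from the leaderboard.
demo = [
    BenchmarkScores("example-saturated", [0.0, 62.5, 100.0]),
    BenchmarkScores("example-hard", [2.0, 11.0, 26.0]),
]

most_saturated = max(demo, key=saturation_pct)  # highest top score vs. ceiling
hardest = min(demo, key=saturation_pct)         # lowest top score vs. ceiling
widest = max(demo, key=spread_pts)              # largest best-to-worst gap
```

Under these assumptions, one benchmark can be both most saturated and widest in spread, as AIME 2025 is here: that happens when at least one model reaches the ceiling while another scores zero.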
4 benchmarks
| Benchmark | Category | Models | Leader | Top score |
|---|---|---|---|---|
| ARC-AGI-2 | Reasoning | 10 | Gemini 3.1 Pro | 77.1/100 |
| BIG-Bench Hard | Reasoning | 28 | Claude 3.5 Sonnet | 93.1/100 |
| DROP | Reasoning | 25 | DeepSeek-V3 | 91.6/100 |
| GPQA Diamond | Reasoning | 405 | Gemini 3.1 Pro | 94.3/100 |