Benchmarks
The evaluations behind the rankings — what each one measures, and which models lead. Scores feed the per-category indices on the leaderboard.
Benchmarks: 42
Models scored: 440
Data points: 2668
Categories: 7
Most measured: GPQA Diamond (405 models)
Most saturated: AIME 2025 (100% of ceiling)
Hardest: FrontierMath (top score 26%)
Widest spread: AIME 2025 (100 pt range)
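
The four highlights above are simple aggregates over per-model scores. The page does not state how they are defined, so the following is only a sketch under assumed definitions: "saturation" as the top score expressed as a percentage of the maximum attainable score, "hardest" as the benchmark with the lowest top score, and "spread" as the gap between the best and worst score. The `BenchmarkScores` type, the helper names, and the `bench_a`/`bench_b`/`bench_c` data are hypothetical and purely illustrative; they are not taken from the leaderboard.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BenchmarkScores:
    """Hypothetical container: one score per model on a single benchmark."""
    name: str
    scores: list[float]
    ceiling: float = 100.0  # assumed maximum attainable score


def saturation(b: BenchmarkScores) -> float:
    """Top score as a percentage of the ceiling (assumed definition)."""
    return 100.0 * max(b.scores) / b.ceiling


def spread(b: BenchmarkScores) -> float:
    """Gap between the best and worst score, in points (assumed definition)."""
    return max(b.scores) - min(b.scores)


def summarize(benchmarks: list[BenchmarkScores]) -> dict[str, str]:
    """Pick the benchmark name for each highlight card."""
    return {
        "most_measured": max(benchmarks, key=lambda b: len(b.scores)).name,
        "most_saturated": max(benchmarks, key=saturation).name,
        "hardest": min(benchmarks, key=lambda b: max(b.scores)).name,
        "widest_spread": max(benchmarks, key=spread).name,
    }


# Made-up scores for illustration only.
demo = [
    BenchmarkScores("bench_a", [0.0, 55.0, 100.0]),
    BenchmarkScores("bench_b", [5.0, 26.0]),
    BenchmarkScores("bench_c", [40.0, 60.0, 70.0, 80.0]),
]
summary = summarize(demo)
print(summary)
```

Under these assumed definitions one benchmark can hold two cards at once: a benchmark where one model reaches the ceiling while another scores zero is both the most saturated and the one with the widest spread.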
1 benchmark
| Benchmark | Category | | Leader | Score |
|---|---|---|---|---|
| LongBench-v2 | Long Context | 2 | MiniMax-M1 | 61.5/100 |