Benchmarks
The evaluations behind the rankings — what each one measures, and which models lead. Scores feed the per-category indices on the leaderboard.
- Benchmarks: 42
- Models scored: 440
- Data points: 2668
- Categories: 7
- Most measured: GPQA Diamond (405 models)
- Most saturated: AIME 2025 (100% of ceiling)
- Hardest: FrontierMath (top score 26%)
- Widest spread: AIME 2025 (100 pt range)
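
The page does not say how these four highlights are computed. The sketch below shows one plausible reading: saturation as the top score divided by the ceiling, spread as the highest score minus the lowest, and "hardest" as the benchmark with the lowest top score. The `BenchmarkScores` type, the function names, the default ceiling of 100, and the `SAMPLE` data are all hypothetical and used only for illustration. None of them come from the leaderboard.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BenchmarkScores:
    """Hypothetical container: one score per model for a single benchmark."""
    name: str
    scores: list[float]
    ceiling: float = 100.0  # assumed maximum attainable score


def saturation(b: BenchmarkScores) -> float:
    """Top score as a percentage of the ceiling (assumed definition)."""
    return 100.0 * max(b.scores) / b.ceiling


def spread(b: BenchmarkScores) -> float:
    """Range in points between the best and worst model (assumed definition)."""
    return max(b.scores) - min(b.scores)


def summarize(benchmarks: list[BenchmarkScores]) -> dict[str, str]:
    """Pick the benchmark that wins each highlight card."""
    return {
        "most_measured": max(benchmarks, key=lambda b: len(b.scores)).name,
        "most_saturated": max(benchmarks, key=saturation).name,
        "hardest": min(benchmarks, key=lambda b: max(b.scores)).name,
        "widest_spread": max(benchmarks, key=spread).name,
    }


# Made-up scores for three placeholder benchmarks, not real leaderboard data.
SAMPLE = [
    BenchmarkScores("bench_a", [0.0, 45.0, 100.0]),
    BenchmarkScores("bench_b", [5.0, 12.0, 26.0, 20.0]),
    BenchmarkScores("bench_c", [60.0, 70.0, 80.0, 85.0, 90.0]),
]

if __name__ == "__main__":
    for card, name in summarize(SAMPLE).items():
        print(f"{card}: {name}")
```

Under these definitions a single benchmark can take both "most saturated" and "widest spread", as AIME 2025 does above: one model reaching the ceiling while another scores near zero gives 100% saturation and a 100-point range at once.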
6 benchmarks
| Benchmark | Category | Models | Leader | Top score |
|---|---|---|---|---|
| BFCL | Agents | 10 | Llama 3.1 405B Instruct | 88.5/100 |
| BrowseComp | Agents | 10 | GPT-5 | 54.9/100 |
| TAU-bench Airline | Agents | 20 | Claude Sonnet 4.5 | 70/100 |
| TAU-bench Retail | Agents | 22 | Claude Sonnet 4.5 | 86.2/100 |
| τ²-bench Airline | Agents | 10 | o3 | 64.8/100 |
| τ²-bench Retail | Agents | 10 | Claude Haiku 4.5 | 83.2/100 |