AI Hub

Benchmarks

The evaluations behind the rankings: what each one measures and which models lead. Scores feed the per-category indices on the leaderboard.

Benchmarks: 42
Models scored: 440
Data points: 2668
Categories: 7

6 benchmarks

Benchmark          Category  Models  Leader                   Score
BFCL               Agents    10      Llama 3.1 405B Instruct  88.5/100
BrowseComp         Agents    10      GPT-5                    54.9/100
TAU-bench Airline  Agents    20      Claude Sonnet 4.5        70/100
TAU-bench Retail   Agents    22      Claude Sonnet 4.5        86.2/100
τ²-bench Airline   Agents    10      o3                       64.8/100
τ²-bench Retail    Agents    10      Claude Haiku 4.5         83.2/100