Benchmarks

The evaluations behind the rankings: what each one measures and which models lead. Scores feed the per-category indices on the leaderboard.

42 benchmarks
440 models scored
2668 data points
7 categories

4 benchmarks

Benchmark	Category	Models	Leader	Score
ARC-AGI-2	Reasoning	10	Gemini 3.1 Pro	77.1/100
BIG-Bench Hard	Reasoning	28	Claude 3.5 Sonnet	93.1/100
DROP	Reasoning	25	DeepSeek-V3	91.6/100
GPQA Diamond	Reasoning	405	Gemini 3.1 Pro	94.3/100
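
For readers who want to work with these rows programmatically, here is a minimal Python sketch that parses the tab-separated rows and expresses each leading score as a percentage of its ceiling (the number after the slash). The names `LeaderRow`, `parse_row`, and `pct_of_ceiling` are hypothetical helpers introduced for illustration, and the third column is simply kept as an integer count. The sketch does not reproduce how the leaderboard computes its per-category indices, since that method is not described here.

```python
from collections import Counter
from dataclasses import dataclass

# The four Reasoning rows from the table above, tab-separated.
ROWS = (
    "ARC-AGI-2\tReasoning\t10\tGemini 3.1 Pro\t77.1/100\n"
    "BIG-Bench Hard\tReasoning\t28\tClaude 3.5 Sonnet\t93.1/100\n"
    "DROP\tReasoning\t25\tDeepSeek-V3\t91.6/100\n"
    "GPQA Diamond\tReasoning\t405\tGemini 3.1 Pro\t94.3/100\n"
)


@dataclass
class LeaderRow:
    """One table row (hypothetical structure, for illustration only)."""
    benchmark: str
    category: str
    count: int       # third column, kept as a plain integer count
    leader: str
    score: float     # leading score, e.g. 77.1
    ceiling: float   # maximum possible score, e.g. 100


def parse_row(line: str) -> LeaderRow:
    """Split one tab-separated row and unpack the 'score/ceiling' field."""
    benchmark, category, count, leader, score_text = line.split("\t")
    score, ceiling = score_text.split("/")
    return LeaderRow(benchmark, category, int(count), leader,
                     float(score), float(ceiling))


def pct_of_ceiling(row: LeaderRow) -> float:
    """Leading score expressed as a percentage of the benchmark's ceiling."""
    return 100.0 * row.score / row.ceiling


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# How many of the listed benchmarks each model currently leads.
leader_counts = Counter(row.leader for row in rows)

for row in rows:
    print(f"{row.benchmark}: {row.leader} at {pct_of_ceiling(row):.1f}% of ceiling")
```

Because every ceiling in these rows is 100, the percentage equals the raw score. The division only matters for benchmarks scored on a different scale.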