AI Hub

Benchmarks

The evaluations behind the rankings: what each one measures and which models lead. Scores feed the per-category indices on the leaderboard.

Benchmarks: 42
Models scored: 440
Data points: 2668
Categories: 7

42 benchmarks

| Benchmark | Category | Models scored | Leader | Top score |
| --- | --- | --- | --- | --- |
| AI2D | Multimodal | 17 | Claude 3.5 Sonnet | 94.7/100 |
| Aider Polyglot | Coding | 21 | GPT-5 | 88/100 |
| Aider Polyglot Edit | Coding | 10 | DeepSeek-V3 | 79.7/100 |
| AIME 2024 | Math | 46 | Grok-3 Mini | 95.8/100 |
| AIME 2025 | Math | 221 | Grok-4 Heavy | 100/100 |
| ARC-AGI-2 | Reasoning | 10 | Gemini 3.1 Pro | 77.1/100 |
| Arena Hard | General | 21 | Qwen3 235B A22B | 95.6/100 |
| BFCL | Agents | 10 | Llama 3.1 405B Instruct | 88.5/100 |
| BIG-Bench Hard | Reasoning | 28 | Claude 3.5 Sonnet | 93.1/100 |
| BrowseComp | Agents | 10 | GPT-5 | 54.9/100 |
| ChartQA | Multimodal | 24 | Claude 3.5 Sonnet | 90.8/100 |
| DocVQA | Multimodal | 26 | Qwen2.5 VL 72B Instruct | 96.4/100 |
| DROP | Reasoning | 25 | DeepSeek-V3 | 91.6/100 |
| FrontierMath | Math | 6 | GPT-5 | 26.3/100 |
| GPQA Diamond | Reasoning | 405 | Gemini 3.1 Pro | 94.3/100 |
| GSM8K | Math | 45 | Kimi K2 Instruct | 97.3/100 |
| HMMT 2025 | Math | 11 | Grok 4 Fast | 93.3/100 |
| HumanEval | Coding | 68 | Claude Opus 4.6 | 95/100 |
| Humanity’s Last Exam | General | 360 | Grok-4 Heavy | 50.7/100 |
| IFEval | General | 41 | o3-mini | 93.9/100 |
| LiveBench | General | 13 | o3-mini | 84.6/100 |
| LiveCodeBench | Coding | 282 | DeepSeek-V4-Pro | 93.5/100 |
| LongBench-v2 | Long Context | 2 | MiniMax-M1 | 61.5/100 |
| MATH | Math | 67 | o3-mini | 97.9/100 |
| MATH-500 | Math | 169 | GPT-5 | 99.4/100 |
| MathVista | Multimodal | 34 | o3 | 86.8/100 |
| MBPP | Coding | 31 | Llama-3.3 Nemotron Super 49B v1 | 91.3/100 |
| MGSM | Math | 29 | Llama 4 Maverick | 92.3/100 |
| MMLU | General | 92 | GPT-5 | 92.5/100 |
| MMLU-Pro | General | 292 | Gemini 3 Pro | 89.8/100 |
| MMMU | Multimodal | 52 | GPT-5 | 84.2/100 |
| MMMU-Pro | Multimodal | 13 | GPT-5 | 78.4/100 |
| Multi-IF | General | 11 | Qwen3-235B-A22B-Thinking-2507 | 80.6/100 |
| MultiPL-E | Coding | 12 | Qwen3-235B-A22B-Instruct-2507 | 87.9/100 |
| SimpleQA | General | 26 | DeepSeek V3.2 Exp | 97.1/100 |
| SWE-bench Pro | Coding | 4 | Kimi K2.6 | 58.6/100 |
| SWE-bench Verified | Coding | 51 | Claude Opus 4.7 | 87.6/100 |
| TAU-bench Airline | Agents | 20 | Claude Sonnet 4.5 | 70/100 |
| TAU-bench Retail | Agents | 22 | Claude Sonnet 4.5 | 86.2/100 |
| Terminal-Bench | Coding | 15 | Claude Sonnet 4.5 | 50/100 |
| τ²-bench Airline | Agents | 10 | o3 | 64.8/100 |
| τ²-bench Retail | Agents | 10 | Claude Haiku 4.5 | 83.2/100 |