Benchmarks

Benchmark	Category	Models	Leader	Score
AI2D	Multimodal	17	Claude 3.5 Sonnet	94.7/100
Aider Polyglot	Coding	21	GPT-5	88/100
Aider Polyglot Edit	Coding	10	DeepSeek-V3	79.7/100
AIME 2024	Math	46	Grok-3 Mini	95.8/100
AIME 2025	Math	221	Grok-4 Heavy	100/100
ARC-AGI-2	Reasoning	10	Gemini 3.1 Pro	77.1/100
Arena Hard	General	21	Qwen3 235B A22B	95.6/100
BFCL	Agents	10	Llama 3.1 405B Instruct	88.5/100
BIG-Bench Hard	Reasoning	28	Claude 3.5 Sonnet	93.1/100
BrowseComp	Agents	10	GPT-5	54.9/100
ChartQA	Multimodal	24	Claude 3.5 Sonnet	90.8/100
DocVQA	Multimodal	26	Qwen2.5 VL 72B Instruct	96.4/100
DROP	Reasoning	25	DeepSeek-V3	91.6/100
FrontierMath	Math	6	GPT-5	26.3/100
GPQA Diamond	Reasoning	405	Gemini 3.1 Pro	94.3/100
GSM8K	Math	45	Kimi K2 Instruct	97.3/100
HMMT 2025	Math	11	Grok 4 Fast	93.3/100
HumanEval	Coding	68	Claude Opus 4.6	95/100
Humanity’s Last Exam	General	360	Grok-4 Heavy	50.7/100
IFEval	General	41	o3-mini	93.9/100
LiveBench	General	13	o3-mini	84.6/100
LiveCodeBench	Coding	282	DeepSeek-V4-Pro	93.5/100
LongBench-v2	Long Context	2	MiniMax-M1	61.5/100
MATH	Math	67	o3-mini	97.9/100
MATH-500	Math	169	GPT-5	99.4/100
MathVista	Multimodal	34	o3	86.8/100
MBPP	Coding	31	Llama-3.3 Nemotron Super 49B v1	91.3/100
MGSM	Math	29	Llama 4 Maverick	92.3/100
MMLU	General	92	GPT-5	92.5/100
MMLU-Pro	General	292	Gemini 3 Pro	89.8/100
MMMU	Multimodal	52	GPT-5	84.2/100
MMMU-Pro	Multimodal	13	GPT-5	78.4/100
Multi-IF	General	11	Qwen3-235B-A22B-Thinking-2507	80.6/100
MultiPL-E	Coding	12	Qwen3-235B-A22B-Instruct-2507	87.9/100
SimpleQA	General	26	DeepSeek V3.2 Exp	97.1/100
SWE-bench Pro	Coding	4	Kimi K2.6	58.6/100
SWE-bench Verified	Coding	51	Claude Opus 4.7	87.6/100
TAU-bench Airline	Agents	20	Claude Sonnet 4.5	70/100
TAU-bench Retail	Agents	22	Claude Sonnet 4.5	86.2/100
Terminal-Bench	Coding	15	Claude Sonnet 4.5	50/100
τ²-bench Airline	Agents	10	o3	64.8/100
τ²-bench Retail	Agents	10	Claude Haiku 4.5	83.2/100