AI Hub

Benchmarks

The evaluations behind the rankings: what each one measures and which models lead. Scores feed the per-category indices on the leaderboard.

Benchmarks: 42
Models scored: 440
Data points: 2,668
Categories: 7

9 benchmarks

Benchmark            Category  Models scored  Leader                            Score
Aider Polyglot       Coding    21             GPT-5                             88/100
Aider Polyglot Edit  Coding    10             DeepSeek-V3                       79.7/100
HumanEval            Coding    68             Claude Opus 4.6                   95/100
LiveCodeBench        Coding    282            DeepSeek-V4-Pro                   93.5/100
MBPP                 Coding    31             Llama-3.3 Nemotron Super 49B v1   91.3/100
MultiPL-E            Coding    12             Qwen3-235B-A22B-Instruct-2507     87.9/100
SWE-bench Pro        Coding    4              Kimi K2.6                         58.6/100
SWE-bench Verified   Coding    51             Claude Opus 4.7                   87.6/100
Terminal-Bench       Coding    15             Claude Sonnet 4.5                 50/100