Coding

Aider Polyglot

Models: 21

Top score: 88

Median: 60

A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust. Models receive two attempts to solve each problem, with test error feedback provided after the first attempt if it fails. The benchmark measures both initial problem-solving ability and capacity to edit code based on error feedback, providing an end-to-end evaluation of code generation and editing capabilities across multiple programming languages.
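The two-attempt protocol described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `solve`, `run_tests`, and `evaluate` are hypothetical names, and the callables stand in for the model call and the per-exercise test runner.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interfaces (assumptions, not the real harness):
#   solve(problem, feedback) -> candidate source code
#   run_tests(problem, code) -> None if all tests pass, else the test error output
Solve = Callable[[str, Optional[str]], str]
RunTests = Callable[[str, str], Optional[str]]


def evaluate(
    problems: Sequence[str],
    solve: Solve,
    run_tests: RunTests,
    max_attempts: int = 2,
) -> Tuple[float, float]:
    """Return (first-attempt pass rate, final pass rate), both in percent."""
    first_pass = 0
    final_pass = 0
    for problem in problems:
        feedback = None  # no error feedback is available on the first attempt
        for attempt in range(max_attempts):
            code = solve(problem, feedback)
            feedback = run_tests(problem, code)
            if feedback is None:  # tests passed, stop retrying this problem
                final_pass += 1
                if attempt == 0:
                    first_pass += 1
                break
    n = len(problems)
    return 100 * first_pass / n, 100 * final_pass / n
```

Tracking the first-attempt rate separately from the final rate mirrors the two abilities the description names: initial problem solving, and editing code in response to test errors.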

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
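The "best score to date" line is a running maximum over models ordered by release date. A small sketch of that computation follows. The model names, dates, and scores are made-up placeholders, because the page text does not include release dates, and `best_to_date` is a hypothetical helper name.

```python
from datetime import date
from itertools import accumulate

# Illustrative placeholder data only: names, dates, and scores are invented.
points = [
    ("model-a", date(2024, 1, 10), 40.0),
    ("model-b", date(2024, 3, 5), 35.5),
    ("model-c", date(2024, 6, 20), 62.0),
    ("model-d", date(2024, 9, 1), 58.0),
]


def best_to_date(points):
    """Return (release date, best score seen so far) pairs in date order."""
    ordered = sorted(points, key=lambda p: p[1])
    running = accumulate((score for _, _, score in ordered), max)
    return [(released, best) for (_, released, _), best in zip(ordered, running)]


for released, best in best_to_date(points):
    print(released.isoformat(), best)
```

A model that scores below the current best (like `model-b` and `model-d` here) still appears as a point, but the line stays flat at the earlier maximum.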

Ranking

Rank	Model	Organization	Score
1	GPT-5	OpenAI	88
2	Gemini 2.5 Pro Preview 06-05	Google	82.2
3	o3	OpenAI	81.3
4	Gemini 2.5 Pro	Google	76.5
5	DeepSeek V3.2 Exp	DeepSeek	74.5
6	DeepSeek-R1-0528	DeepSeek	71.6
7	o4-mini	OpenAI	68.9
8	DeepSeek-V3.1	DeepSeek	68.4
9	o3-mini	OpenAI	66.7
10	Gemini 2.5 Flash	Google	61.9
11	Kimi K2-Instruct-0905	Moonshot AI	60
12	Kimi K2 Instruct	Moonshot AI	60
13	Qwen3-235B-A22B-Instruct-2507	Alibaba	57.3
14	GPT-4.1	OpenAI	51.6
15	Qwen3 Next 80B A3B Instruct	Alibaba	49.8
16	DeepSeek-V3	DeepSeek	49.6
17	Magistral Medium	Mistral AI	47.1
18	GPT-4.1 Mini	OpenAI	34.7
19	GPT-4o	OpenAI	30.7
20	Gemini 2.5 Flash Lite	Google	26.7
21	GPT-4.1 Nano	OpenAI	9.8
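
The summary figures at the top of the page (21 models, top score 88, median 60) can be recomputed from the ranking. A short check using only the scores listed above; the variable names are illustrative:

```python
from statistics import median

# Scores copied from the ranking above, in rank order.
scores = [
    88, 82.2, 81.3, 76.5, 74.5, 71.6, 68.9, 68.4, 66.7, 61.9, 60,
    60, 57.3, 51.6, 49.8, 49.6, 47.1, 34.7, 30.7, 26.7, 9.8,
]

summary = {
    "models": len(scores),
    "top": max(scores),
    "median": median(scores),  # odd count, so this is the 11th-ranked score
}
print(summary)
```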

Related Coding benchmarks

LiveCodeBench (282), HumanEval (68), SWE-bench Verified (51), MBPP (31), Terminal-Bench (15), MultiPL-E (12)