Aider Polyglot

A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust.

Models: 21
Top score: 88
Median: 60

Models receive two attempts to solve each problem. If the first attempt fails, the test error output is provided as feedback for the second. The benchmark therefore measures both initial problem-solving ability and the capacity to edit code in response to error feedback, giving an end-to-end evaluation of code generation and editing across multiple programming languages.
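The two-attempt protocol described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `solve` and `run_tests` are hypothetical callables standing in for the model call and the exercise's test runner, and reporting pass rates after one and after two attempts is an assumption about how results might be aggregated.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical interfaces (assumptions, not the real harness):
#   solve(exercise, feedback) -> candidate source code; feedback is None on the first attempt
#   run_tests(exercise, code) -> (passed, error_output)
Solve = Callable[[str, Optional[str]], str]
RunTests = Callable[[str, str], Tuple[bool, str]]


def pass_rates(exercises: Iterable[str], solve: Solve, run_tests: RunTests) -> Tuple[float, float]:
    """Return (percent solved on attempt 1, percent solved within 2 attempts)."""
    exercises = list(exercises)
    if not exercises:
        raise ValueError("need at least one exercise")

    solved_first = 0
    solved_within_two = 0
    for exercise in exercises:
        # Attempt 1: no feedback available yet.
        code = solve(exercise, None)
        passed, errors = run_tests(exercise, code)
        if passed:
            solved_first += 1
            solved_within_two += 1
            continue

        # Attempt 2: only reached on failure; the test errors are fed back to the model.
        code = solve(exercise, errors)
        passed, _ = run_tests(exercise, code)
        if passed:
            solved_within_two += 1

    total = len(exercises)
    return 100 * solved_first / total, 100 * solved_within_two / total
```

With a toy model that solves one of four exercises outright and a second one only after seeing the errors, `pass_rates` returns 25.0 for the first attempt and 50.0 within two attempts.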

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: score (0 to 90) against release date (2024 to 2025). Data points in release order; entries marked "best to date" lie on the state-of-the-art line.]

GPT-4o: 30.7 (2024-05-13), best to date
DeepSeek-V3: 49.6 (2024-12-26), best to date
o3-mini: 66.7 (2025-01-31), best to date
Gemini 2.5 Pro: 76.5 (2025-03-25), best to date
GPT-4.1: 51.6 (2025-04-14)
GPT-4.1 Mini: 34.7 (2025-04-14)
GPT-4.1 Nano: 9.8 (2025-04-14)
o3: 81.3 (2025-04-16), best to date
o4-mini: 68.9 (2025-04-16)
Gemini 2.5 Flash: 61.9 (2025-04-17)
DeepSeek-R1-0528: 71.6 (2025-05-28)
Gemini 2.5 Pro Preview 06-05: 82.2 (2025-06-05), best to date
Magistral Medium: 47.1 (2025-06-10)
Kimi K2 Instruct: 60 (2025-07-11)
Qwen3-235B-A22B-Instruct-2507: 57.3 (2025-07-22)
Gemini 2.5 Flash Lite: 26.7 (2025-07-22)
GPT-5: 88 (2025-08-07), best to date
DeepSeek-V3.1: 68.4 (2025-08-21)
Kimi K2-Instruct-0905: 60 (2025-09-05)
Qwen3 Next 80B A3B Instruct: 49.8 (2025-09-11)
DeepSeek V3.2 Exp: 74.5 (2025-09-29)
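The "best score to date" line is a running maximum over release dates. A small sketch of that computation, using a subset of the data points from the chart: the function name `best_to_date` and the tie-break for models released on the same day (higher score considered first) are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[str, str, float]  # (model, ISO release date, score)

# Subset of the chart's data points; ISO dates sort correctly as strings.
POINTS: List[Point] = [
    ("GPT-4.1", "2025-04-14", 51.6),
    ("o4-mini", "2025-04-16", 68.9),
    ("GPT-4o", "2024-05-13", 30.7),
    ("DeepSeek-V3", "2024-12-26", 49.6),
    ("o3-mini", "2025-01-31", 66.7),
    ("Gemini 2.5 Pro", "2025-03-25", 76.5),
    ("o3", "2025-04-16", 81.3),
    ("Gemini 2.5 Pro Preview 06-05", "2025-06-05", 82.2),
    ("GPT-5", "2025-08-07", 88.0),
    ("DeepSeek V3.2 Exp", "2025-09-29", 74.5),
]


def best_to_date(points: List[Point]) -> List[Point]:
    """Return the points that set a new best score, in release order."""
    frontier: List[Point] = []
    best = float("-inf")
    # Sort by date; within a day, try the higher score first (assumed tie-break).
    for model, released, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best:
            best = score
            frontier.append((model, released, score))
    return frontier


for model, released, score in best_to_date(POINTS):
    print(f"{released}  {model}: {score}")
```

Models that score below the current best, such as GPT-4.1 or DeepSeek V3.2 Exp in this subset, appear as points on the chart but do not move the line.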

Ranking

1. GPT-5: 88
2. Gemini 2.5 Pro Preview 06-05: 82.2
3. o3: 81.3
4. Gemini 2.5 Pro: 76.5
5. DeepSeek V3.2 Exp: 74.5
6. DeepSeek-R1-0528: 71.6
7. o4-mini: 68.9
8. DeepSeek-V3.1: 68.4
9. o3-mini: 66.7
10. Gemini 2.5 Flash: 61.9
11. Kimi K2-Instruct-0905: 60
12. Kimi K2 Instruct: 60
13. Qwen3-235B-A22B-Instruct-2507: 57.3
14. GPT-4.1: 51.6
15. Qwen3 Next 80B A3B Instruct: 49.8
16. DeepSeek-V3: 49.6
17. Magistral Medium: 47.1
18. GPT-4.1 Mini: 34.7
19. GPT-4o: 30.7
20. Gemini 2.5 Flash Lite: 26.7
21. GPT-4.1 Nano: 9.8
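The summary figures at the top of the page (21 models, top score 88, median 60) can be reproduced from the ranking. A short check, with the scores copied from the list above; the variable names are illustrative only.

```python
from statistics import median

# Scores from the ranking, in rank order.
SCORES = [
    88.0, 82.2, 81.3, 76.5, 74.5, 71.6, 68.9,
    68.4, 66.7, 61.9, 60.0, 60.0, 57.3, 51.6,
    49.8, 49.6, 47.1, 34.7, 30.7, 26.7, 9.8,
]

model_count = len(SCORES)
top_score = max(SCORES)
# With 21 entries the median is the 11th value in sorted order.
median_score = median(SCORES)

print(model_count, top_score, median_score)
```

Because the number of models is odd, the median is a single ranked entry (rank 11) rather than an average of two.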
