Aider Polyglot
A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust.
Models: 21
Top score: 88
Median: 60
Models receive two attempts to solve each problem. If the first attempt fails, the test errors are fed back to the model before the second attempt. The benchmark therefore measures both initial problem-solving ability and the capacity to edit code in response to error feedback, giving an end-to-end evaluation of code generation and editing across multiple programming languages.
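The two-attempt protocol described here can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `solve` and `run_tests` are hypothetical callables standing in for the model call and the per-exercise test runner, and the returned field names are made up for this example.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical signatures (assumptions, not the real harness API):
#   solve(exercise, feedback) -> candidate source code
#   run_tests(exercise, code) -> (passed, error_output)
Solve = Callable[[str, Optional[str]], str]
RunTests = Callable[[str, str], Tuple[bool, str]]


def evaluate_two_attempts(
    exercises: Sequence[str], solve: Solve, run_tests: RunTests
) -> Dict[str, float]:
    """Return pass rates (in percent) after the first and second attempts."""
    if not exercises:
        raise ValueError("need at least one exercise")

    solved_first = 0   # solved on the first attempt
    solved_total = 0   # solved within two attempts

    for exercise in exercises:
        # Attempt 1: no feedback is available yet.
        code = solve(exercise, None)
        passed, errors = run_tests(exercise, code)
        if passed:
            solved_first += 1
            solved_total += 1
            continue

        # Attempt 2: the model sees the test errors from attempt 1.
        code = solve(exercise, errors)
        passed, _ = run_tests(exercise, code)
        if passed:
            solved_total += 1

    n = len(exercises)
    return {
        "pass_rate_1": 100.0 * solved_first / n,
        "pass_rate_2": 100.0 * solved_total / n,
    }
```

Tracking both counters separates exercises solved outright from those recovered only after seeing test errors, which is the distinction the description draws between problem solving and editing.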
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
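The "best score to date" line is a running maximum over models ordered by release date. A small sketch of that computation is shown below, assuming each point is a `(release_date, score)` pair. The function name is hypothetical, and since release dates are not listed on this page, none are supplied here.

```python
from datetime import date
from typing import Iterable, List, Tuple

Point = Tuple[date, float]


def best_to_date(points: Iterable[Point]) -> List[Point]:
    """For each release date (in chronological order), return the best score seen so far."""
    frontier: List[Point] = []
    best = float("-inf")
    for released, score in sorted(points):  # sort by date, then by score
        best = max(best, score)
        frontier.append((released, best))
    return frontier
```

A model that scores below the current best leaves the line flat at its release date; only a new high moves the line upward.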
Ranking
| Rank | Model | Score (%) |
| --- | --- | --- |
| 1 | GPT-5 | 88 |
| 2 | Gemini 2.5 Pro Preview 06-05 | 82.2 |
| 3 | o3 | 81.3 |
| 4 | Gemini 2.5 Pro | 76.5 |
| 5 | DeepSeek V3.2 Exp | 74.5 |
| 6 | DeepSeek-R1-0528 | 71.6 |
| 7 | o4-mini | 68.9 |
| 8 | DeepSeek-V3.1 | 68.4 |
| 9 | o3-mini | 66.7 |
| 10 | Gemini 2.5 Flash | 61.9 |
| 11 | Kimi K2-Instruct-0905 | 60 |
| 12 | Kimi K2 Instruct | 60 |
| 13 | Qwen3-235B-A22B-Instruct-2507 | 57.3 |
| 14 | GPT-4.1 | 51.6 |
| 15 | Qwen3 Next 80B A3B Instruct | 49.8 |
| 16 | DeepSeek-V3 | 49.6 |
| 17 | Magistral Medium | 47.1 |
| 18 | GPT-4.1 Mini | 34.7 |
| 19 | GPT-4o | 30.7 |
| 20 | Gemini 2.5 Flash Lite | 26.7 |
| 21 | GPT-4.1 Nano | 9.8 |
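The summary figures at the top of the page (21 models, top score 88, median 60) can be reproduced from the ranking. The sketch below copies the scores from the table in rank order; the function name `summarize` is made up for this example.

```python
from statistics import median
from typing import Dict, List

# Scores copied from the ranking table, in rank order.
SCORES: List[float] = [
    88, 82.2, 81.3, 76.5, 74.5, 71.6, 68.9,
    68.4, 66.7, 61.9, 60, 60, 57.3, 51.6,
    49.8, 49.6, 47.1, 34.7, 30.7, 26.7, 9.8,
]


def summarize(scores: List[float]) -> Dict[str, float]:
    """Compute the model count, top score, and median score."""
    return {
        "models": len(scores),
        "top_score": max(scores),
        "median": median(scores),
    }


print(summarize(SCORES))
```

With an odd number of entries, the median is the middle (11th) score, which is why it coincides with the 60 recorded for the models ranked 11 and 12.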