Aider Polyglot Edit
A challenging multi-language coding benchmark that evaluates models' code editing abilities across C++, Go, Java, JavaScript, Python, and Rust.
Models: 10
Top score: 79.7
Median: 52.9
The benchmark contains 225 of Exercism's most difficult programming problems, chosen because each was solved by 3 or fewer of 7 top coding models. It focuses on code editing tasks and measures both the correctness of solutions and proper use of the edit format. It was designed to recalibrate the evaluation scale so that the top models of the time scored between 5% and 50%.
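The selection rule described above (keep a problem only if 3 or fewer of the 7 reference models solved it) can be sketched as a simple filter. This is a minimal illustration, not the benchmark's actual tooling: the function name `select_hard_problems`, the problem names, and the pass/fail outcomes are all hypothetical.

```python
# Threshold taken from the description: solved by 3 or fewer of 7 models.
MAX_SOLVERS = 3


def select_hard_problems(results, max_solvers=MAX_SOLVERS):
    """Return the problems solved by at most `max_solvers` models, sorted by name.

    `results` maps a problem name to a list of booleans, one per model,
    where True means that model solved the problem.
    """
    return sorted(
        name
        for name, outcomes in results.items()
        if sum(outcomes) <= max_solvers
    )


# Hypothetical outcomes for 7 models; names and values are invented.
results = {
    "problem-a": [True, True, True, True, True, False, True],      # 6 of 7: dropped
    "problem-b": [False, True, False, False, True, False, False],  # 2 of 7: kept
    "problem-c": [False] * 7,                                      # 0 of 7: kept
    "problem-d": [True, True, True, False, False, False, False],   # 3 of 7: kept
    "problem-e": [True, True, True, True, False, False, False],    # 4 of 7: dropped
}

hard = select_hard_problems(results)
print(hard)  # ['problem-b', 'problem-c', 'problem-d']
```

Filtering on the number of solvers rather than on a difficulty label keeps the rule purely empirical: a problem counts as hard only because strong models actually failed it.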
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
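The "best score to date" line is a running maximum over models ordered by release date. The sketch below shows one way to compute it; the function name `best_to_date` is hypothetical, and the dates and scores are invented placeholders, not values from this leaderboard.

```python
from datetime import date


def best_to_date(points):
    """Given (release_date, score) pairs, return (release_date, best_so_far) pairs.

    Points are processed in chronological order, and each output entry
    carries the highest score seen up to and including that date.
    """
    line = []
    best = float("-inf")
    for released, score in sorted(points):
        best = max(best, score)
        line.append((released, best))
    return line


# Hypothetical release dates and scores, for illustration only.
points = [
    (date(2025, 3, 1), 40.0),
    (date(2025, 1, 1), 30.0),
    (date(2025, 2, 1), 55.0),
    (date(2025, 4, 1), 60.0),
]

sota = best_to_date(points)
for released, best in sota:
    print(released.isoformat(), best)
```

A model that scores below the current best (the March entry here) still appears as a point, but the line stays flat at the earlier maximum.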
Ranking
| Rank | Model | Score (%) |
| --- | --- | --- |
| 1 | DeepSeek-V3 | 79.7 |
| 2 | Gemini 2.5 Pro | 72.7 |
| 3 | o3-mini | 60.4 |
| 4 | o4-mini | 58.2 |
| 5 | Gemini 2.5 Flash | 56.7 |
| 6 | GPT-4.1 | 52.9 |
| 7 | GPT-4.5 | 44.9 |
| 8 | GPT-4.1 Mini | 31.6 |
| 9 | GPT-4o | 18.2 |
| 10 | GPT-4.1 Nano | 6.2 |
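The headline figures (model count, top score, median) can be checked against the ranking table with Python's standard `statistics` module. With ten scores there are two middle values, 52.9 and 56.7: the listed median of 52.9 matches the lower one, while averaging the two gives 54.8. Which convention the page uses is an assumption here, not something the page states.

```python
import statistics

# Scores copied from the ranking table, in rank order.
scores = [79.7, 72.7, 60.4, 58.2, 56.7, 52.9, 44.9, 31.6, 18.2, 6.2]

n_models = len(scores)
top_score = max(scores)

# Lower of the two middle values (matches the figure listed on the page).
median_low = statistics.median_low(scores)

# Conventional median for an even count: mean of the two middle values.
# Rounded to one decimal to avoid floating-point noise.
median_mid = round(statistics.median(scores), 1)

print(n_models, top_score, median_low, median_mid)
```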