Aider Polyglot Edit

A challenging multi-language coding benchmark that evaluates models' code editing abilities across C++, Go, Java, JavaScript, Python, and Rust.

Models: 10
Top score: 79.7
Median: 52.9

The benchmark contains 225 of Exercism's most difficult programming problems, chosen because 3 or fewer of 7 top coding models solved them. It focuses on code editing tasks and measures both the correctness of solutions and proper use of the edit format. It was designed to re-calibrate the evaluation scale so that top models would score between 5% and 50%.
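The selection rule described above (keep a problem only if at most 3 of the 7 reference models solved it) can be sketched as a simple filter. This is a minimal illustration, not the benchmark's actual tooling: the function name `select_hard_exercises`, the exercise names, and the solve counts are all hypothetical.

```python
def select_hard_exercises(solve_counts, max_solved=3):
    """Return the exercises solved by at most `max_solved` reference models.

    `solve_counts` maps an exercise name to how many of the 7 reference
    models solved it. The threshold of 3 comes from the description above;
    everything else here is an illustrative assumption.
    """
    return sorted(
        name for name, solved in solve_counts.items() if solved <= max_solved
    )


# Hypothetical solve counts (out of 7 models), for illustration only.
example_counts = {
    "exercise-a": 0,
    "exercise-b": 3,
    "exercise-c": 4,
    "exercise-d": 7,
}

hard = select_hard_exercises(example_counts)
print(hard)
```

With these made-up counts, only the exercises solved by 3 or fewer models are kept; those solved by 4 or more are treated as too easy to discriminate between strong models.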

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: score (0 to 90) by release date (2024 to 2025). Data points, in release order:

GPT-4o: 18.2 (2024-05-13)
DeepSeek-V3: 79.7 (2024-12-26)
o3-mini: 60.4 (2025-01-31)
GPT-4.5: 44.9 (2025-02-27)
Gemini 2.5 Pro: 72.7 (2025-03-25)
GPT-4.1: 52.9 (2025-04-14)
GPT-4.1 Mini: 31.6 (2025-04-14)
GPT-4.1 Nano: 6.2 (2025-04-14)
o4-mini: 58.2 (2025-04-16)
Gemini 2.5 Flash: 56.7 (2025-04-17)
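The "best score to date" line can be reproduced as a running maximum over the points sorted by release date. The sketch below uses the scores and dates listed on this page; the function name `sota_trace` and the tuple layout are assumptions made for illustration, not the site's own code.

```python
# (model, score, release date) as listed on this page.
POINTS = [
    ("o3-mini", 60.4, "2025-01-31"),
    ("GPT-4.5", 44.9, "2025-02-27"),
    ("Gemini 2.5 Pro", 72.7, "2025-03-25"),
    ("GPT-4.1", 52.9, "2025-04-14"),
    ("GPT-4.1 Mini", 31.6, "2025-04-14"),
    ("GPT-4.1 Nano", 6.2, "2025-04-14"),
    ("o4-mini", 58.2, "2025-04-16"),
    ("Gemini 2.5 Flash", 56.7, "2025-04-17"),
    ("GPT-4o", 18.2, "2024-05-13"),
    ("DeepSeek-V3", 79.7, "2024-12-26"),
]


def sota_trace(points):
    """Return the models that set a new best score, in release order."""
    best = float("-inf")
    trace = []
    # ISO 8601 date strings sort chronologically as plain strings.
    for model, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            trace.append((released, model, score))
    return trace


for released, model, score in sota_trace(POINTS):
    print(released, model, score)
```

On this data the trace has two steps: GPT-4o at 18.2, then DeepSeek-V3 at 79.7. No later release on the list exceeds that score, so the line stays flat afterwards.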

Ranking

1. DeepSeek-V3: 79.7
2. Gemini 2.5 Pro: 72.7
3. o3-mini: 60.4
4. o4-mini: 58.2
5. Gemini 2.5 Flash: 56.7
6. GPT-4.1: 52.9
7. GPT-4.5: 44.9
8. GPT-4.1 Mini: 31.6
9. GPT-4o: 18.2
10. GPT-4.1 Nano: 6.2
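The summary figures at the top of the page (10 models, top score 79.7, median 52.9) can be checked against the ranking. With an even number of scores, 52.9 matches the lower of the two middle values rather than their mean. That the page reports a lower median is an assumption inferred from the numbers, not something the page states.

```python
import statistics

# Scores copied from the ranking on this page.
scores = [79.7, 72.7, 60.4, 58.2, 56.7, 52.9, 44.9, 31.6, 18.2, 6.2]

model_count = len(scores)
top_score = max(scores)
# Lower median: the smaller of the two middle values for an even count.
lower_median = statistics.median_low(scores)
# Conventional median: the mean of the two middle values.
mean_median = statistics.median(scores)

print(model_count, top_score, lower_median, round(mean_median, 1))
```

The two middle scores are 52.9 and 56.7, so the conventional median would be about 54.8, while the lower median reproduces the 52.9 shown on the page.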

Related Coding benchmarks