
SWE-bench Verified

An agentic software-engineering test: models resolve real GitHub issues drawn from a human-verified subset of SWE-bench.

Models: 51
Top score: 87.6
Median: 66
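
The three summary figures can be recomputed from the scores listed in the ranking on this page. The following is a minimal sketch: the `scores` list is copied by hand from the ranking, and the variable names are illustrative rather than anything the site itself exposes.

```python
from statistics import median

# Scores copied from the ranking on this page, in rank order (best first).
scores = [
    87.6, 80.9, 80.8, 80.6, 80.6, 80, 79.6, 78, 77.8, 77.2,
    76.2, 74.9, 74.5, 74.5, 73.3, 72.7, 72.5, 71.3, 70.3, 69.4,
    69.1, 68.1, 68, 67.8, 67.2, 66, 65.8, 65.8, 65.8, 64.2,
    63.8, 62.4, 60.4, 57.6, 55.4, 54.6, 49.3, 49, 44.6, 42,
    41.3, 41, 40.6, 38, 33.2, 31.6, 30, 23.6, 22.9, 16.8,
    8.7,
]

model_count = len(scores)      # number of ranked models
top_score = max(scores)        # best score on the board
median_score = median(scores)  # middle value (26th of 51)

print(f"Models: {model_count}")
print(f"Top score: {top_score}")
print(f"Median: {median_score}")
```

With an odd number of entries the median is simply the middle score, so it coincides with the 26th-ranked model.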

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: score (y-axis, 0 to 90) against release date (x-axis, 2024 to 2026).]

Points labeled on the chart (each set a new best score at release):

DeepSeek-V2.5: 16.8 (2024-05-08)
GPT-4o: 33.2 (2024-05-13)
Claude 3.5 Sonnet: 49 (2024-06-20)
o3-mini: 49.3 (2025-01-31)
Claude 3.7 Sonnet: 70.3 (2025-02-24)
Claude Sonnet 4: 72.7 (2025-05-22)
Claude Opus 4.1: 74.5 (2025-08-05)
GPT-5: 74.9 (2025-08-07)
Claude Sonnet 4.5: 77.2 (2025-09-29)
Claude Opus 4.5: 80.9 (2025-11-24)
Claude Opus 4.7: 87.6 (2026-04-16)

Other points:

GPT-4o-mini: 8.7 (2024-07-18)
o1-preview: 41.3 (2024-09-12)
Claude 3.5 Haiku: 40.6 (2024-11-04)
o1: 41 (2024-12-05)
DeepSeek-V3: 42 (2024-12-26)
GPT-4.5: 38 (2025-02-27)
Gemini 2.5 Pro: 63.8 (2025-03-25)
Llama 4 Maverick: 30 (2025-04-05)
GPT-4.1: 54.6 (2025-04-14)
GPT-4.1 Mini: 23.6 (2025-04-14)
o3: 69.1 (2025-04-16)
o4-mini: 68.1 (2025-04-16)
Gemini 2.5 Flash: 60.4 (2025-04-17)
Gemini Diffusion: 22.9 (2025-05-20)
Claude Opus 4: 72.5 (2025-05-22)
DeepSeek-R1-0528: 44.6 (2025-05-28)
Gemini 2.5 Pro Preview 06-05: 67.2 (2025-06-05)
Kimi K2 Instruct: 65.8 (2025-07-11)
Kimi K2: 65.8 (2025-07-11)
Qwen3-Coder: 55.4 (2025-07-22)
Gemini 2.5 Flash Lite: 31.6 (2025-07-22)
GLM 4.5 Air: 57.6 (2025-07-25)
GLM-4.5: 64.2 (2025-07-28)
gpt-oss-120b: 62.4 (2025-08-05)
DeepSeek-V3.1: 66 (2025-08-21)
Kimi K2-Instruct-0905: 65.8 (2025-09-05)
GPT-5 Codex: 74.5 (2025-09-23)
DeepSeek V3.2 Exp: 67.8 (2025-09-29)
GLM-4.6: 68 (2025-09-30)
Claude Haiku 4.5: 73.3 (2025-10-15)
MiniMax-M2: 69.4 (2025-10-27)
Kimi K2 Thinking: 71.3 (2025-11-06)
Gemini 3 Pro: 76.2 (2025-11-18)
GPT-5.2: 80 (2025-12-11)
Gemini 3 Flash: 78 (2025-12-17)
Claude Opus 4.6: 80.8 (2026-02-05)
GLM-5: 77.8 (2026-02-11)
Claude Sonnet 4.6: 79.6 (2026-02-17)
Gemini 3.1 Pro: 80.6 (2026-02-19)
DeepSeek-V4-Pro: 80.6 (2026-04-24)
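
The "best score to date" line described in the caption amounts to a running maximum over the points in release order. The sketch below illustrates that idea on a subset of the chart's points. The function name `best_to_date` and the tie-handling (same-day releases considered highest score first, and only a strictly higher score counting as a new best) are assumptions made for illustration, not a description of how the site computes its line.

```python
from datetime import date

# A subset of the chart's points: (model, score, release date).
points = [
    ("GPT-4o-mini", 8.7, date(2024, 7, 18)),
    ("o1-preview", 41.3, date(2024, 9, 12)),
    ("GPT-4.5", 38, date(2025, 2, 27)),
    ("Claude Opus 4", 72.5, date(2025, 5, 22)),
    ("DeepSeek-V2.5", 16.8, date(2024, 5, 8)),
    ("GPT-4o", 33.2, date(2024, 5, 13)),
    ("Claude 3.5 Sonnet", 49, date(2024, 6, 20)),
    ("o3-mini", 49.3, date(2025, 1, 31)),
    ("Claude 3.7 Sonnet", 70.3, date(2025, 2, 24)),
    ("Claude Sonnet 4", 72.7, date(2025, 5, 22)),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order."""
    frontier = []
    best = float("-inf")
    # Assumption: for same-day releases, consider the higher score first.
    ordered = sorted(points, key=lambda p: (p[2], -p[1]))
    for model, score, released in ordered:
        if score > best:  # assumption: ties do not count as a new best
            best = score
            frontier.append((model, score, released))
    return frontier


for model, score, released in best_to_date(points):
    print(f"{released.isoformat()}  {score:>5}  {model}")
```

On this subset the function keeps DeepSeek-V2.5, GPT-4o, Claude 3.5 Sonnet, o3-mini, Claude 3.7 Sonnet, and Claude Sonnet 4, and skips models that were released below the best score at the time.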

Ranking

1. Claude Opus 4.7: 87.6
2. Claude Opus 4.5: 80.9
3. Claude Opus 4.6: 80.8
4. DeepSeek-V4-Pro: 80.6
5. Gemini 3.1 Pro: 80.6
6. GPT-5.2: 80
7. Claude Sonnet 4.6: 79.6
8. Gemini 3 Flash: 78
9. GLM-5: 77.8
10. Claude Sonnet 4.5: 77.2
11. Gemini 3 Pro: 76.2
12. GPT-5: 74.9
13. GPT-5 Codex: 74.5
14. Claude Opus 4.1: 74.5
15. Claude Haiku 4.5: 73.3
16. Claude Sonnet 4: 72.7
17. Claude Opus 4: 72.5
18. Kimi K2 Thinking: 71.3
19. Claude 3.7 Sonnet: 70.3
20. MiniMax-M2: 69.4
21. o3: 69.1
22. o4-mini: 68.1
23. GLM-4.6: 68
24. DeepSeek V3.2 Exp: 67.8
25. Gemini 2.5 Pro Preview 06-05: 67.2
26. DeepSeek-V3.1: 66
27. Kimi K2-Instruct-0905: 65.8
28. Kimi K2 Instruct: 65.8
29. Kimi K2: 65.8
30. GLM-4.5: 64.2
31. Gemini 2.5 Pro: 63.8
32. gpt-oss-120b: 62.4
33. Gemini 2.5 Flash: 60.4
34. GLM 4.5 Air: 57.6
35. Qwen3-Coder: 55.4
36. GPT-4.1: 54.6
37. o3-mini: 49.3
38. Claude 3.5 Sonnet: 49
39. DeepSeek-R1-0528: 44.6
40. DeepSeek-V3: 42
41. o1-preview: 41.3
42. o1: 41
43. Claude 3.5 Haiku: 40.6
44. GPT-4.5: 38
45. GPT-4o: 33.2
46. Gemini 2.5 Flash Lite: 31.6
47. Llama 4 Maverick: 30
48. GPT-4.1 Mini: 23.6
49. Gemini Diffusion: 22.9
50. DeepSeek-V2.5: 16.8
51. GPT-4o-mini: 8.7

Related Coding benchmarks