SWE-bench Verified
Resolving real GitHub issues (human-verified subset) — an agentic software-engineering test.
51 Models
87.6 Top score
66 Median
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
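The "best score to date" line described above is a running maximum over models ordered by release date. A minimal sketch of that computation follows; the function name `best_to_date` and the sample points are hypothetical placeholders, since this page lists scores but not release dates.

```python
from datetime import date
from itertools import accumulate


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs, ordered by date.

    `points` is an iterable of (release_date, score) tuples in any order.
    """
    ordered = sorted(points, key=lambda p: p[0])
    # Running maximum of the scores, in release order.
    bests = accumulate((score for _, score in ordered), max)
    return [(released, best) for (released, _), best in zip(ordered, bests)]


# Placeholder data for illustration only: these dates and scores are
# invented and do not correspond to any model on the leaderboard.
example = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 6, 1), 30.0),
    (date(2024, 3, 1), 25.0),
    (date(2024, 9, 1), 28.0),
]

frontier = best_to_date(example)
for released, best in frontier:
    print(released.isoformat(), best)
```

A model that scores below the current best (the last placeholder point here) still appears as a point on the chart but leaves the line flat.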
Ranking
| Rank | Model | Score (%) |
| --- | --- | --- |
| 1 | Claude Opus 4.7 | 87.6 |
| 2 | Claude Opus 4.5 | 80.9 |
| 3 | Claude Opus 4.6 | 80.8 |
| 4 | DeepSeek-V4-Pro | 80.6 |
| 5 | Gemini 3.1 Pro | 80.6 |
| 6 | GPT-5.2 | 80 |
| 7 | Claude Sonnet 4.6 | 79.6 |
| 8 | Gemini 3 Flash | 78 |
| 9 | GLM-5 | 77.8 |
| 10 | Claude Sonnet 4.5 | 77.2 |
| 11 | Gemini 3 Pro | 76.2 |
| 12 | GPT-5 | 74.9 |
| 13 | GPT-5 Codex | 74.5 |
| 14 | Claude Opus 4.1 | 74.5 |
| 15 | Claude Haiku 4.5 | 73.3 |
| 16 | Claude Sonnet 4 | 72.7 |
| 17 | Claude Opus 4 | 72.5 |
| 18 | Kimi K2 Thinking | 71.3 |
| 19 | Claude 3.7 Sonnet | 70.3 |
| 20 | MiniMax-M2 | 69.4 |
| 21 | o3 | 69.1 |
| 22 | o4-mini | 68.1 |
| 23 | GLM-4.6 | 68 |
| 24 | DeepSeek V3.2 Exp | 67.8 |
| 25 | Gemini 2.5 Pro Preview 06-05 | 67.2 |
| 26 | DeepSeek-V3.1 | 66 |
| 27 | Kimi K2-Instruct-0905 | 65.8 |
| 28 | Kimi K2 Instruct | 65.8 |
| 29 | Kimi K2 | 65.8 |
| 30 | GLM-4.5 | 64.2 |
| 31 | Gemini 2.5 Pro | 63.8 |
| 32 | gpt-oss-120b | 62.4 |
| 33 | Gemini 2.5 Flash | 60.4 |
| 34 | GLM 4.5 Air | 57.6 |
| 35 | Qwen3-Coder | 55.4 |
| 36 | GPT-4.1 | 54.6 |
| 37 | o3-mini | 49.3 |
| 38 | Claude 3.5 Sonnet | 49 |
| 39 | DeepSeek-R1-0528 | 44.6 |
| 40 | DeepSeek-V3 | 42 |
| 41 | o1-preview | 41.3 |
| 42 | o1 | 41 |
| 43 | Claude 3.5 Haiku | 40.6 |
| 44 | GPT-4.5 | 38 |
| 45 | GPT-4o | 33.2 |
| 46 | Gemini 2.5 Flash Lite | 31.6 |
| 47 | Llama 4 Maverick | 30 |
| 48 | GPT-4.1 Mini | 23.6 |
| 49 | Gemini Diffusion | 22.9 |
| 50 | DeepSeek-V2.5 | 16.8 |
| 51 | GPT-4o-mini | 8.7 |
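
The summary figures at the top of the page (model count, top score, median) can be re-derived from the ranking. The sketch below copies the scores from the table in rank order and uses only the Python standard library; the variable names are illustrative.

```python
from statistics import median

# Scores copied from the ranking above, in rank order (1 to 51).
scores = [
    87.6, 80.9, 80.8, 80.6, 80.6, 80, 79.6, 78, 77.8, 77.2,
    76.2, 74.9, 74.5, 74.5, 73.3, 72.7, 72.5, 71.3, 70.3, 69.4,
    69.1, 68.1, 68, 67.8, 67.2, 66, 65.8, 65.8, 65.8, 64.2,
    63.8, 62.4, 60.4, 57.6, 55.4, 54.6, 49.3, 49, 44.6, 42,
    41.3, 41, 40.6, 38, 33.2, 31.6, 30, 23.6, 22.9, 16.8,
    8.7,
]

model_count = len(scores)
top_score = max(scores)
# With an odd number of entries, the median is the middle value (rank 26).
median_score = median(scores)

print(model_count, top_score, median_score)
```

With 51 entries the median is simply the score at rank 26, which is why it coincides with a single row of the table rather than an average of two.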