Terminal-Bench
Models: 15
Top score: 50
Median: 37.5
Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks such as addressing cybersecurity vulnerabilities. The benchmark consists of a dataset of roughly 100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
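
The "best score to date" line described above is a running maximum over scores ordered by release date. A minimal sketch of that computation follows. The function name `best_to_date` and the sample points are assumptions made for illustration: this page does not list release dates, so the dates and scores in the example are placeholders, not real data.

```python
from datetime import date


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in chronological order.

    `points` is an iterable of (release_date, score) tuples in any order.
    """
    frontier = []
    best = float("-inf")
    # Sorting tuples orders them by release date first.
    for release, score in sorted(points):
        best = max(best, score)  # the line never goes down
        frontier.append((release, best))
    return frontier


# Hypothetical placeholder points, NOT real release dates or benchmark scores.
points = [
    (date(2025, 3, 1), 20.0),
    (date(2025, 1, 1), 10.0),
    (date(2025, 2, 1), 15.0),
    (date(2025, 4, 1), 18.0),
]

frontier = best_to_date(points)
for release, best in frontier:
    print(release.isoformat(), best)
```

A later model that scores below the current best (the last placeholder point here) still appears as a point on the chart, but leaves the line flat.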
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Claude Sonnet 4.5 | 50 |
| 2 | MiniMax-M2 | 46.3 |
| 3 | Claude Opus 4.1 | 43.3 |
| 4 | Claude Haiku 4.5 | 41 |
| 5 | GLM-4.6 | 40.5 |
| 6 | Claude Opus 4 | 39.2 |
| 7 | DeepSeek V3.2 Exp | 37.7 |
| 8 | GLM-4.5 | 37.5 |
| 9 | Claude Sonnet 4 | 35.5 |
| 10 | Claude 3.7 Sonnet | 35.2 |
| 11 | DeepSeek-V3.1 | 31.3 |
| 12 | Kimi K2 Instruct | 30 |
| 13 | GLM 4.5 Air | 30 |
| 14 | Kimi K2-Instruct-0905 | 25 |
| 15 | DeepSeek-R1-0528 | 5.7 |
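
The summary figures at the top of the page (model count, top score, median) can be re-derived from the ranking. The sketch below copies the scores from the table and recomputes them with Python's standard library. The helper name `summarize` is an assumption for illustration and is not part of the benchmark's tooling.

```python
from statistics import median

# Scores copied from the ranking table (model name -> score).
scores = {
    "Claude Sonnet 4.5": 50.0,
    "MiniMax-M2": 46.3,
    "Claude Opus 4.1": 43.3,
    "Claude Haiku 4.5": 41.0,
    "GLM-4.6": 40.5,
    "Claude Opus 4": 39.2,
    "DeepSeek V3.2 Exp": 37.7,
    "GLM-4.5": 37.5,
    "Claude Sonnet 4": 35.5,
    "Claude 3.7 Sonnet": 35.2,
    "DeepSeek-V3.1": 31.3,
    "Kimi K2 Instruct": 30.0,
    "GLM 4.5 Air": 30.0,
    "Kimi K2-Instruct-0905": 25.0,
    "DeepSeek-R1-0528": 5.7,
}


def summarize(scores):
    """Return the model count, top score, and median of a score mapping."""
    values = list(scores.values())
    return {
        "models": len(values),
        "top_score": max(values),
        "median": median(values),
    }


summary = summarize(scores)
print(summary)
```

With 15 entries the median is the 8th score in sorted order, which is GLM-4.5's 37.5, matching the figure shown in the page header.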