Terminal-Bench

Models: 15
Top score: 50
Median: 37.5

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously: compiling code, training models, setting up servers, administering systems, running data science workflows, and carrying out security tasks, including work on cybersecurity vulnerabilities. The benchmark consists of a dataset of roughly 100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart data (score axis 0 to 60; release dates in 2025), in release order:

2025-02-24: Claude 3.7 Sonnet, 35.2 (best to date)
2025-05-22: Claude Sonnet 4, 35.5
2025-05-22: Claude Opus 4, 39.2 (best to date)
2025-05-28: DeepSeek-R1-0528, 5.7
2025-07-11: Kimi K2 Instruct, 30
2025-07-25: GLM 4.5 Air, 30
2025-07-28: GLM-4.5, 37.5
2025-08-05: Claude Opus 4.1, 43.3 (best to date)
2025-08-21: DeepSeek-V3.1, 31.3
2025-09-05: Kimi K2-Instruct-0905, 25
2025-09-29: Claude Sonnet 4.5, 50 (best to date)
2025-09-29: DeepSeek V3.2 Exp, 37.7
2025-09-30: GLM-4.6, 40.5
2025-10-15: Claude Haiku 4.5, 41
2025-10-27: MiniMax-M2, 46.3
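The "best score to date" line can be reproduced as a running maximum over the chart's points. The following is a minimal sketch, not the site's own code: the names `POINTS` and `best_to_date` are hypothetical, and the tie-break for models released on the same day (higher score first) is an assumption.

```python
from datetime import date

# (model, score, release date), copied from the chart data on this page.
POINTS = [
    ("Claude Sonnet 4", 35.5, date(2025, 5, 22)),
    ("DeepSeek-R1-0528", 5.7, date(2025, 5, 28)),
    ("Kimi K2 Instruct", 30, date(2025, 7, 11)),
    ("GLM 4.5 Air", 30, date(2025, 7, 25)),
    ("GLM-4.5", 37.5, date(2025, 7, 28)),
    ("DeepSeek-V3.1", 31.3, date(2025, 8, 21)),
    ("Kimi K2-Instruct-0905", 25, date(2025, 9, 5)),
    ("DeepSeek V3.2 Exp", 37.7, date(2025, 9, 29)),
    ("GLM-4.6", 40.5, date(2025, 9, 30)),
    ("Claude Haiku 4.5", 41, date(2025, 10, 15)),
    ("MiniMax-M2", 46.3, date(2025, 10, 27)),
    ("Claude 3.7 Sonnet", 35.2, date(2025, 2, 24)),
    ("Claude Opus 4", 39.2, date(2025, 5, 22)),
    ("Claude Opus 4.1", 43.3, date(2025, 8, 5)),
    ("Claude Sonnet 4.5", 50, date(2025, 9, 29)),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order."""
    frontier = []
    best = float("-inf")
    # Sort by release date; on a shared date, visit the higher score first
    # (assumed tie-break) so only the day's best can extend the frontier.
    for model, score, released in sorted(points, key=lambda p: (p[2], -p[1])):
        if score > best:
            best = score
            frontier.append((model, score, released))
    return frontier


if __name__ == "__main__":
    for model, score, released in best_to_date(POINTS):
        print(f"{released.isoformat()}  {model}: {score}")
```

Under these assumptions the frontier contains four models: Claude 3.7 Sonnet, Claude Opus 4, Claude Opus 4.1, and Claude Sonnet 4.5.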

Ranking

1. Claude Sonnet 4.5: 50
2. MiniMax-M2: 46.3
3. Claude Opus 4.1: 43.3
4. Claude Haiku 4.5: 41
5. GLM-4.6: 40.5
6. Claude Opus 4: 39.2
7. DeepSeek V3.2 Exp: 37.7
8. GLM-4.5: 37.5
9. Claude Sonnet 4: 35.5
10. Claude 3.7 Sonnet: 35.2
11. DeepSeek-V3.1: 31.3
12. Kimi K2 Instruct: 30
13. GLM 4.5 Air: 30
14. Kimi K2-Instruct-0905: 25
15. DeepSeek-R1-0528: 5.7
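
The headline figures for this page (15 models, top score 50, median 37.5) follow directly from the ranking. The sketch below recomputes them with Python's standard library; the names `SCORES` and `summarize` are hypothetical and used only for illustration.

```python
from statistics import median

# Model scores copied from the ranking on this page.
SCORES = {
    "Claude Sonnet 4.5": 50,
    "MiniMax-M2": 46.3,
    "Claude Opus 4.1": 43.3,
    "Claude Haiku 4.5": 41,
    "GLM-4.6": 40.5,
    "Claude Opus 4": 39.2,
    "DeepSeek V3.2 Exp": 37.7,
    "GLM-4.5": 37.5,
    "Claude Sonnet 4": 35.5,
    "Claude 3.7 Sonnet": 35.2,
    "DeepSeek-V3.1": 31.3,
    "Kimi K2 Instruct": 30,
    "GLM 4.5 Air": 30,
    "Kimi K2-Instruct-0905": 25,
    "DeepSeek-R1-0528": 5.7,
}


def summarize(scores):
    """Return (model count, top model, top score, median score)."""
    top_model = max(scores, key=scores.get)
    return len(scores), top_model, scores[top_model], median(scores.values())


if __name__ == "__main__":
    count, top_model, top_score, mid = summarize(SCORES)
    print(f"Models: {count}")
    print(f"Top score: {top_score} ({top_model})")
    print(f"Median: {mid}")
```

With 15 entries the median is the 8th score in sorted order, which is GLM-4.5's 37.5.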
