Terminal-Bench

Models: 15

Top score: 50

Median: 37.5

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks involving cybersecurity vulnerabilities. The benchmark has two parts: a dataset of ~100 hand-crafted, human-verified tasks, and an execution harness that connects language models to a terminal sandbox.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
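The "best score to date" line described above is a running maximum over models ordered by release date. The following is a minimal sketch of that calculation. The function name `best_to_date` and the sample points are hypothetical placeholders, not release dates or scores taken from this page.

```python
from datetime import date
from itertools import accumulate


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in date order.

    `points` is an iterable of (release_date, score) tuples.
    """
    ordered = sorted(points, key=lambda p: p[0])
    # accumulate with max yields the running maximum of the scores
    running = accumulate((score for _, score in ordered), max)
    return [(d, best) for (d, _), best in zip(ordered, running)]


# Hypothetical placeholder data, for illustration only.
sample_points = [
    (date(2025, 1, 1), 10.0),
    (date(2025, 3, 1), 8.0),
    (date(2025, 2, 1), 12.0),
]

line = best_to_date(sample_points)
```

A model that scores below the current best (the third placeholder point here) still appears as a point on the chart, but the line stays flat at the earlier best.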

Ranking

Rank	Model	Organization	Score
1	Claude Sonnet 4.5	Anthropic	50
2	MiniMax-M2	MiniMax	46.3
3	Claude Opus 4.1	Anthropic	43.3
4	Claude Haiku 4.5	Anthropic	41
5	GLM-4.6	Zhipu AI	40.5
6	Claude Opus 4	Anthropic	39.2
7	DeepSeek V3.2 Exp	DeepSeek	37.7
8	GLM-4.5	Zhipu AI	37.5
9	Claude Sonnet 4	Anthropic	35.5
10	Claude 3.7 Sonnet	Anthropic	35.2
11	DeepSeek-V3.1	DeepSeek	31.3
12	Kimi K2 Instruct	Moonshot AI	30
13	GLM 4.5 Air	Zhipu AI	30
14	Kimi K2-Instruct-0905	Moonshot AI	25
15	DeepSeek-R1-0528	DeepSeek	5.7
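The headline figures for this benchmark (model count, top score, median) can be rechecked from the ranking. The sketch below copies the scores from the list above in rank order; the variable names are illustrative.

```python
from statistics import median

# Scores copied from the ranking, rank 1 through rank 15.
scores = [
    50, 46.3, 43.3, 41, 40.5, 39.2, 37.7, 37.5,
    35.5, 35.2, 31.3, 30, 30, 25, 5.7,
]

model_count = len(scores)
top_score = max(scores)
# With 15 entries the median is the 8th-ranked score.
median_score = median(scores)

print(model_count, top_score, median_score)
```

With an odd number of entries the median is a single listed score, here the rank 8 entry (GLM-4.5).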

Related Coding benchmarks

LiveCodeBench (282), HumanEval (68), SWE-bench Verified (51), MBPP (31), Aider Polyglot (21), MultiPL-E (12)