Agents

TAU-bench Airline

Models: 20

Top score: 70

Median: 49.2

Part of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains. The airline domain evaluates language agents' ability to interact with users through dynamic conversations while following domain-specific rules and using API tools. Agents must handle airline-related tasks and policies reliably.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
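
The "best score to date" line described above is a running maximum over models ordered by release date. A minimal Python sketch follows, assuming that reading of the chart. The dates and scores in `points` are hypothetical placeholders invented for illustration (the page does not list release dates), and `best_to_date` is a helper name introduced here, not something the leaderboard provides.

```python
from datetime import date
from itertools import accumulate

# Hypothetical (release date, score) pairs, invented for illustration only.
# They are NOT taken from the leaderboard.
points = [
    (date(2024, 6, 1), 35.0),
    (date(2024, 1, 1), 40.0),
    (date(2025, 2, 1), 48.0),
    (date(2024, 9, 1), 52.0),
]


def best_to_date(points):
    """Return (date, best score so far) pairs, ordered by release date."""
    ordered = sorted(points, key=lambda p: p[0])
    # accumulate with max yields the running maximum of the scores.
    running = accumulate((score for _, score in ordered), max)
    return [(d, best) for (d, _), best in zip(ordered, running)]


frontier = best_to_date(points)
for d, best in frontier:
    print(d.isoformat(), best)
```

A later model that scores below the current best leaves the line flat, so the line only ever steps upward.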

Ranking

Rank	Model	Organization	Score
1	Claude Sonnet 4.5	Anthropic	70
2	GLM 4.5 Air	Zhipu AI	60.8
3	GLM-4.5	Zhipu AI	60.4
4	Claude Sonnet 4	Anthropic	60
5	Claude Opus 4	Anthropic	59.6
6	Claude 3.7 Sonnet	Anthropic	58.4
7	Claude Opus 4.1	Anthropic	56
8	GPT-4.5	OpenAI	50
9	o1	OpenAI	50
10	GPT-4.1	OpenAI	49.4
11	o4-mini	OpenAI	49.2
12	Qwen3 Next 80B A3B Thinking	Alibaba	49
13	Qwen3-235B-A22B-Thinking-2507	Alibaba	46
14	Claude 3.5 Sonnet	Anthropic	46
15	Qwen3 Next 80B A3B Instruct	Alibaba	44
16	GPT-4o	OpenAI	42.8
17	GPT-4.1 Mini	OpenAI	36
18	o3-mini	OpenAI	32.4
19	Claude 3.5 Haiku	Anthropic	22.8
20	GPT-4.1 Nano	OpenAI	14
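
The summary figures at the top of the page (20 models, top score 70, median 49.2) can be recomputed from the ranking. The Python sketch below copies the 20 scores from the list above. With an even number of entries, the conventional median averages the two middle scores (49.4 and 49.2), giving about 49.3, while the reported 49.2 coincides with the lower of the two. Which convention the page actually uses is an assumption here, not something the page states.

```python
from statistics import median, median_low

# Scores copied from the ranking, in rank order.
scores = [70, 60.8, 60.4, 60, 59.6, 58.4, 56, 50, 50, 49.4,
          49.2, 49, 46, 46, 44, 42.8, 36, 32.4, 22.8, 14]

n_models = len(scores)         # number of ranked models
top_score = max(scores)        # best score on the board
mid_avg = median(scores)       # mean of the two middle scores
mid_low = median_low(scores)   # lower of the two middle scores

print(n_models, top_score, round(mid_avg, 1), mid_low)
```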

Related Agents benchmarks

TAU-bench Retail (22)
BFCL (10)
τ²-bench Airline (10)
τ²-bench Retail (10)
BrowseComp (10)