Agents

τ²-bench Airline

The airline domain of τ²-bench (TAU2) evaluates conversational agents in dual-control environments, where both the AI agent and the user interact with tools, using airline customer service scenarios.


Models: 10

Top score: 64.8

Median: 56.5

The benchmark tests agent coordination, communication, and the ability to guide user actions in tasks such as flight booking, modifications, cancellations, and refunds.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
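The "best score to date" line described above is a running maximum over scores ordered by release date. A minimal sketch of that calculation follows; the model names, dates, and scores are hypothetical placeholders, not values from this leaderboard, and `best_to_date` is an illustrative helper rather than anything the page itself exposes.

```python
from datetime import date
from itertools import accumulate

# Hypothetical (release_date, model, score) points; NOT taken from the leaderboard.
points = [
    (date(2025, 3, 10), "model-c", 52.0),
    (date(2024, 5, 1), "model-a", 45.5),
    (date(2025, 8, 7), "model-d", 62.6),
    (date(2025, 1, 15), "model-b", 60.0),
]


def best_to_date(points):
    """Return (release_date, best score so far) for each point, in date order."""
    ordered = sorted(points, key=lambda p: p[0])
    # accumulate with max yields the running maximum of the scores.
    running = accumulate((score for _, _, score in ordered), max)
    return [(d, best) for (d, _, _), best in zip(ordered, running)]


frontier = best_to_date(points)
for release_date, best in frontier:
    print(release_date.isoformat(), best)
```

A model that scores below the current best (here the placeholder `model-c`) still appears as a point, but the line stays flat at the earlier maximum.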

Ranking

Rank	Model	Organization	Score
1	o3	OpenAI	64.8
2	Claude Haiku 4.5	Anthropic	63.6
3	GPT-5	OpenAI	62.6
4	Qwen3 Next 80B A3B Thinking	Alibaba	60.5
5	Qwen3-235B-A22B-Thinking-2507	Alibaba	58
6	Kimi K2-Instruct-0905	Moonshot AI	56.5
7	Kimi K2 Instruct	Moonshot AI	56.5
8	Qwen3 Next 80B A3B Instruct	Alibaba	45.5
9	GPT-4o	OpenAI	45.5
10	Qwen3-235B-A22B-Instruct-2507	Alibaba	44

Related Agents benchmarks

TAU-bench Retail	22
TAU-bench Airline	20
BFCL	10
τ²-bench Retail	10
BrowseComp	10