τ²-bench Airline

Models: 10
Top score: 64.8
Median: 56.5

The τ²-bench (TAU2) airline domain benchmark evaluates conversational agents in dual-control environments, where both the AI agent and the user interact with tools in airline customer service scenarios. It tests agent coordination, communication, and the ability to guide user actions in tasks such as flight booking, modifications, cancellations, and refunds.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart data points, in release order (model: score, release date):

GPT-4o: 45.5 (2024-05-13)
o3: 64.8 (2025-04-16)
Kimi K2 Instruct: 56.5 (2025-07-11)
Qwen3-235B-A22B-Instruct-2507: 44 (2025-07-22)
Qwen3-235B-A22B-Thinking-2507: 58 (2025-07-25)
GPT-5: 62.6 (2025-08-07)
Kimi K2-Instruct-0905: 56.5 (2025-09-05)
Qwen3 Next 80B A3B Thinking: 60.5 (2025-09-11)
Qwen3 Next 80B A3B Instruct: 45.5 (2025-09-11)
Claude Haiku 4.5: 63.6 (2025-10-15)
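The "best score to date" line can be reproduced from the chart's data points with a running maximum over release dates. The following is a minimal sketch, not the site's implementation: the names `POINTS` and `best_to_date` are hypothetical, and the values are copied from the chart labels.

```python
from datetime import date

# (model, score, release date), copied from the chart's data-point labels.
POINTS = [
    ("Kimi K2 Instruct", 56.5, date(2025, 7, 11)),
    ("Qwen3-235B-A22B-Instruct-2507", 44.0, date(2025, 7, 22)),
    ("Qwen3-235B-A22B-Thinking-2507", 58.0, date(2025, 7, 25)),
    ("GPT-5", 62.6, date(2025, 8, 7)),
    ("Kimi K2-Instruct-0905", 56.5, date(2025, 9, 5)),
    ("Qwen3 Next 80B A3B Thinking", 60.5, date(2025, 9, 11)),
    ("Qwen3 Next 80B A3B Instruct", 45.5, date(2025, 9, 11)),
    ("Claude Haiku 4.5", 63.6, date(2025, 10, 15)),
    ("GPT-4o", 45.5, date(2024, 5, 13)),
    ("o3", 64.8, date(2025, 4, 16)),
]


def best_to_date(points):
    """Return (release_date, model, score) for each model that raised the best score."""
    frontier = []
    best = float("-inf")
    # Walk the models in release order and keep only strict improvements.
    for model, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            frontier.append((released, model, score))
    return frontier


for released, model, score in best_to_date(POINTS):
    print(released.isoformat(), model, score)
```

With these values only GPT-4o and o3 advance the line, since no model released after o3 exceeds 64.8.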

Ranking

1. o3: 64.8
2. Claude Haiku 4.5: 63.6
3. GPT-5: 62.6
4. Qwen3 Next 80B A3B Thinking: 60.5
5. Qwen3-235B-A22B-Thinking-2507: 58
6. Kimi K2-Instruct-0905: 56.5
7. Kimi K2 Instruct: 56.5
8. Qwen3 Next 80B A3B Instruct: 45.5
9. GPT-4o: 45.5
10. Qwen3-235B-A22B-Instruct-2507: 44
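
The summary figures at the top of the page (10 models, top score 64.8, median 56.5) can be checked against this ranking. The page does not say how it computes the median. The sketch below assumes a lower median, which matches the listed 56.5, and also shows the conventional median for comparison. The name `SCORES` is hypothetical.

```python
import statistics

# Scores copied from the ranking above.
SCORES = {
    "o3": 64.8,
    "Claude Haiku 4.5": 63.6,
    "GPT-5": 62.6,
    "Qwen3 Next 80B A3B Thinking": 60.5,
    "Qwen3-235B-A22B-Thinking-2507": 58.0,
    "Kimi K2-Instruct-0905": 56.5,
    "Kimi K2 Instruct": 56.5,
    "Qwen3 Next 80B A3B Instruct": 45.5,
    "GPT-4o": 45.5,
    "Qwen3-235B-A22B-Instruct-2507": 44.0,
}

n_models = len(SCORES)
top_model = max(SCORES, key=SCORES.get)
top_score = SCORES[top_model]

# Assumption: the page reports the lower of the two middle values.
lower_median = statistics.median_low(SCORES.values())
# Conventional median for an even count: mean of the two middle values.
conventional_median = statistics.median(SCORES.values())

print(n_models, top_model, top_score)
print(lower_median, conventional_median)
```

With ten entries the two middle scores are 56.5 and 58, so the conventional median is 57.25, while the lower median reproduces the page's 56.5.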

Related Agents benchmarks