τ²-bench Airline
The airline domain of the TAU2 (τ²-bench) benchmark, which evaluates conversational agents in dual-control environments where both the AI agent and the user interact with tools in airline customer service scenarios.
Models: 10
Top score: 64.8
Median: 56.5
The benchmark tests agent coordination, communication, and the ability to guide user actions in tasks such as flight booking, modification, cancellation, and refunds.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | o3 | 64.8 |
| 2 | Claude Haiku 4.5 | 63.6 |
| 3 | GPT-5 | 62.6 |
| 4 | Qwen3 Next 80B A3B Thinking | 60.5 |
| 5 | Qwen3-235B-A22B-Thinking-2507 | 58.0 |
| 6 | Kimi K2-Instruct-0905 | 56.5 |
| 7 | Kimi K2 Instruct | 56.5 |
| 8 | Qwen3 Next 80B A3B Instruct | 45.5 |
| 9 | GPT-4o | 45.5 |
| 10 | Qwen3-235B-A22B-Instruct-2507 | 44.0 |
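
The summary figures at the top of the page can be recomputed from the ranking. The sketch below is illustrative only: the variable names are made up for this example, and how the page computes its median is an assumption. With ten scores there are two middle values (56.5 and 58), so the conventional median would be their mean, 57.25; the reported 56.5 matches the lower of the two, which is what `statistics.median_low` returns.

```python
from statistics import median, median_low

# Scores copied from the ranking table above.
scores = {
    "o3": 64.8,
    "Claude Haiku 4.5": 63.6,
    "GPT-5": 62.6,
    "Qwen3 Next 80B A3B Thinking": 60.5,
    "Qwen3-235B-A22B-Thinking-2507": 58.0,
    "Kimi K2-Instruct-0905": 56.5,
    "Kimi K2 Instruct": 56.5,
    "Qwen3 Next 80B A3B Instruct": 45.5,
    "GPT-4o": 45.5,
    "Qwen3-235B-A22B-Instruct-2507": 44.0,
}

values = list(scores.values())

num_models = len(values)              # number of ranked models
top_score = max(values)               # best score on the leaderboard
conventional_median = median(values)  # mean of the two middle values
lower_median = median_low(values)     # lower of the two middle values (assumed page convention)

print(f"Models: {num_models}")
print(f"Top score: {top_score}")
print(f"Median (conventional): {conventional_median}")
print(f"Median (lower middle value): {lower_median}")
```

Under that assumption, the lower-middle-value convention reproduces the figure shown on the page, while the conventional median is slightly higher.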