TAU-bench Airline
Part of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains.
Models: 20
Top score: 70
Median: 49.2
The airline domain evaluates a language agent's ability to hold dynamic conversations with users while following domain-specific rules and calling API tools. Agents must handle airline-related tasks and policies reliably.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
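The "best score to date" line is a running maximum over models ordered by release date. A minimal sketch of that calculation follows; the function name `best_to_date` and the sample dates and scores are hypothetical placeholders, not data from this leaderboard.

```python
from datetime import date


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in chronological order."""
    frontier = []
    best = float("-inf")
    # Sorting the tuples orders them by release date first.
    for released, score in sorted(points):
        best = max(best, score)
        frontier.append((released, best))
    return frontier


# Hypothetical (release_date, score) points, for illustration only.
points = [
    (date(2024, 1, 1), 40.0),
    (date(2024, 6, 1), 35.0),
    (date(2024, 9, 1), 55.0),
    (date(2025, 2, 1), 50.0),
]

for released, best in best_to_date(points):
    print(released.isoformat(), best)
```

A model that scores below the current best leaves the line flat, so the line only ever steps upward.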
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Claude Sonnet 4.5 | 70 |
| 2 | GLM 4.5 Air | 60.8 |
| 3 | GLM-4.5 | 60.4 |
| 4 | Claude Sonnet 4 | 60 |
| 5 | Claude Opus 4 | 59.6 |
| 6 | Claude 3.7 Sonnet | 58.4 |
| 7 | Claude Opus 4.1 | 56 |
| 8 | GPT-4.5 | 50 |
| 9 | o1 | 50 |
| 10 | GPT-4.1 | 49.4 |
| 11 | o4-mini | 49.2 |
| 12 | Qwen3 Next 80B A3B Thinking | 49 |
| 13 | Qwen3-235B-A22B-Thinking-2507 | 46 |
| 14 | Claude 3.5 Sonnet | 46 |
| 15 | Qwen3 Next 80B A3B Instruct | 44 |
| 16 | GPT-4o | 42.8 |
| 17 | GPT-4.1 Mini | 36 |
| 18 | o3-mini | 32.4 |
| 19 | Claude 3.5 Haiku | 22.8 |
| 20 | GPT-4.1 Nano | 14 |
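
The summary figures at the top of the page can be checked against the ranking. The sketch below copies the scores from the table and assumes the reported median is the lower of the two middle values; the page does not say how its median is computed, so that is a guess that happens to match.

```python
from statistics import median, median_low

# Scores copied from the ranking table, in rank order.
scores = [70, 60.8, 60.4, 60, 59.6, 58.4, 56, 50, 50, 49.4,
          49.2, 49, 46, 46, 44, 42.8, 36, 32.4, 22.8, 14]

n_models = len(scores)
top_score = max(scores)
lower_median = median_low(scores)  # lower of the two middle values
mean_median = median(scores)       # average of the two middle values

print(n_models, top_score, lower_median, round(mean_median, 1))
# prints: 20 70 49.2 49.3
```

With an even number of models the two middle scores are 49.2 and 49.4, so the conventional averaged median would be 49.3 rather than the 49.2 shown.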