
TAU-bench Airline


Models: 20
Top score: 70
Median: 49.2

Part of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains. The airline domain evaluates language agents' ability to interact with users through dynamic conversations while following domain-specific rules and using API tools. Agents must handle airline-related tasks and policies reliably.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

GPT-4o: 42.8 (2024-05-13)
Claude 3.5 Sonnet: 46 (2024-06-20)
Claude 3.5 Haiku: 22.8 (2024-11-04)
o1: 50 (2024-12-05)
o3-mini: 32.4 (2025-01-31)
Claude 3.7 Sonnet: 58.4 (2025-02-24)
GPT-4.5: 50 (2025-02-27)
GPT-4.1: 49.4 (2025-04-14)
GPT-4.1 Mini: 36 (2025-04-14)
GPT-4.1 Nano: 14 (2025-04-14)
o4-mini: 49.2 (2025-04-16)
Claude Opus 4: 59.6 (2025-05-22)
Claude Sonnet 4: 60 (2025-05-22)
Qwen3-235B-A22B-Thinking-2507: 46 (2025-07-25)
GLM 4.5 Air: 60.8 (2025-07-25)
GLM-4.5: 60.4 (2025-07-28)
Claude Opus 4.1: 56 (2025-08-05)
Qwen3 Next 80B A3B Thinking: 49 (2025-09-11)
Qwen3 Next 80B A3B Instruct: 44 (2025-09-11)
Claude Sonnet 4.5: 70 (2025-09-29)
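The "best score to date" line described above is a running maximum over the points ordered by release date. The following is a minimal sketch of that idea, not the site's actual charting code. The function name best_to_date, the tuple layout, the strict "greater than" rule for a new best, and the tie-break for models released on the same day (higher score first) are all assumptions made for illustration; the data is copied from the chart points.

```python
from datetime import date

# (model, release date, score), copied from the chart points above.
points = [
    ("GPT-4o", date(2024, 5, 13), 42.8),
    ("Claude 3.5 Sonnet", date(2024, 6, 20), 46.0),
    ("Claude 3.5 Haiku", date(2024, 11, 4), 22.8),
    ("o1", date(2024, 12, 5), 50.0),
    ("o3-mini", date(2025, 1, 31), 32.4),
    ("Claude 3.7 Sonnet", date(2025, 2, 24), 58.4),
    ("GPT-4.5", date(2025, 2, 27), 50.0),
    ("GPT-4.1", date(2025, 4, 14), 49.4),
    ("GPT-4.1 Mini", date(2025, 4, 14), 36.0),
    ("GPT-4.1 Nano", date(2025, 4, 14), 14.0),
    ("o4-mini", date(2025, 4, 16), 49.2),
    ("Claude Opus 4", date(2025, 5, 22), 59.6),
    ("Claude Sonnet 4", date(2025, 5, 22), 60.0),
    ("Qwen3-235B-A22B-Thinking-2507", date(2025, 7, 25), 46.0),
    ("GLM 4.5 Air", date(2025, 7, 25), 60.8),
    ("GLM-4.5", date(2025, 7, 28), 60.4),
    ("Claude Opus 4.1", date(2025, 8, 5), 56.0),
    ("Qwen3 Next 80B A3B Thinking", date(2025, 9, 11), 49.0),
    ("Qwen3 Next 80B A3B Instruct", date(2025, 9, 11), 44.0),
    ("Claude Sonnet 4.5", date(2025, 9, 29), 70.0),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order.

    Assumption: a point counts only if it strictly exceeds the previous
    best, and same-day releases are considered highest score first.
    """
    frontier = []
    best = float("-inf")
    # Sort by release date, then by descending score within a day.
    for model, released, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best:
            best = score
            frontier.append((model, released, score))
    return frontier


if __name__ == "__main__":
    for model, released, score in best_to_date(points):
        print(f"{released.isoformat()}  {model}: {score}")
```

With the data above, this sketch yields seven record-setting points: GPT-4o, Claude 3.5 Sonnet, o1, Claude 3.7 Sonnet, Claude Sonnet 4, GLM 4.5 Air, and Claude Sonnet 4.5.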

Ranking

1. Claude Sonnet 4.5: 70
2. GLM 4.5 Air: 60.8
3. GLM-4.5: 60.4
4. Claude Sonnet 4: 60
5. Claude Opus 4: 59.6
6. Claude 3.7 Sonnet: 58.4
7. Claude Opus 4.1: 56
8. GPT-4.5: 50
9. o1: 50
10. GPT-4.1: 49.4
11. o4-mini: 49.2
12. Qwen3 Next 80B A3B Thinking: 49
13. Qwen3-235B-A22B-Thinking-2507: 46
14. Claude 3.5 Sonnet: 46
15. Qwen3 Next 80B A3B Instruct: 44
16. GPT-4o: 42.8
17. GPT-4.1 Mini: 36
18. o3-mini: 32.4
19. Claude 3.5 Haiku: 22.8
20. GPT-4.1 Nano: 14
