TAU-bench Retail

A benchmark for evaluating tool-agent-user interaction in retail environments.

Models: 22
Top score: 86.2
Median: 68.4

TAU-bench Retail tests language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines. Agents are evaluated through multi-turn conversations on tasks such as order cancellations, address changes, and order status checks.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart data, in release order (y-axis: score; x-axis: release date, 2024 to 2025). Points that set a new best score are marked.

GPT-4o: 60.3 (2024-05-13), new best
Claude 3.5 Sonnet: 69.2 (2024-06-20), new best
Claude 3.5 Haiku: 51 (2024-11-04)
o1: 70.8 (2024-12-05), new best
o3-mini: 57.6 (2025-01-31)
Claude 3.7 Sonnet: 81.2 (2025-02-24), new best
GPT-4.5: 68.4 (2025-02-27)
GPT-4.1: 68 (2025-04-14)
GPT-4.1 Mini: 55.8 (2025-04-14)
GPT-4.1 Nano: 22.6 (2025-04-14)
o4-mini: 71.8 (2025-04-16)
Claude Sonnet 4: 80.5 (2025-05-22)
Claude Opus 4: 81.4 (2025-05-22), new best
GLM 4.5 Air: 77.9 (2025-07-25)
Qwen3-235B-A22B-Thinking-2507: 67.8 (2025-07-25)
GLM-4.5: 79.7 (2025-07-28)
gpt-oss-120b: 67.8 (2025-08-05)
gpt-oss-20b: 54.8 (2025-08-05)
Claude Opus 4.1: 82.4 (2025-08-05), new best
Qwen3 Next 80B A3B Thinking: 69.6 (2025-09-11)
Qwen3 Next 80B A3B Instruct: 60.9 (2025-09-11)
Claude Sonnet 4.5: 86.2 (2025-09-29), new best
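The "best score to date" line is a running maximum over the data points ordered by release date. A minimal Python sketch of that calculation follows, using the model names, scores, and dates from the chart. The function name `best_to_date` and the tie-breaking rule for models released on the same day (highest score first) are assumptions for illustration, not something the page specifies.

```python
# (model, score, release date) as listed in the chart data.
# ISO dates are kept as strings because they sort chronologically.
POINTS = [
    ("Claude 3.5 Haiku", 51, "2024-11-04"),
    ("o3-mini", 57.6, "2025-01-31"),
    ("GPT-4.5", 68.4, "2025-02-27"),
    ("GPT-4.1", 68, "2025-04-14"),
    ("GPT-4.1 Mini", 55.8, "2025-04-14"),
    ("GPT-4.1 Nano", 22.6, "2025-04-14"),
    ("o4-mini", 71.8, "2025-04-16"),
    ("Claude Sonnet 4", 80.5, "2025-05-22"),
    ("GLM 4.5 Air", 77.9, "2025-07-25"),
    ("Qwen3-235B-A22B-Thinking-2507", 67.8, "2025-07-25"),
    ("GLM-4.5", 79.7, "2025-07-28"),
    ("gpt-oss-120b", 67.8, "2025-08-05"),
    ("gpt-oss-20b", 54.8, "2025-08-05"),
    ("Qwen3 Next 80B A3B Thinking", 69.6, "2025-09-11"),
    ("Qwen3 Next 80B A3B Instruct", 60.9, "2025-09-11"),
    ("GPT-4o", 60.3, "2024-05-13"),
    ("Claude 3.5 Sonnet", 69.2, "2024-06-20"),
    ("o1", 70.8, "2024-12-05"),
    ("Claude 3.7 Sonnet", 81.2, "2025-02-24"),
    ("Claude Opus 4", 81.4, "2025-05-22"),
    ("Claude Opus 4.1", 82.4, "2025-08-05"),
    ("Claude Sonnet 4.5", 86.2, "2025-09-29"),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order."""
    best = float("-inf")
    frontier = []
    # Assumption: on a shared release date, the highest score is considered first,
    # so at most one model per day can set a new best.
    for model, score, released in sorted(points, key=lambda p: (p[2], -p[1])):
        if score > best:
            best = score
            frontier.append((model, score, released))
    return frontier


if __name__ == "__main__":
    for model, score, released in best_to_date(POINTS):
        print(f"{released}  {model}: {score}")
```

Under these assumptions the frontier contains seven models, from GPT-4o through Claude Sonnet 4.5.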

Ranking

1. Claude Sonnet 4.5: 86.2
2. Claude Opus 4.1: 82.4
3. Claude Opus 4: 81.4
4. Claude 3.7 Sonnet: 81.2
5. Claude Sonnet 4: 80.5
6. GLM-4.5: 79.7
7. GLM 4.5 Air: 77.9
8. o4-mini: 71.8
9. o1: 70.8
10. Qwen3 Next 80B A3B Thinking: 69.6
11. Claude 3.5 Sonnet: 69.2
12. GPT-4.5: 68.4
13. GPT-4.1: 68
14. Qwen3-235B-A22B-Thinking-2507: 67.8
15. gpt-oss-120b: 67.8
16. Qwen3 Next 80B A3B Instruct: 60.9
17. GPT-4o: 60.3
18. o3-mini: 57.6
19. GPT-4.1 Mini: 55.8
20. gpt-oss-20b: 54.8
21. Claude 3.5 Haiku: 51
22. GPT-4.1 Nano: 22.6
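
The summary figures at the top of the page (22 models, top score 86.2, median 68.4) can be checked against the ranking. The sketch below uses Python's standard `statistics` module. With an even number of scores, the reported 68.4 matches the lower of the two middle values (`median_low`), while the conventional interpolated median of 68.4 and 69.2 would be about 68.8. How the page actually computes its median is not stated, so treat that reading as an assumption.

```python
import statistics

# Scores copied from the ranking, in rank order.
SCORES = [
    86.2, 82.4, 81.4, 81.2, 80.5, 79.7, 77.9, 71.8, 70.8, 69.6, 69.2,
    68.4, 68, 67.8, 67.8, 60.9, 60.3, 57.6, 55.8, 54.8, 51, 22.6,
]

n_models = len(SCORES)
top_score = max(SCORES)

# Lower median: the smaller of the two middle values when the count is even.
lower_median = statistics.median_low(SCORES)

# Conventional median: the mean of the two middle values when the count is even.
interpolated_median = statistics.median(SCORES)

if __name__ == "__main__":
    print(f"models: {n_models}")
    print(f"top score: {top_score}")
    print(f"lower median: {lower_median}")
    print(f"interpolated median: {interpolated_median:.1f}")
```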