TAU-bench Retail
A benchmark for evaluating tool-agent-user interaction in retail environments.
Models: 22
Top score: 86.2
Median: 68.4
The benchmark tests a language agent's ability to hold dynamic, multi-turn conversations with users while calling domain-specific API tools and following policy guidelines. Tasks include order cancellations, address changes, and order status checks.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
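The "best score to date" line described above is a running maximum over models ordered by release date. A minimal sketch of that calculation follows. The function name `best_to_date` and the input entries are hypothetical placeholders, since this page does not list release dates, and the sketch is not necessarily how the chart is actually produced.

```python
from datetime import date

def best_to_date(entries):
    """Return (release_date, best_score_so_far) pairs in chronological order."""
    frontier = []
    best = float("-inf")
    # Sorting the (date, score) tuples orders them by release date first.
    for released, score in sorted(entries):
        best = max(best, score)
        frontier.append((released, best))
    return frontier

# Placeholder inputs for illustration only; not real release dates or scores.
example = [
    (date(2024, 3, 1), 60.0),
    (date(2024, 1, 1), 50.0),
    (date(2024, 6, 1), 55.0),
    (date(2024, 9, 1), 70.0),
]
frontier = best_to_date(example)
```

In this sketch a model that scores below the current best (the third placeholder entry) still appears as a point, but the line stays flat at the earlier best until a higher score arrives.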
Ranking
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Sonnet 4.5 | 86.2 |
| 2 | Claude Opus 4.1 | 82.4 |
| 3 | Claude Opus 4 | 81.4 |
| 4 | Claude 3.7 Sonnet | 81.2 |
| 5 | Claude Sonnet 4 | 80.5 |
| 6 | GLM-4.5 | 79.7 |
| 7 | GLM 4.5 Air | 77.9 |
| 8 | o4-mini | 71.8 |
| 9 | o1 | 70.8 |
| 10 | Qwen3 Next 80B A3B Thinking | 69.6 |
| 11 | Claude 3.5 Sonnet | 69.2 |
| 12 | GPT-4.5 | 68.4 |
| 13 | GPT-4.1 | 68.0 |
| 14 | Qwen3-235B-A22B-Thinking-2507 | 67.8 |
| 15 | gpt-oss-120b | 67.8 |
| 16 | Qwen3 Next 80B A3B Instruct | 60.9 |
| 17 | GPT-4o | 60.3 |
| 18 | o3-mini | 57.6 |
| 19 | GPT-4.1 Mini | 55.8 |
| 20 | gpt-oss-20b | 54.8 |
| 21 | Claude 3.5 Haiku | 51.0 |
| 22 | GPT-4.1 Nano | 22.6 |
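The summary figures at the top of the page can be checked against the ranking with a few lines of Python. This is only a sketch using the scores listed above. With 22 entries there are two middle values, and the reported median of 68.4 matches the lower of the two. Whether the page computes its median that way is an assumption.

```python
import statistics

# Scores copied from the ranking table above, in rank order.
scores = [86.2, 82.4, 81.4, 81.2, 80.5, 79.7, 77.9, 71.8, 70.8, 69.6, 69.2,
          68.4, 68.0, 67.8, 67.8, 60.9, 60.3, 57.6, 55.8, 54.8, 51.0, 22.6]

n_models = len(scores)
top_score = max(scores)

# With an even number of entries the two middle values are 68.4 and 69.2.
median_low = statistics.median_low(scores)  # lower middle value
median_avg = statistics.median(scores)      # mean of the two middle values

print(n_models, top_score, median_low, round(median_avg, 1))
```

The conventional median, which averages the two middle values, comes out slightly higher than the reported figure.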