τ²-bench Retail
The τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment, where both the agent and the user can interact with tools.
Models: 10
Top score: 83.2
Median: 70.6
The benchmark tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
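The "best score to date" line described above is a running maximum over models ordered by release date. The following is a minimal sketch of that idea, not the page's actual plotting code. The function name `best_to_date` is hypothetical, and the dates and scores in `example_points` are made-up placeholders, since this page does not list release dates.

```python
from datetime import date
from itertools import accumulate


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in chronological order.

    `points` is an iterable of (release_date, score) tuples.
    """
    # Sort by release date so the running maximum follows time.
    ordered = sorted(points, key=lambda p: p[0])
    # accumulate(..., max) yields the highest score seen up to each point.
    running_best = accumulate((score for _, score in ordered), max)
    return [(d, best) for (d, _), best in zip(ordered, running_best)]


# Placeholder data for illustration only: these are NOT real release
# dates or benchmark scores.
example_points = [
    (date(2025, 3, 1), 60.0),
    (date(2025, 1, 1), 55.0),
    (date(2025, 6, 1), 58.0),
    (date(2025, 9, 1), 72.5),
]

frontier = best_to_date(example_points)
for release_date, best in frontier:
    print(release_date.isoformat(), best)
```

A model that scores below the current best (the third placeholder point here) still appears as a point on the chart, but it leaves the line flat.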
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Claude Haiku 4.5 | 83.2 |
| 2 | GPT-5 | 81.1 |
| 3 | o3 | 80.2 |
| 4 | Qwen3-235B-A22B-Thinking-2507 | 71.9 |
| 5 | Qwen3-235B-A22B-Instruct-2507 | 71.3 |
| 6 | Kimi K2-Instruct-0905 | 70.6 |
| 7 | Kimi K2 Instruct | 70.6 |
| 8 | Qwen3 Next 80B A3B Thinking | 67.8 |
| 9 | GPT-4o | 63.4 |
| 10 | Qwen3 Next 80B A3B Instruct | 57.3 |
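The summary figures at the top of the page (model count, top score, median) can be checked against the ranking. The sketch below copies the ten scores from the table and uses Python's standard `statistics` module. One observation, offered as an assumption about how the page computes its figure: with an even number of entries, the conventional median averages the two middle scores (71.3 and 70.6), while the reported 70.6 matches the lower of the two.

```python
from statistics import median, median_low

# Scores copied from the ranking table, in rank order.
scores = [83.2, 81.1, 80.2, 71.9, 71.3, 70.6, 70.6, 67.8, 63.4, 57.3]

model_count = len(scores)
top_score = max(scores)

# Conventional median: mean of the two middle values for an even count.
conventional_median = median(scores)

# Lower median: the smaller of the two middle values. This matches the
# figure shown on the page; that the page uses this convention is an
# assumption, not something the page states.
lower_median = median_low(scores)

print("Models:", model_count)
print("Top score:", top_score)
print("Lower median:", lower_median)
print("Conventional median:", round(conventional_median, 2))
```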