τ²-bench Retail

The τ²-bench retail domain evaluates conversational AI agents in customer service scenarios. It uses a dual-control environment in which both the agent and the user can interact with tools.

Models: 10
Top score: 83.2
Median: 70.6

The benchmark tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: score (y-axis, 50 to 90) against release date (x-axis, 2024 to 2025). Data points in chronological order:

GPT-4o: 63.4 (2024-05-13)
o3: 80.2 (2025-04-16)
Kimi K2 Instruct: 70.6 (2025-07-11)
Qwen3-235B-A22B-Instruct-2507: 71.3 (2025-07-22)
Qwen3-235B-A22B-Thinking-2507: 71.9 (2025-07-25)
GPT-5: 81.1 (2025-08-07)
Kimi K2-Instruct-0905: 70.6 (2025-09-05)
Qwen3 Next 80B A3B Thinking: 67.8 (2025-09-11)
Qwen3 Next 80B A3B Instruct: 57.3 (2025-09-11)
Claude Haiku 4.5: 83.2 (2025-10-15)

Labeled points on the best-to-date line: GPT-4o, o3, GPT-5, Claude Haiku 4.5.
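The best-to-date line can be reproduced from the chart's data points. The following is a minimal sketch, not the page's own plotting code: the names `POINTS` and `best_to_date` are hypothetical, and the only assumption is that a model joins the line when its score strictly exceeds every earlier-released score.

```python
from datetime import date

# (model, score, release date) as listed in the chart on this page.
POINTS = [
    ("Kimi K2 Instruct", 70.6, date(2025, 7, 11)),
    ("Qwen3-235B-A22B-Instruct-2507", 71.3, date(2025, 7, 22)),
    ("Qwen3-235B-A22B-Thinking-2507", 71.9, date(2025, 7, 25)),
    ("Kimi K2-Instruct-0905", 70.6, date(2025, 9, 5)),
    ("Qwen3 Next 80B A3B Thinking", 67.8, date(2025, 9, 11)),
    ("Qwen3 Next 80B A3B Instruct", 57.3, date(2025, 9, 11)),
    ("GPT-4o", 63.4, date(2024, 5, 13)),
    ("o3", 80.2, date(2025, 4, 16)),
    ("GPT-5", 81.1, date(2025, 8, 7)),
    ("Claude Haiku 4.5", 83.2, date(2025, 10, 15)),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order."""
    frontier = []
    best = float("-inf")
    # Walk the models by release date and keep each strict improvement.
    for name, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            frontier.append((name, score, released))
    return frontier


if __name__ == "__main__":
    for name, score, released in best_to_date(POINTS):
        print(f"{released.isoformat()}  {score:5.1f}  {name}")
```

Under that assumption the line passes through GPT-4o, o3, GPT-5, and Claude Haiku 4.5, which are the four models the chart labels separately.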

Ranking

1. Claude Haiku 4.5: 83.2
2. GPT-5: 81.1
3. o3: 80.2
4. Qwen3-235B-A22B-Thinking-2507: 71.9
5. Qwen3-235B-A22B-Instruct-2507: 71.3
6. Kimi K2-Instruct-0905: 70.6
7. Kimi K2 Instruct: 70.6
8. Qwen3 Next 80B A3B Thinking: 67.8
9. GPT-4o: 63.4
10. Qwen3 Next 80B A3B Instruct: 57.3
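The summary figures (model count, top score, median) can be re-derived from the ranking. The page does not state how its median is computed, so the sketch below rests on an assumption: with ten entries, the lower of the two middle values matches the reported 70.6, whereas averaging the two middle values would give roughly 70.95. The variable names are illustrative only.

```python
import statistics

# Scores copied from the ranking on this page, in rank order.
scores = [83.2, 81.1, 80.2, 71.9, 71.3, 70.6, 70.6, 67.8, 63.4, 57.3]

model_count = len(scores)
top_score = max(scores)

# Assumption: the page reports the lower middle value for an even count.
lower_median = statistics.median_low(scores)

# For comparison: the conventional median averages the two middle values.
averaged_median = statistics.median(scores)

if __name__ == "__main__":
    print("models:", model_count)
    print("top score:", top_score)
    print("lower median:", lower_median)
    print("averaged median:", round(averaged_median, 2))
```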
