Agents

τ²-bench Retail

10 Models

83.2 Top score

70.6 Median

The τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment, where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.
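
The "best score to date" line described above is a running maximum over models ordered by release date. A minimal sketch of that computation follows. The function name `best_to_date` and the sample dates and scores are placeholders for illustration only; they are not taken from this leaderboard, which does not list release dates.

```python
from datetime import date


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in chronological order.

    `points` is an iterable of (release_date, score) tuples, one per model.
    """
    frontier = []
    best = float("-inf")
    # Sort by release date so the running maximum follows the timeline.
    for released, score in sorted(points):
        best = max(best, score)
        frontier.append((released, best))
    return frontier


# Placeholder data, NOT from the leaderboard: hypothetical dates and scores.
points = [
    (date(2025, 3, 1), 60.0),
    (date(2025, 1, 1), 55.0),
    (date(2025, 6, 1), 58.0),  # scores below the current best leave the line flat
    (date(2025, 9, 1), 72.5),
]
line = best_to_date(points)
```

Each input tuple is one point on the chart; the returned pairs are the vertices of the line, which only rises when a newly released model beats every earlier one.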

Ranking

Rank	Model	Organization	Score
1	Claude Haiku 4.5	Anthropic	83.2
2	GPT-5	OpenAI	81.1
3	o3	OpenAI	80.2
4	Qwen3-235B-A22B-Thinking-2507	Alibaba	71.9
5	Qwen3-235B-A22B-Instruct-2507	Alibaba	71.3
6	Kimi K2-Instruct-0905	Moonshot AI	70.6
7	Kimi K2 Instruct	Moonshot AI	70.6
8	Qwen3 Next 80B A3B Thinking	Alibaba	67.8
9	GPT-4o	OpenAI	63.4
10	Qwen3 Next 80B A3B Instruct	Alibaba	57.3
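
The summary figures at the top of the page (model count, top score, median) can be recomputed from the ten scores in the ranking. The sketch below uses Python's standard `statistics` module. The page does not state how its median is computed, so treating it as a "low" median is an assumption: for an even number of entries, the low median picks the lower of the two middle values (70.6), while the conventional median averages them (about 70.95).

```python
import statistics

# Scores copied from the ranking, in rank order.
scores = [83.2, 81.1, 80.2, 71.9, 71.3, 70.6, 70.6, 67.8, 63.4, 57.3]

n_models = len(scores)
top_score = max(scores)

# Assumption: the page reports the lower of the two middle values.
median_low = statistics.median_low(scores)

# Conventional median for an even count: mean of the two middle values.
median_interp = statistics.median(scores)

print(n_models, top_score, median_low, round(median_interp, 2))
```

Under that assumption the low median reproduces the listed 70.6, whereas the interpolated median would round to 70.95.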

Related Agents benchmarks

TAU-bench Retail	22
TAU-bench Airline	20
BFCL	10
τ²-bench Airline	10
BrowseComp	10