TAU-bench Retail

Models: 22

Top score: 86.2

Median: 68.4

TAU-bench Retail is a benchmark for evaluating tool-agent-user interaction in retail environments. It tests language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines. Agents are evaluated through multi-turn conversations on tasks such as order cancellations, address changes, and order status checks.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Ranking

Rank	Model	Organization	Score
1	Claude Sonnet 4.5	Anthropic	86.2
2	Claude Opus 4.1	Anthropic	82.4
3	Claude Opus 4	Anthropic	81.4
4	Claude 3.7 Sonnet	Anthropic	81.2
5	Claude Sonnet 4	Anthropic	80.5
6	GLM-4.5	Zhipu AI	79.7
7	GLM 4.5 Air	Zhipu AI	77.9
8	o4-mini	OpenAI	71.8
9	o1	OpenAI	70.8
10	Qwen3 Next 80B A3B Thinking	Alibaba	69.6
11	Claude 3.5 Sonnet	Anthropic	69.2
12	GPT-4.5	OpenAI	68.4
13	GPT-4.1	OpenAI	68
14	Qwen3-235B-A22B-Thinking-2507	Alibaba	67.8
15	gpt-oss-120b	OpenAI	67.8
16	Qwen3 Next 80B A3B Instruct	Alibaba	60.9
17	GPT-4o	OpenAI	60.3
18	o3-mini	OpenAI	57.6
19	GPT-4.1 Mini	OpenAI	55.8
20	gpt-oss-20b	OpenAI	54.8
21	Claude 3.5 Haiku	Anthropic	51
22	GPT-4.1 Nano	OpenAI	22.6
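
The summary figures for this leaderboard (model count, top score, median) can be checked against the ranking with a few lines of Python. This is only a sketch: the page does not say how its median is computed, so the snippet computes two common conventions. The variable names are illustrative, and the scores are copied by hand from the ranking.

```python
from statistics import median, median_low

# Scores copied from the ranking, in rank order (rank 1 first).
scores = [
    86.2, 82.4, 81.4, 81.2, 80.5, 79.7, 77.9, 71.8, 70.8, 69.6, 69.2,
    68.4, 68.0, 67.8, 67.8, 60.9, 60.3, 57.6, 55.8, 54.8, 51.0, 22.6,
]

n_models = len(scores)          # number of ranked models
top_score = max(scores)         # best score on the board

# With an even number of entries there are two middle values.
mid_low = median_low(scores)    # the lower of the two middle values
mid_avg = median(scores)        # the mean of the two middle values

print(f"models={n_models} top={top_score} "
      f"median_low={mid_low} median_avg={mid_avg:.1f}")
```

With 22 entries the two middle scores are 69.2 and 68.4. The lower-median convention gives 68.4, which matches the reported median, while averaging the two gives 68.8. Whether the page actually uses the lower median is an assumption.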

Related Agents benchmarks

TAU-bench Airline (20), BFCL (10), τ²-bench Airline (10), τ²-bench Retail (10), BrowseComp (10)