BrowseComp

Models: 10

Top score: 54.9

Median: 40.1

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. It measures an agent's ability to persist in gathering information, to navigate the web creatively, and to arrive at concise, verifiable answers. Although the questions are difficult, the benchmark itself is simple to use, because predicted answers are short and can be checked directly against reference answers.
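To illustrate why short answers are easy to verify, here is a minimal sketch of a scoring routine. It assumes a normalized exact-match comparison; this page does not describe the benchmark's actual grading procedure, which may differ. The function names `normalize`, `is_correct`, and `accuracy` are hypothetical.

```python
import re
import unicodedata


def normalize(answer: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", answer)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def is_correct(predicted: str, reference: str) -> bool:
    """Assumed rule: an answer counts as correct if it matches after normalization."""
    return normalize(predicted) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```

Exact match is the strictest reasonable choice: it rejects paraphrases and answers with extra words, so a real grader would likely need a more tolerant comparison.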

[Figure: State of the art over time. Each point is a model at its release date; the line traces the best score to date.]

Ranking

Rank	Model	Organization	Score
1	GPT-5	OpenAI	54.9
2	o4-mini	OpenAI	51.5
3	o3	OpenAI	49.7
4	GLM-4.6	Zhipu AI	45.1
5	Grok 4 Fast	xAI	44.9
6	DeepSeek V3.2 Exp	DeepSeek	40.1
7	DeepSeek-V3.1	DeepSeek	30
8	GLM-4.5	Zhipu AI	26.4
9	GLM 4.5 Air	Zhipu AI	21.3
10	DeepSeek-R1-0528	DeepSeek	8.9
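
The summary figures at the top of the page (10 models, top score 54.9, median 40.1) can be cross-checked against the ranking. The sketch below copies the scores from the table. Which median convention the page uses is an assumption: the reported 40.1 matches the lower of the two middle values, which Python exposes as `statistics.median_low`, whereas the interpolated median of these ten scores is about 42.5.

```python
import statistics

# Scores copied from the ranking table above.
scores = {
    "GPT-5": 54.9,
    "o4-mini": 51.5,
    "o3": 49.7,
    "GLM-4.6": 45.1,
    "Grok 4 Fast": 44.9,
    "DeepSeek V3.2 Exp": 40.1,
    "DeepSeek-V3.1": 30.0,
    "GLM-4.5": 26.4,
    "GLM 4.5 Air": 21.3,
    "DeepSeek-R1-0528": 8.9,
}

values = list(scores.values())
top_model = max(scores, key=scores.get)

summary = {
    "models": len(values),
    "top_score": max(values),
    # Assumption: the page reports the lower middle value for an even count.
    "median_low": statistics.median_low(values),
    # Conventional median: the mean of the two middle values.
    "median_interpolated": statistics.median(values),
}

for name, value in summary.items():
    print(f"{name}: {value}")
print(f"top model: {top_model}")
```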
