BrowseComp

Models: 10
Top score: 54.9
Median: 40.1

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to persist in gathering information, navigate the web creatively, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple to use, because predicted answers are short and easy to verify against reference answers.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart data (score axis from 0 to 60; release dates in 2025):

o3: 49.7 (2025-04-16)
o4-mini: 51.5 (2025-04-16)
DeepSeek-R1-0528: 8.9 (2025-05-28)
GLM 4.5 Air: 21.3 (2025-07-25)
GLM-4.5: 26.4 (2025-07-28)
GPT-5: 54.9 (2025-08-07)
DeepSeek-V3.1: 30 (2025-08-21)
Grok 4 Fast: 44.9 (2025-09-19)
DeepSeek V3.2 Exp: 40.1 (2025-09-29)
GLM-4.6: 45.1 (2025-09-30)
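
The "best score to date" line can be read as a running maximum over the points ordered by release date. The following is a minimal sketch of that idea in Python, using the scores and dates listed on this page. The function name `running_best` and the tie-breaking rule for models released on the same day (the higher score is considered first) are assumptions made for illustration, not a description of how the site builds its chart.

```python
from datetime import date

# (model, score, release date) as listed on this page
points = [
    ("o3", 49.7, date(2025, 4, 16)),
    ("o4-mini", 51.5, date(2025, 4, 16)),
    ("DeepSeek-R1-0528", 8.9, date(2025, 5, 28)),
    ("GLM 4.5 Air", 21.3, date(2025, 7, 25)),
    ("GLM-4.5", 26.4, date(2025, 7, 28)),
    ("GPT-5", 54.9, date(2025, 8, 7)),
    ("DeepSeek-V3.1", 30.0, date(2025, 8, 21)),
    ("Grok 4 Fast", 44.9, date(2025, 9, 19)),
    ("DeepSeek V3.2 Exp", 40.1, date(2025, 9, 29)),
    ("GLM-4.6", 45.1, date(2025, 9, 30)),
]


def running_best(points):
    """Return (release date, model, score) for each point that sets a new best."""
    best = float("-inf")
    frontier = []
    # Order by date; on the same date, look at the higher score first
    # (assumed tie-break, so only one model per day can set the record).
    for model, score, released in sorted(points, key=lambda p: (p[2], -p[1])):
        if score > best:
            best = score
            frontier.append((released, model, score))
    return frontier


frontier = running_best(points)
for released, model, score in frontier:
    print(released.isoformat(), model, score)
```

Under these assumptions, only o4-mini and GPT-5 set a new best, which is consistent with those two models being the labeled points in the chart.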

Ranking

1. GPT-5: 54.9
2. o4-mini: 51.5
3. o3: 49.7
4. GLM-4.6: 45.1
5. Grok 4 Fast: 44.9
6. DeepSeek V3.2 Exp: 40.1
7. DeepSeek-V3.1: 30
8. GLM-4.5: 26.4
9. GLM 4.5 Air: 21.3
10. DeepSeek-R1-0528: 8.9
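
The summary figures reported on this page (10 models, top score 54.9, median 40.1) can be checked against the ranking. The sketch below uses only Python's standard `statistics` module and the scores listed above. With an even number of entries, the conventional median averages the two middle values (40.1 and 44.9), whereas the reported 40.1 matches the lower of the two; that the page uses a lower median is an inference from the numbers, not something the page states.

```python
import statistics

# Scores copied from the ranking on this page
scores = {
    "GPT-5": 54.9,
    "o4-mini": 51.5,
    "o3": 49.7,
    "GLM-4.6": 45.1,
    "Grok 4 Fast": 44.9,
    "DeepSeek V3.2 Exp": 40.1,
    "DeepSeek-V3.1": 30.0,
    "GLM-4.5": 26.4,
    "GLM 4.5 Air": 21.3,
    "DeepSeek-R1-0528": 8.9,
}

values = list(scores.values())
top_model = max(scores, key=scores.get)

model_count = len(values)
top_score = scores[top_model]
# Lower median: the smaller of the two middle values for an even count
lower_median = statistics.median_low(values)
# Conventional median: the mean of the two middle values for an even count
mean_median = statistics.median(values)

print("models:", model_count)
print("top:", top_model, top_score)
print("lower median:", lower_median)
print("conventional median:", mean_median)
```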
