BrowseComp
Models: 10
Top score: 54.9
Median: 40.1
BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. It measures an agent's ability to persist in gathering information, navigate the web creatively, and arrive at concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple to use, because predicted answers are short and easy to check against reference answers.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | GPT-5 | 54.9 |
| 2 | o4-mini | 51.5 |
| 3 | o3 | 49.7 |
| 4 | GLM-4.6 | 45.1 |
| 5 | Grok 4 Fast | 44.9 |
| 6 | DeepSeek V3.2 Exp | 40.1 |
| 7 | DeepSeek-V3.1 | 30 |
| 8 | GLM-4.5 | 26.4 |
| 9 | GLM 4.5 Air | 21.3 |
| 10 | DeepSeek-R1-0528 | 8.9 |
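
The summary figures at the top of the page can be recomputed from the ranking. The sketch below is illustrative only: the `SCORES` dictionary and the `summarize` helper are hypothetical names, and the scores are copied by hand from the table. With ten models there are two middle scores (40.1 and 44.9), so the reported median of 40.1 appears to be the lower of the two, whereas the conventional median (their mean) would be about 42.5. How the page actually computes its median is an assumption here.

```python
import statistics

# Scores copied by hand from the ranking table (model name -> score).
SCORES = {
    "GPT-5": 54.9,
    "o4-mini": 51.5,
    "o3": 49.7,
    "GLM-4.6": 45.1,
    "Grok 4 Fast": 44.9,
    "DeepSeek V3.2 Exp": 40.1,
    "DeepSeek-V3.1": 30.0,
    "GLM-4.5": 26.4,
    "GLM 4.5 Air": 21.3,
    "DeepSeek-R1-0528": 8.9,
}


def summarize(scores):
    """Return model count, top score, and two median variants."""
    values = list(scores.values())
    return {
        "models": len(values),
        "top_score": max(values),
        # Lower of the two middle values when the count is even.
        "median_low": statistics.median_low(values),
        # Conventional median: mean of the two middle values when even.
        "median": statistics.median(values),
    }


if __name__ == "__main__":
    for name, value in summarize(SCORES).items():
        print(f"{name}: {value}")
```

Reporting both variants makes the gap visible: with an even number of entries, the choice of median convention shifts the headline figure by a couple of points.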