SimpleQA

SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models.

Models: 26

Top score: 97.1

Median: 31

The benchmark contains 4,326 short, fact-seeking questions that were adversarially collected and designed to have a single, indisputable answer. Questions cover diverse topics, from science and technology to entertainment. The benchmark also measures calibration, that is, whether models know what they know.

[Chart: State of the art over time. Each point is a model at its release date; the line traces the best score to date.]

Ranking

Rank	Model	Organization	Score
1	DeepSeek V3.2 Exp	DeepSeek	97.1
2	Grok 4 Fast	xAI	95
3	DeepSeek-V3.1	DeepSeek	93.4
4	DeepSeek-R1-0528	DeepSeek	92.3
5	GPT-4.5	OpenAI	62.5
6	Qwen3-235B-A22B-Instruct-2507	Alibaba	54.3
7	Gemini 2.5 Pro Preview 06-05	Google	54
8	Gemini 2.5 Pro	Google	50.8
9	o1	OpenAI	47
10	o1-preview	OpenAI	42.4
11	GPT-4o	OpenAI	38.2
12	Kimi K2 Base	Moonshot AI	35.3
13	Kimi K2-Instruct-0905	Moonshot AI	31
14	Kimi K2 Instruct	Moonshot AI	31
15	Gemini 2.5 Flash	Google	26.9
16	DeepSeek-V3	DeepSeek	24.9
17	Gemini 2.0 Flash Lite	Google	21.7
18	o3-mini	OpenAI	15
19	Mistral Small 3.2 24B Instruct	Mistral AI	12.1
20	Gemini 2.5 Flash Lite	Google	10.7
21	Mistral Small 3.1 24B Instruct	Mistral AI	10.4
22	Gemma 3 27B	Google	10
23	Gemma 3 12B	Google	6.3
24	Gemma 3 4B	Google	4
25	Phi 4	Microsoft	3
26	Gemma 3 1B	Google	2.2
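
The summary figures at the top of the page (model count, top score, median) can be recomputed from the ranking. The following is a minimal sketch, assuming the scores are copied by hand from the ranking above; the variable names are illustrative and not part of any leaderboard API.

```python
from statistics import median

# Scores copied from the ranking above, in rank order (assumption: hand-copied).
scores = [
    97.1, 95, 93.4, 92.3, 62.5, 54.3, 54, 50.8, 47, 42.4, 38.2, 35.3, 31,
    31, 26.9, 24.9, 21.7, 15, 12.1, 10.7, 10.4, 10, 6.3, 4, 3, 2.2,
]

model_count = len(scores)        # number of ranked models
top_score = max(scores)          # best score on the leaderboard
median_score = median(scores)    # mean of the two middle scores for an even count

print(f"Models: {model_count}")
print(f"Top score: {top_score}")
print(f"Median: {median_score}")
```

With an even number of entries, the median is the mean of the 13th and 14th scores, which are both 31 here.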
