SimpleQA

SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models.

Models: 26
Top score: 97.1
Median: 31
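
The summary figures can be reproduced from the ranking on this page. The following is a minimal sketch, assuming the median is the plain statistical median of all listed scores; the variable names are illustrative and not part of any published tooling.

```python
from statistics import median

# Scores copied from the ranking on this page, in rank order.
scores = [
    97.1, 95, 93.4, 92.3, 62.5, 54.3, 54, 50.8, 47, 42.4, 38.2, 35.3, 31,
    31, 26.9, 24.9, 21.7, 15, 12.1, 10.7, 10.4, 10, 6.3, 4, 3, 2.2,
]

model_count = len(scores)       # number of ranked models
top_score = max(scores)         # best score on the leaderboard
median_score = median(scores)   # mean of the two middle values for an even count

print(model_count, top_score, median_score)
```

With an even number of models, the median is the average of the 13th and 14th scores, which are both 31 here.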

The benchmark contains 4,326 short, fact-seeking questions, collected adversarially and each written to have a single, indisputable answer. Questions span topics from science and technology to entertainment. The benchmark also measures calibration, that is, whether models know what they know.
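
To make the scoring idea concrete, here is a minimal sketch of how per-question grades might be aggregated. It assumes each answer is graded as correct, incorrect, or not attempted, as in the published SimpleQA setup. The label strings, the `summarize` function, and the example data are hypothetical, and this page does not state which of these metrics its scores report.

```python
from collections import Counter

def summarize(grades):
    """Aggregate three-way grades into summary metrics (illustrative only)."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    # Share of all questions answered correctly.
    overall_correct = counts["correct"] / total if total else 0.0
    # Share of attempted questions answered correctly; abstentions are excluded.
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0

    # Harmonic mean of the two rates, so guessing and over-abstaining both cost.
    denom = overall_correct + correct_given_attempted
    f_score = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0

    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }

# Made-up grades for ten questions, not real benchmark results.
example = ["correct"] * 5 + ["incorrect"] * 3 + ["not_attempted"] * 2
result = summarize(example)
print(result)
```

A model that declines to answer when unsure lowers its overall rate but can raise its rate among attempted questions, which is why the two are reported together.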

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: score (y-axis, 0 to 100) by model release date (x-axis, 2024 to 2025).

Best score to date:
GPT-4o: 38.2 (2024-05-13)
o1-preview: 42.4 (2024-09-12)
o1: 47 (2024-12-05)
GPT-4.5: 62.5 (2025-02-27)
DeepSeek-R1-0528: 92.3 (2025-05-28)
DeepSeek-V3.1: 93.4 (2025-08-21)
Grok 4 Fast: 95 (2025-09-19)
DeepSeek V3.2 Exp: 97.1 (2025-09-29)

Other models:
DeepSeek-V3: 24.9 (2024-12-26)
Phi 4: 3 (2025-01-10)
o3-mini: 15 (2025-01-31)
Gemini 2.0 Flash Lite: 21.7 (2025-02-25)
Gemma 3 27B: 10 (2025-03-12)
Gemma 3 1B: 2.2 (2025-03-12)
Gemma 3 12B: 6.3 (2025-03-13)
Gemma 3 4B: 4 (2025-03-13)
Mistral Small 3.1 24B Instruct: 10.4 (2025-03-17)
Gemini 2.5 Pro: 50.8 (2025-03-25)
Gemini 2.5 Flash: 26.9 (2025-04-17)
Gemini 2.5 Pro Preview 06-05: 54 (2025-06-05)
Mistral Small 3.2 24B Instruct: 12.1 (2025-06-20)
Kimi K2 Base: 35.3 (2025-07-11)
Kimi K2 Instruct: 31 (2025-07-11)
Qwen3-235B-A22B-Instruct-2507: 54.3 (2025-07-22)
Gemini 2.5 Flash Lite: 10.7 (2025-07-22)
Kimi K2-Instruct-0905: 31 (2025-09-05)
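
The "best score to date" line amounts to a running maximum over models sorted by release date. The sketch below is one way to compute it, assuming a point joins the line only when it strictly beats every earlier score. The `best_to_date` helper is illustrative, and the data is copied from the chart on this page.

```python
from datetime import date

# (model, score, release date) copied from the chart data on this page.
points = [
    ("DeepSeek-V3", 24.9, date(2024, 12, 26)),
    ("Phi 4", 3, date(2025, 1, 10)),
    ("o3-mini", 15, date(2025, 1, 31)),
    ("Gemini 2.0 Flash Lite", 21.7, date(2025, 2, 25)),
    ("Gemma 3 27B", 10, date(2025, 3, 12)),
    ("Gemma 3 1B", 2.2, date(2025, 3, 12)),
    ("Gemma 3 12B", 6.3, date(2025, 3, 13)),
    ("Gemma 3 4B", 4, date(2025, 3, 13)),
    ("Mistral Small 3.1 24B Instruct", 10.4, date(2025, 3, 17)),
    ("Gemini 2.5 Pro", 50.8, date(2025, 3, 25)),
    ("Gemini 2.5 Flash", 26.9, date(2025, 4, 17)),
    ("Gemini 2.5 Pro Preview 06-05", 54, date(2025, 6, 5)),
    ("Mistral Small 3.2 24B Instruct", 12.1, date(2025, 6, 20)),
    ("Kimi K2 Base", 35.3, date(2025, 7, 11)),
    ("Kimi K2 Instruct", 31, date(2025, 7, 11)),
    ("Qwen3-235B-A22B-Instruct-2507", 54.3, date(2025, 7, 22)),
    ("Gemini 2.5 Flash Lite", 10.7, date(2025, 7, 22)),
    ("Kimi K2-Instruct-0905", 31, date(2025, 9, 5)),
    ("GPT-4o", 38.2, date(2024, 5, 13)),
    ("o1-preview", 42.4, date(2024, 9, 12)),
    ("o1", 47, date(2024, 12, 5)),
    ("GPT-4.5", 62.5, date(2025, 2, 27)),
    ("DeepSeek-R1-0528", 92.3, date(2025, 5, 28)),
    ("DeepSeek-V3.1", 93.4, date(2025, 8, 21)),
    ("Grok 4 Fast", 95, date(2025, 9, 19)),
    ("DeepSeek V3.2 Exp", 97.1, date(2025, 9, 29)),
]

def best_to_date(points):
    """Return the points that set a new best score, in release order."""
    frontier = []
    best = float("-inf")
    # Walk the models chronologically and keep only new record holders.
    for model, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            frontier.append((model, score, released))
    return frontier

frontier = best_to_date(points)
for model, score, released in frontier:
    print(f"{released.isoformat()}  {model}: {score}")
```

Models released after a stronger one, such as Gemini 2.5 Pro at 50.8 following GPT-4.5 at 62.5, appear as points on the chart but do not move the line.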

Ranking

1. DeepSeek V3.2 Exp: 97.1
2. Grok 4 Fast: 95
3. DeepSeek-V3.1: 93.4
4. DeepSeek-R1-0528: 92.3
5. GPT-4.5: 62.5
6. Qwen3-235B-A22B-Instruct-2507: 54.3
7. Gemini 2.5 Pro Preview 06-05: 54
8. Gemini 2.5 Pro: 50.8
9. o1: 47
10. o1-preview: 42.4
11. GPT-4o: 38.2
12. Kimi K2 Base: 35.3
13. Kimi K2-Instruct-0905: 31
14. Kimi K2 Instruct: 31
15. Gemini 2.5 Flash: 26.9
16. DeepSeek-V3: 24.9
17. Gemini 2.0 Flash Lite: 21.7
18. o3-mini: 15
19. Mistral Small 3.2 24B Instruct: 12.1
20. Gemini 2.5 Flash Lite: 10.7
21. Mistral Small 3.1 24B Instruct: 10.4
22. Gemma 3 27B: 10
23. Gemma 3 12B: 6.3
24. Gemma 3 4B: 4
25. Phi 4: 3
26. Gemma 3 1B: 2.2
