SimpleQA
SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models.
Models: 26
Top score: 97.1
Median: 31
The benchmark contains 4,326 short, fact-seeking questions that were adversarially collected and designed so that each has a single, indisputable answer. Questions span diverse topics, from science and technology to entertainment. The benchmark also measures calibration, that is, whether models know what they know.
[Chart: State of the art over time. Each point is a model at its release date; the line traces the best score to date.]
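The "best score to date" line in the chart caption amounts to a running maximum over models sorted by release date. A minimal sketch of that calculation follows; the dates and scores in `points` are placeholder values invented for illustration, not leaderboard data, and the variable names are assumptions.

```python
from datetime import date
from itertools import accumulate

# Placeholder (release_date, score) pairs for illustration only;
# these are NOT real leaderboard entries.
points = [
    (date(2024, 9, 1), 40.0),
    (date(2024, 1, 1), 20.0),
    (date(2025, 2, 1), 35.0),
    (date(2024, 6, 1), 15.0),
]

# Order the models by release date, then take the running maximum of the scores.
points.sort(key=lambda p: p[0])
best_to_date = list(accumulate((score for _, score in points), max))

# Pair each release date with the best score seen up to and including that date.
frontier = list(zip((d for d, _ in points), best_to_date))
for d, best in frontier:
    print(d.isoformat(), best)
```

A model that scores below the current best leaves the line flat, which is why the line never decreases even when individual points fall under it.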
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | DeepSeek V3.2 Exp | 97.1 |
| 2 | Grok 4 Fast | 95 |
| 3 | DeepSeek-V3.1 | 93.4 |
| 4 | DeepSeek-R1-0528 | 92.3 |
| 5 | GPT-4.5 | 62.5 |
| 6 | Qwen3-235B-A22B-Instruct-2507 | 54.3 |
| 7 | Gemini 2.5 Pro Preview 06-05 | 54 |
| 8 | Gemini 2.5 Pro | 50.8 |
| 9 | o1 | 47 |
| 10 | o1-preview | 42.4 |
| 11 | GPT-4o | 38.2 |
| 12 | Kimi K2 Base | 35.3 |
| 13 | Kimi K2-Instruct-0905 | 31 |
| 14 | Kimi K2 Instruct | 31 |
| 15 | Gemini 2.5 Flash | 26.9 |
| 16 | DeepSeek-V3 | 24.9 |
| 17 | Gemini 2.0 Flash Lite | 21.7 |
| 18 | o3-mini | 15 |
| 19 | Mistral Small 3.2 24B Instruct | 12.1 |
| 20 | Gemini 2.5 Flash Lite | 10.7 |
| 21 | Mistral Small 3.1 24B Instruct | 10.4 |
| 22 | Gemma 3 27B | 10 |
| 23 | Gemma 3 12B | 6.3 |
| 24 | Gemma 3 4B | 4 |
| 25 | Phi 4 | 3 |
| 26 | Gemma 3 1B | 2.2 |
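The summary figures at the top of the page (26 models, top score 97.1, median 31) can be checked directly against the ranking. The sketch below copies the scores from the table in rank order and recomputes them; the `scores` and `summary` names are illustrative assumptions.

```python
from statistics import median

# Scores copied from the ranking table, in rank order.
scores = [
    97.1, 95, 93.4, 92.3, 62.5, 54.3, 54, 50.8, 47, 42.4,
    38.2, 35.3, 31, 31, 26.9, 24.9, 21.7, 15, 12.1, 10.7,
    10.4, 10, 6.3, 4, 3, 2.2,
]

summary = {
    "models": len(scores),
    "top_score": max(scores),
    # With an even number of entries, the median is the mean of the two middle scores.
    "median": median(scores),
}
print(summary)
```

With 26 entries the median is the mean of the 13th and 14th scores, both 31, so it lands well below the cluster of scores above 90 at the top of the table.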