DROP

DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark requiring discrete reasoning over paragraph content.

Models: 25
Top score: 91.6
Median: 78.9

The benchmark contains crowdsourced, adversarially created questions that require resolving references and performing discrete operations such as addition, counting, or sorting. Answering them demands comprehensive paragraph understanding that goes beyond paraphrase-and-entity-typing shortcuts.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: DROP score (y-axis, 30 to 100) by model release date (x-axis, 2023 to 2025).

Models on the best-score-to-date line:
GPT-3.5 Turbo: 70.2 (2023-03-01)
GPT-4: 80.9 (2023-03-14)
GPT-4 Turbo: 86 (2023-11-06)
Claude 3.5 Sonnet: 87.1 (2024-06-20)
DeepSeek-V3: 91.6 (2024-12-26)

Other models:
Gemini 1.5 Pro: 74.9 (2024-02-15)
Claude 3 Sonnet: 78.9 (2024-02-29)
Claude 3 Opus: 83.1 (2024-03-04)
Claude 3 Haiku: 78.4 (2024-03-13)
GPT-4o-mini: 79.7 (2024-07-18)
Llama 3.1 405B Instruct: 84.8 (2024-07-23)
Llama 3.1 70B Instruct: 79.6 (2024-07-23)
Llama 3.1 8B Instruct: 59.5 (2024-07-23)
Claude 3.5 Haiku: 83.1 (2024-11-04)
Nova Pro: 85.4 (2024-11-20)
Nova Lite: 80.2 (2024-11-20)
Nova Micro: 79.3 (2024-11-20)
Phi 4: 75.5 (2025-01-10)
Granite 3.3 8B Instruct: 59.4 (2025-04-16)
Granite 3.3 8B Base: 36.1 (2025-04-16)
IBM Granite 4.0 Tiny Preview: 46.2 (2025-05-02)
Gemma 3n E4B Instructed LiteRT Preview: 60.8 (2025-05-20)
Gemma 3n E2B Instructed LiteRT (Preview): 53.9 (2025-05-20)
Gemma 3n E4B: 60.8 (2025-06-26)
Gemma 3n E2B: 53.9 (2025-06-26)
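
The "best score to date" line can be reproduced from the chart points as a running maximum over release dates. The following is a minimal sketch, assuming the points are held as (release date, model, score) tuples. The function name `best_to_date` is illustrative and not part of the page, and only a subset of the points is included for brevity.

```python
from datetime import date

# A subset of the chart points, copied from the chart data above.
points = [
    (date(2023, 3, 1), "GPT-3.5 Turbo", 70.2),
    (date(2023, 3, 14), "GPT-4", 80.9),
    (date(2023, 11, 6), "GPT-4 Turbo", 86.0),
    (date(2024, 2, 15), "Gemini 1.5 Pro", 74.9),
    (date(2024, 3, 4), "Claude 3 Opus", 83.1),
    (date(2024, 6, 20), "Claude 3.5 Sonnet", 87.1),
    (date(2024, 7, 23), "Llama 3.1 405B Instruct", 84.8),
    (date(2024, 11, 20), "Nova Pro", 85.4),
    (date(2024, 12, 26), "DeepSeek-V3", 91.6),
    (date(2025, 1, 10), "Phi 4", 75.5),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order."""
    frontier = []
    best = float("-inf")
    # Walk the models chronologically and keep each one that beats the running best.
    for released, model, score in sorted(points, key=lambda p: p[0]):
        if score > best:
            best = score
            frontier.append((released, model, score))
    return frontier


sota = best_to_date(points)
for released, model, score in sota:
    print(released.isoformat(), model, score)
```

A model released later with a lower score (for example Phi 4 at 75.5) appears as a point on the chart but does not move the line.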

Ranking

1. DeepSeek-V3: 91.6
2. Claude 3.5 Sonnet: 87.1
3. GPT-4 Turbo: 86
4. Nova Pro: 85.4
5. Llama 3.1 405B Instruct: 84.8
6. Claude 3.5 Haiku: 83.1
7. Claude 3 Opus: 83.1
8. GPT-4: 80.9
9. Nova Lite: 80.2
10. GPT-4o-mini: 79.7
11. Llama 3.1 70B Instruct: 79.6
12. Nova Micro: 79.3
13. Claude 3 Sonnet: 78.9
14. Claude 3 Haiku: 78.4
15. Phi 4: 75.5
16. Gemini 1.5 Pro: 74.9
17. GPT-3.5 Turbo: 70.2
18. Gemma 3n E4B Instructed LiteRT Preview: 60.8
19. Gemma 3n E4B: 60.8
20. Llama 3.1 8B Instruct: 59.5
21. Granite 3.3 8B Instruct: 59.4
22. Gemma 3n E2B Instructed LiteRT (Preview): 53.9
23. Gemma 3n E2B: 53.9
24. IBM Granite 4.0 Tiny Preview: 46.2
25. Granite 3.3 8B Base: 36.1
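
The summary figures at the top of the page (25 models, top score 91.6, median 78.9) can be checked against the ranking. The snippet below is a small sketch that assumes the scores are simply listed in rank order. The `summary` dictionary and its keys are illustrative names, not something the page defines.

```python
from statistics import median

# Scores copied from the ranking, in rank order.
scores = [
    91.6, 87.1, 86, 85.4, 84.8, 83.1, 83.1, 80.9, 80.2, 79.7,
    79.6, 79.3, 78.9, 78.4, 75.5, 74.9, 70.2, 60.8, 60.8, 59.5,
    59.4, 53.9, 53.9, 46.2, 36.1,
]

# With an odd number of entries, the median is the middle (13th) score.
summary = {
    "models": len(scores),
    "top_score": max(scores),
    "median": median(scores),
}
print(summary)
```

Because the list has 25 entries, the median is the score of the 13th-ranked model, Claude 3 Sonnet.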
