Reasoning

DROP

Models: 25

Top score: 91.6

Median: 78.9

DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark that requires discrete reasoning over the content of paragraphs. Its questions are crowdsourced and adversarially created: answering them requires resolving references and performing discrete operations such as addition, counting, or sorting, which demands a more comprehensive understanding of the paragraph than paraphrase-and-entity-typing shortcuts can provide.
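To make "discrete operations" concrete, here is a minimal Python sketch of counting, addition, and sorting over numbers pulled from a passage. The passage, its numbers, and the helper name extract_yardages are invented for illustration and are not taken from the DROP dataset. A real system would have to find the relevant numbers in free text itself; the regex below only works because the toy passage is so regular.

```python
import re

# Toy passage, invented for illustration (not from the DROP dataset).
PASSAGE = (
    "The home team scored a 12-yard touchdown in the first quarter, "
    "a 35-yard field goal in the second, and a 7-yard touchdown in the fourth."
)


def extract_yardages(text: str) -> list[int]:
    """Return every number immediately followed by '-yard' in the text."""
    return [int(n) for n in re.findall(r"(\d+)-yard", text)]


yards = extract_yardages(PASSAGE)

# Counting: how many scoring plays are described?
count = len(yards)
# Addition: total yards across those plays.
total = sum(yards)
# Sorting: longest play first.
ranked = sorted(yards, reverse=True)

print(count, total, ranked)  # 3 54 [35, 12, 7]
```

None of the three answers appears verbatim in the passage, which is why span-matching shortcuts are not enough for questions of this kind.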

Figure: State of the art over time. Each point is a model at its release date; the line traces the best score to date.
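The "best score to date" line is a running maximum over models ordered by release date. The sketch below shows one way to compute it. The dates, model names, and scores are hypothetical placeholders, since this page does not list release dates, and best_to_date is an illustrative helper rather than anything the site is known to use.

```python
from datetime import date

# Hypothetical (release_date, model, score) points.
# These are placeholders, not values from the ranking.
points = [
    (date(2023, 3, 1), "model-a", 70.0),
    (date(2023, 9, 1), "model-b", 82.5),
    (date(2024, 2, 1), "model-c", 79.0),
    (date(2024, 8, 1), "model-d", 88.0),
]


def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in date order."""
    frontier = []
    best = float("-inf")
    for released, _name, score in sorted(points, key=lambda p: p[0]):
        best = max(best, score)  # the line never moves down
        frontier.append((released, best))
    return frontier


for released, best in best_to_date(points):
    print(released.isoformat(), best)
```

In this toy data, model-c scores below the earlier model-b, so the line stays flat at 82.5 until model-d raises it.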

Ranking

Rank	Model	Organization	Score
1	DeepSeek-V3	DeepSeek	91.6
2	Claude 3.5 Sonnet	Anthropic	87.1
3	GPT-4 Turbo	OpenAI	86
4	Nova Pro	Amazon	85.4
5	Llama 3.1 405B Instruct	Meta	84.8
6	Claude 3.5 Haiku	Anthropic	83.1
7	Claude 3 Opus	Anthropic	83.1
8	GPT-4	OpenAI	80.9
9	Nova Lite	Amazon	80.2
10	GPT-4o-mini	OpenAI	79.7
11	Llama 3.1 70B Instruct	Meta	79.6
12	Nova Micro	Amazon	79.3
13	Claude 3 Sonnet	Anthropic	78.9
14	Claude 3 Haiku	Anthropic	78.4
15	Phi 4	Microsoft	75.5
16	Gemini 1.5 Pro	Google	74.9
17	GPT-3.5 Turbo	OpenAI	70.2
18	Gemma 3n E4B Instructed LiteRT Preview	Google	60.8
19	Gemma 3n E4B	Google	60.8
20	Llama 3.1 8B Instruct	Meta	59.5
21	Granite 3.3 8B Instruct	IBM	59.4
22	Gemma 3n E2B Instructed LiteRT (Preview)	Google	53.9
23	Gemma 3n E2B	Google	53.9
24	IBM Granite 4.0 Tiny Preview	IBM	46.2
25	Granite 3.3 8B Base	IBM	36.1
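The headline figures for this page (25 models, top score 91.6, median 78.9) can be reproduced from the ranking's score column. The following is a small Python check using only the standard library; the scores are copied from the ranking in rank order, and the variable names are illustrative.

```python
import statistics

# Scores copied from the ranking, in rank order.
scores = [
    91.6, 87.1, 86, 85.4, 84.8, 83.1, 83.1, 80.9, 80.2, 79.7,
    79.6, 79.3, 78.9, 78.4, 75.5, 74.9, 70.2, 60.8, 60.8, 59.5,
    59.4, 53.9, 53.9, 46.2, 36.1,
]

model_count = len(scores)
top_score = max(scores)
# With an odd number of entries, the median is the middle (13th) score.
median_score = statistics.median(scores)

print(model_count, top_score, median_score)  # 25 91.6 78.9
```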

Related Reasoning benchmarks

GPQA Diamond	405
BIG-Bench Hard	28
ARC-AGI-2	10