DROP
DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark requiring discrete reasoning over paragraph content.
Models: 25
Top score: 91.6
Median: 78.9
DROP consists of crowdsourced, adversarially created questions. Answering them requires resolving references within a paragraph and then performing discrete operations on what was found, such as addition, counting, or sorting. This demands a more complete understanding of the paragraph than shortcuts like paraphrase matching and entity typing can provide.
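To make the three discrete operations concrete, here is a minimal sketch in Python. The passage, the team names, and the helper `field_goal_yards` are invented for illustration and are not taken from the DROP dataset. The regular-expression extraction stands in for the reading step that a real model has to perform.

```python
import re

# Hypothetical passage, invented for illustration (not a DROP example).
passage = (
    "The Hawks scored field goals of 23, 41 and 35 yards. "
    "The Owls answered with field goals of 52 and 19 yards."
)


def field_goal_yards(sentence: str) -> list[int]:
    """Return every integer that appears in one sentence."""
    return [int(n) for n in re.findall(r"\d+", sentence)]


# Reference resolution, reduced here to "which sentence is about which team".
hawks_sentence, owls_sentence = passage.split(". ")[:2]
hawks = field_goal_yards(hawks_sentence)
owls = field_goal_yards(owls_sentence)

# Counting: "How many field goals were scored in total?"
total_field_goals = len(hawks) + len(owls)

# Addition: "How many total yards of field goals did the Hawks kick?"
hawks_total_yards = sum(hawks)

# Sorting: "How long was the longest field goal?"
all_sorted = sorted(hawks + owls)
longest = all_sorted[-1]

print(total_field_goals, hawks_total_yards, longest)
```

None of the three answers appears verbatim in the passage, which is why span-matching shortcuts are not enough for questions of this kind.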
Figure: State of the art over time. Each point is a model at its release date; the line traces the best score to date.
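The best-to-date line described in the caption is a running maximum over scores ordered by release date. The sketch below shows that computation. Release dates are not listed on this page, so the model names, dates, and scores are hypothetical placeholders.

```python
from datetime import date
from itertools import accumulate

# Hypothetical placeholder data: (model, release date, score).
# These are NOT the real release dates or scores of any listed model.
releases = [
    ("model-b", date(2023, 6, 1), 65.0),
    ("model-a", date(2023, 1, 1), 70.0),
    ("model-d", date(2024, 5, 1), 78.0),
    ("model-c", date(2024, 1, 1), 80.0),
    ("model-e", date(2024, 12, 1), 91.0),
]

# Order the points chronologically, as on the chart's x-axis.
by_date = sorted(releases, key=lambda r: r[1])

# Running maximum: the best score seen up to and including each release.
best_to_date = list(accumulate((score for _, _, score in by_date), max))

for (name, released, score), best in zip(by_date, best_to_date):
    print(f"{released}  {name}: score={score}  best so far={best}")
```

A model that scores below the current best leaves the line flat; only a new record raises it.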
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | DeepSeek-V3 | 91.6 |
| 2 | Claude 3.5 Sonnet | 87.1 |
| 3 | GPT-4 Turbo | 86 |
| 4 | Nova Pro | 85.4 |
| 5 | Llama 3.1 405B Instruct | 84.8 |
| 6 | Claude 3.5 Haiku | 83.1 |
| 7 | Claude 3 Opus | 83.1 |
| 8 | GPT-4 | 80.9 |
| 9 | Nova Lite | 80.2 |
| 10 | GPT-4o-mini | 79.7 |
| 11 | Llama 3.1 70B Instruct | 79.6 |
| 12 | Nova Micro | 79.3 |
| 13 | Claude 3 Sonnet | 78.9 |
| 14 | Claude 3 Haiku | 78.4 |
| 15 | Phi 4 | 75.5 |
| 16 | Gemini 1.5 Pro | 74.9 |
| 17 | GPT-3.5 Turbo | 70.2 |
| 18 | Gemma 3n E4B Instructed LiteRT Preview | 60.8 |
| 19 | Gemma 3n E4B | 60.8 |
| 20 | Llama 3.1 8B Instruct | 59.5 |
| 21 | Granite 3.3 8B Instruct | 59.4 |
| 22 | Gemma 3n E2B Instructed LiteRT (Preview) | 53.9 |
| 23 | Gemma 3n E2B | 53.9 |
| 24 | IBM Granite 4.0 Tiny Preview | 46.2 |
| 25 | Granite 3.3 8B Base | 36.1 |
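
The summary figures at the top of the page (25 models, top score 91.6, median 78.9) can be checked against the ranking. A minimal sketch, with the scores copied from the table above in rank order; the variable names are illustrative only:

```python
from statistics import median

# Scores copied from the ranking table, in rank order.
scores = [
    91.6, 87.1, 86, 85.4, 84.8, 83.1, 83.1, 80.9, 80.2, 79.7,
    79.6, 79.3, 78.9, 78.4, 75.5, 74.9, 70.2, 60.8, 60.8, 59.5,
    59.4, 53.9, 53.9, 46.2, 36.1,
]

model_count = len(scores)
top_score = max(scores)
# With an odd number of entries, the median is the middle (13th) score.
median_score = median(scores)

print(model_count, top_score, median_score)
```

With 25 entries, the median is the 13th-ranked score, which belongs to Claude 3 Sonnet.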