BIG-Bench Hard
BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks, selected because prior language models did not outperform the average human rater on them. The benchmark contains 6,511 evaluation examples that test multi-step reasoning, including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
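The "best score to date" line described above is a running maximum over models ordered by release date. The following is a minimal sketch of that computation, not the page's actual plotting code. The function name `running_best` and the sample points are hypothetical: the dates and scores are invented placeholders, not real release dates or BBH results.

```python
from datetime import date


def running_best(points):
    """Return (release_date, best_score_so_far) pairs in release-date order.

    `points` is an iterable of (release_date, score) tuples.
    """
    frontier = []
    best = float("-inf")
    # Sort by release date so the maximum is accumulated chronologically.
    for released, score in sorted(points):
        best = max(best, score)
        frontier.append((released, best))
    return frontier


# Placeholder data, invented for illustration only; these are not real
# release dates or benchmark scores.
example_points = [
    (date(2024, 3, 1), 80.0),
    (date(2024, 1, 1), 70.0),
    (date(2024, 6, 1), 75.0),
    (date(2024, 9, 1), 90.0),
]

frontier = running_best(example_points)
for released, best in frontier:
    print(released.isoformat(), best)
```

In this sketch a model that scores below the current best (the third placeholder point) still gets a point on the chart, but the line stays flat at the earlier maximum.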
Ranking
| Rank | Model | Score |
|---|---|---|
| 1 | Claude 3.5 Sonnet | 93.1 |
| 2 | Gemini 1.5 Pro | 89.2 |
| 3 | Qwen3 235B A22B | 88.9 |
| 4 | Gemma 3 27B | 87.6 |
| 5 | Nova Pro | 86.9 |
| 6 | Claude 3 Opus | 86.8 |
| 7 | Gemma 3 12B | 85.7 |
| 8 | Gemini 1.5 Flash | 85.5 |
| 9 | Qwen2.5 32B Instruct | 84.5 |
| 10 | DeepSeek-V2.5 | 84.3 |
| 11 | Claude 3 Sonnet | 82.9 |
| 12 | Qwen2 72B Instruct | 82.4 |
| 13 | Nova Lite | 82.4 |
| 14 | Nova Micro | 79.5 |
| 15 | Phi-3.5-MoE-instruct | 79.1 |
| 16 | Qwen2.5 14B Instruct | 78.2 |
| 17 | Claude 3 Haiku | 73.7 |
| 18 | Gemma 3 4B | 72.2 |
| 19 | Phi 4 Mini | 70.4 |
| 20 | Granite 3.3 8B Instruct | 69.1 |
| 21 | Granite 3.3 8B Base | 69.1 |
| 22 | Phi-3.5-mini-instruct | 69 |
| 23 | IBM Granite 4.0 Tiny Preview | 55.7 |
| 24 | Gemma 3n E4B Instructed LiteRT Preview | 52.9 |
| 25 | Gemma 3n E4B | 52.9 |
| 26 | Gemma 3n E2B Instructed LiteRT (Preview) | 44.3 |
| 27 | Gemma 3n E2B | 44.3 |
| 28 | Gemma 3 1B | 39.1 |