BIG-Bench Hard

BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks, selected because language models in prior evaluations did not outperform the average human rater on them.

Models: 28
Top score: 93.1
Median: 79.1
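
The three summary figures can be reproduced from the scores in the ranking. The sketch below is illustrative only: the page does not say how its median is computed, and the helper name `summarize` is hypothetical. It assumes the lower median (the smaller of the two middle values when the count is even), which matches the reported 79.1; averaging the two middle values (79.1 and 79.5) would give 79.3 instead.

```python
import statistics

# Scores copied from the ranking on this page (28 models).
scores = [
    93.1, 89.2, 88.9, 87.6, 86.9, 86.8, 85.7, 85.5, 84.5, 84.3,
    82.9, 82.4, 82.4, 79.5, 79.1, 78.2, 73.7, 72.2, 70.4, 69.1,
    69.1, 69, 55.7, 52.9, 52.9, 44.3, 44.3, 39.1,
]


def summarize(values):
    """Return the model count, the top score, and the lower median."""
    return {
        "models": len(values),
        "top_score": max(values),
        # Assumption: the page reports the lower of the two middle values.
        "median": statistics.median_low(values),
    }


summary = summarize(scores)
print(summary)
```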

The benchmark contains 6,511 evaluation examples that test multi-step reasoning, including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: BBH score (axis 30 to 100) by model release date, 2024 to 2025]

Gemini 1.5 Pro: 89.2 (2024-02-15)
Claude 3 Sonnet: 82.9 (2024-02-29)
Claude 3 Opus: 86.8 (2024-03-04)
Claude 3 Haiku: 73.7 (2024-03-13)
Gemini 1.5 Flash: 85.5 (2024-05-01)
DeepSeek-V2.5: 84.3 (2024-05-08)
Claude 3.5 Sonnet: 93.1 (2024-06-20)
Qwen2 72B Instruct: 82.4 (2024-07-23)
Phi-3.5-MoE-instruct: 79.1 (2024-08-23)
Phi-3.5-mini-instruct: 69 (2024-08-23)
Qwen2.5 32B Instruct: 84.5 (2024-09-19)
Qwen2.5 14B Instruct: 78.2 (2024-09-19)
Nova Pro: 86.9 (2024-11-20)
Nova Lite: 82.4 (2024-11-20)
Nova Micro: 79.5 (2024-11-20)
Phi 4 Mini: 70.4 (2025-02-01)
Gemma 3 27B: 87.6 (2025-03-12)
Gemma 3 1B: 39.1 (2025-03-12)
Gemma 3 12B: 85.7 (2025-03-13)
Gemma 3 4B: 72.2 (2025-03-13)
Granite 3.3 8B Instruct: 69.1 (2025-04-16)
Granite 3.3 8B Base: 69.1 (2025-04-16)
Qwen3 235B A22B: 88.9 (2025-04-28)
IBM Granite 4.0 Tiny Preview: 55.7 (2025-05-02)
Gemma 3n E4B Instructed LiteRT Preview: 52.9 (2025-05-20)
Gemma 3n E2B Instructed LiteRT (Preview): 44.3 (2025-05-20)
Gemma 3n E4B: 52.9 (2025-06-26)
Gemma 3n E2B: 44.3 (2025-06-26)
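
The "best score to date" line described above is a running maximum over release dates. The following is a minimal sketch of that idea using a handful of the chart's points; the function name `best_to_date` and the tuple layout are illustrative assumptions, not the site's actual implementation.

```python
from datetime import date

# A few (model, score, release date) points copied from the chart data.
points = [
    ("Claude 3 Sonnet", 82.9, date(2024, 2, 29)),
    ("Claude 3 Opus", 86.8, date(2024, 3, 4)),
    ("Gemini 1.5 Pro", 89.2, date(2024, 2, 15)),
    ("Claude 3.5 Sonnet", 93.1, date(2024, 6, 20)),
    ("Nova Pro", 86.9, date(2024, 11, 20)),
    ("Qwen3 235B A22B", 88.9, date(2025, 4, 28)),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order."""
    frontier = []
    best = float("-inf")
    # Walk the models in release order and keep only new records.
    for name, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            frontier.append((name, score, released))
    return frontier


for name, score, released in best_to_date(points):
    print(f"{released.isoformat()}  {name}: {score}")
```

On this subset, only Gemini 1.5 Pro and Claude 3.5 Sonnet set a new record, which agrees with the two models the chart labels on its line.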

Ranking

1. Claude 3.5 Sonnet: 93.1
2. Gemini 1.5 Pro: 89.2
3. Qwen3 235B A22B: 88.9
4. Gemma 3 27B: 87.6
5. Nova Pro: 86.9
6. Claude 3 Opus: 86.8
7. Gemma 3 12B: 85.7
8. Gemini 1.5 Flash: 85.5
9. Qwen2.5 32B Instruct: 84.5
10. DeepSeek-V2.5: 84.3
11. Claude 3 Sonnet: 82.9
12. Qwen2 72B Instruct: 82.4
13. Nova Lite: 82.4
14. Nova Micro: 79.5
15. Phi-3.5-MoE-instruct: 79.1
16. Qwen2.5 14B Instruct: 78.2
17. Claude 3 Haiku: 73.7
18. Gemma 3 4B: 72.2
19. Phi 4 Mini: 70.4
20. Granite 3.3 8B Instruct: 69.1
21. Granite 3.3 8B Base: 69.1
22. Phi-3.5-mini-instruct: 69
23. IBM Granite 4.0 Tiny Preview: 55.7
24. Gemma 3n E4B Instructed LiteRT Preview: 52.9
25. Gemma 3n E4B: 52.9
26. Gemma 3n E2B Instructed LiteRT (Preview): 44.3
27. Gemma 3n E2B: 44.3
28. Gemma 3 1B: 39.1

Related Reasoning benchmarks