Reasoning

BIG-Bench Hard

Models: 28
Top score: 93.1
Median: 79.1

BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks, selected because prior language models did not outperform the average human rater on them. The benchmark contains 6,511 evaluation examples that test multi-step reasoning, including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks call for capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.

[Chart: State of the art over time. Each point is a model at its release date; the line traces the best score to date.]

Ranking

Rank	Model	Organization	Score
1	Claude 3.5 Sonnet	Anthropic	93.1
2	Gemini 1.5 Pro	Google	89.2
3	Qwen3 235B A22B	Alibaba	88.9
4	Gemma 3 27B	Google	87.6
5	Nova Pro	Amazon	86.9
6	Claude 3 Opus	Anthropic	86.8
7	Gemma 3 12B	Google	85.7
8	Gemini 1.5 Flash	Google	85.5
9	Qwen2.5 32B Instruct	Alibaba	84.5
10	DeepSeek-V2.5	DeepSeek	84.3
11	Claude 3 Sonnet	Anthropic	82.9
12	Qwen2 72B Instruct	Alibaba	82.4
13	Nova Lite	Amazon	82.4
14	Nova Micro	Amazon	79.5
15	Phi-3.5-MoE-instruct	Microsoft	79.1
16	Qwen2.5 14B Instruct	Alibaba	78.2
17	Claude 3 Haiku	Anthropic	73.7
18	Gemma 3 4B	Google	72.2
19	Phi 4 Mini	Microsoft	70.4
20	Granite 3.3 8B Instruct	IBM	69.1
21	Granite 3.3 8B Base	IBM	69.1
22	Phi-3.5-mini-instruct	Microsoft	69.0
23	IBM Granite 4.0 Tiny Preview	IBM	55.7
24	Gemma 3n E4B Instructed LiteRT Preview	Google	52.9
25	Gemma 3n E4B	Google	52.9
26	Gemma 3n E2B Instructed LiteRT (Preview)	Google	44.3
27	Gemma 3n E2B	Google	44.3
28	Gemma 3 1B	Google	39.1
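
The summary figures quoted for this leaderboard can be rechecked from the ranking itself. The following is a minimal Python sketch, assuming the scores are copied by hand from the ranking above; the variable names are illustrative and not part of the source page. The page does not say which median convention it uses, so the sketch computes both the lower median and the interpolated median.

```python
from statistics import median, median_low

# Scores copied from the ranking above, in rank order (assumption: transcribed by hand).
scores = [
    93.1, 89.2, 88.9, 87.6, 86.9, 86.8, 85.7, 85.5, 84.5, 84.3,
    82.9, 82.4, 82.4, 79.5, 79.1, 78.2, 73.7, 72.2, 70.4, 69.1,
    69.1, 69.0, 55.7, 52.9, 52.9, 44.3, 44.3, 39.1,
]

model_count = len(scores)   # number of ranked models
top_score = max(scores)     # best score on the leaderboard

# With an even number of entries there are two common conventions:
# median_low returns the lower of the two middle values,
# median returns their average.
lower_median = median_low(scores)
interpolated_median = median(scores)

print(f"models: {model_count}")
print(f"top score: {top_score}")
print(f"lower median: {lower_median}")
print(f"interpolated median: {interpolated_median:.1f}")
```

With these 28 scores, the lower median is the 15th-ranked score, which matches the stated median of 79.1. The interpolated median averages the 14th- and 15th-ranked scores instead, so the two conventions give slightly different values here.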

Related Reasoning benchmarks

GPQA Diamond (405), DROP (25), ARC-AGI-2 (10)