MMLU

Massive Multitask Language Understanding — 57 subjects spanning STEM, humanities, and social sciences.

Models: 92
Top score: 92.5
Median: 81.2

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: MMLU score (y-axis, 50 to 100) by release date (x-axis, 2023 to 2025)]

Models on the best-to-date line:

GPT-3.5 Turbo: 70 (2023-03-01)
GPT-4: 86.4 (2023-03-14)
GPT-4 Turbo: 86.5 (2023-11-06)
Claude 3 Opus: 86.8 (2024-03-04)
GPT-4o: 88.7 (2024-05-13)
Claude 3.5 Sonnet: 90.4 (2024-06-20)
o1-preview: 90.8 (2024-09-12)
o1: 92 (2024-12-05)
GPT-5: 92.5 (2025-08-07)

Other models:

Claude 2: 78.5 (2023-07-11)
Mistral 7B: 60.1 (2023-09-27)
Mixtral 8x7B: 70.6 (2023-12-11)
Gemini 1.5 Pro: 85.9 (2024-02-15)
Gemini 1.0 Pro: 71.8 (2024-02-15)
Claude 3 Sonnet: 79 (2024-02-29)
Claude 3 Haiku: 75.2 (2024-03-13)
Grok-1.5: 81.3 (2024-03-28)
Command R+: 75.7 (2024-04-04)
Mixtral 8x22B: 77.8 (2024-04-10)
Gemini 1.5 Flash: 78.9 (2024-05-01)
DeepSeek-V2.5: 80.4 (2024-05-08)
Gemma 2 9B: 71.3 (2024-06-27)
Gemma 2 27B: 75.2 (2024-07-13)
GPT-4o-mini: 82 (2024-07-18)
Mistral NeMo Instruct: 68 (2024-07-18)
Llama 3.1 405B: 88.6 (2024-07-23)
Llama 3.1 405B Instruct: 87.3 (2024-07-23)
Llama 3.1 70B Instruct: 83.6 (2024-07-23)
Qwen2 72B Instruct: 82.3 (2024-07-23)
Qwen2 7B Instruct: 70.5 (2024-07-23)
Llama 3.1 8B Instruct: 69.4 (2024-07-23)
Mistral Large 2: 84 (2024-07-24)
Grok-2: 87.5 (2024-08-13)
Grok-2 mini: 86.2 (2024-08-13)
Jamba 1.5 Large: 81.2 (2024-08-22)
Jamba 1.5 Mini: 69.7 (2024-08-22)
Phi-3.5-MoE-instruct: 78.9 (2024-08-23)
Phi-3.5-mini-instruct: 69 (2024-08-23)
o1-mini: 85.2 (2024-09-12)
Pixtral-12B: 69.2 (2024-09-17)
Qwen2.5 72B: 86.1 (2024-09-19)
Qwen2.5 32B Instruct: 83.3 (2024-09-19)
Qwen2.5 14B Instruct: 79.7 (2024-09-19)
Qwen2.5-Coder 7B Instruct: 67.6 (2024-09-19)
Llama 3.2 90B Instruct: 86 (2024-09-25)
Llama 3.2 11B Instruct: 73 (2024-09-25)
Llama 3.2 3B Instruct: 63.4 (2024-09-25)
Llama 3.1 Nemotron 70B Instruct: 80.2 (2024-10-01)
Ministral 8B Instruct: 65 (2024-10-16)
Claude 3.5 Haiku: 80.9 (2024-11-04)
Qwen2.5 Coder 32B Instruct: 75.1 (2024-11-11)
Nova Pro: 85.9 (2024-11-20)
Nova Lite: 80.5 (2024-11-20)
Nova Micro: 77.6 (2024-11-20)
Llama 3.3 70B Instruct: 86 (2024-12-06)
Llama 3.3 70B: 86 (2024-12-06)
Gemini 2.0 Flash: 87 (2024-12-11)
DeepSeek-V3: 88.5 (2024-12-26)
Phi 4: 84.8 (2025-01-10)
DeepSeek-R1: 90.8 (2025-01-20)
Kimi-k1.5: 87.4 (2025-01-20)
Mistral Small 3 24B Base: 80.7 (2025-01-30)
o3-mini: 86.9 (2025-01-31)
Phi 4 Mini: 67.3 (2025-02-01)
Claude 3.7 Sonnet: 86.1 (2025-02-24)
GPT-4.5: 90.8 (2025-02-27)
Qwen2.5 VL 32B Instruct: 78.4 (2025-02-28)
Mistral Small 3.1 24B Base: 81 (2025-03-17)
Mistral Small 3.1 24B Instruct: 80.6 (2025-03-17)
Llama 4 Maverick: 85.5 (2025-04-05)
Llama 4 Scout: 79.6 (2025-04-05)
GPT-4.1: 90.2 (2025-04-14)
GPT-4.1 Mini: 87.5 (2025-04-14)
GPT-4.1 Nano: 80.1 (2025-04-14)
Granite 3.3 8B Instruct: 65.5 (2025-04-16)
Granite 3.3 8B Base: 63.9 (2025-04-16)
Qwen3 235B A22B: 87.8 (2025-04-28)
IBM Granite 4.0 Tiny Preview: 60.4 (2025-05-02)
Gemma 3n E4B Instructed LiteRT Preview: 64.9 (2025-05-20)
Gemma 3n E2B Instructed LiteRT (Preview): 60.1 (2025-05-20)
Claude Opus 4: 88.8 (2025-05-22)
Claude Sonnet 4: 88 (2025-05-22)
Mistral Small 3.2 24B Instruct: 80.5 (2025-06-20)
Gemma 3n E4B Instructed: 64.9 (2025-06-26)
Gemma 3n E2B Instructed: 60.1 (2025-06-26)
Kimi K2 Instruct: 89.5 (2025-07-11)
Kimi K2: 89.5 (2025-07-11)
Kimi K2 Base: 87.8 (2025-07-11)
gpt-oss-120b: 90 (2025-08-05)
gpt-oss-20b: 85.3 (2025-08-05)
Kimi K2 0905: 90.2 (2025-09-05)
Kimi K2-Instruct-0905: 89.5 (2025-09-05)
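
The "best score to date" line described above amounts to a running maximum over models sorted by release date. The following is a minimal sketch of that idea, not the site's actual implementation: the function name `best_to_date` and the tuple layout are hypothetical, the sample is a small subset of the data points listed above, and treating ties as not moving the line is an assumption.

```python
from datetime import date

# A small subset of (model, score, release date) points from the chart data.
points = [
    ("GPT-3.5 Turbo", 70.0, date(2023, 3, 1)),
    ("GPT-4", 86.4, date(2023, 3, 14)),
    ("Claude 2", 78.5, date(2023, 7, 11)),
    ("GPT-4 Turbo", 86.5, date(2023, 11, 6)),
    ("Gemini 1.5 Pro", 85.9, date(2024, 2, 15)),
    ("Claude 3 Opus", 86.8, date(2024, 3, 4)),
    ("GPT-4o", 88.7, date(2024, 5, 13)),
]


def best_to_date(points):
    """Return the points that set a new best score, in release order."""
    frontier = []
    best = float("-inf")
    for name, score, released in sorted(points, key=lambda p: p[2]):
        # Assumption: only a strictly higher score moves the line.
        if score > best:
            best = score
            frontier.append((name, score, released))
    return frontier


frontier = best_to_date(points)
for name, score, released in frontier:
    print(f"{released.isoformat()}  {name}: {score}")
```

On this subset, Claude 2 and Gemini 1.5 Pro are dropped because an earlier model had already scored higher, which matches how they appear as off-line points in the chart data.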

Ranking

1. GPT-5: 92.5
2. o1: 92
3. o1-preview: 90.8
4. DeepSeek-R1: 90.8
5. GPT-4.5: 90.8
6. Claude 3.5 Sonnet: 90.4
7. Kimi K2 0905: 90.2
8. GPT-4.1: 90.2
9. gpt-oss-120b: 90
10. Kimi K2-Instruct-0905: 89.5
11. Kimi K2 Instruct: 89.5
12. Kimi K2: 89.5
13. Claude Opus 4: 88.8
14. GPT-4o: 88.7
15. Llama 3.1 405B: 88.6
16. DeepSeek-V3: 88.5
17. Claude Sonnet 4: 88
18. Kimi K2 Base: 87.8
19. Qwen3 235B A22B: 87.8
20. GPT-4.1 Mini: 87.5
21. Grok-2: 87.5
22. Kimi-k1.5: 87.4
23. Llama 3.1 405B Instruct: 87.3
24. Gemini 2.0 Flash: 87
25. o3-mini: 86.9
26. Claude 3 Opus: 86.8
27. GPT-4 Turbo: 86.5
28. GPT-4: 86.4
29. Grok-2 mini: 86.2
30. Claude 3.7 Sonnet: 86.1
31. Qwen2.5 72B: 86.1
32. Llama 3.2 90B Instruct: 86
33. Llama 3.3 70B Instruct: 86
34. Llama 3.3 70B: 86
35. Nova Pro: 85.9
36. Gemini 1.5 Pro: 85.9
37. Llama 4 Maverick: 85.5
38. gpt-oss-20b: 85.3
39. o1-mini: 85.2
40. Phi 4: 84.8
41. Mistral Large 2: 84
42. Llama 3.1 70B Instruct: 83.6
43. Qwen2.5 32B Instruct: 83.3
44. Qwen2 72B Instruct: 82.3
45. GPT-4o-mini: 82
46. Grok-1.5: 81.3
47. Jamba 1.5 Large: 81.2
48. Mistral Small 3.1 24B Base: 81
49. Claude 3.5 Haiku: 80.9
50. Mistral Small 3 24B Base: 80.7
51. Mistral Small 3.1 24B Instruct: 80.6
52. Nova Lite: 80.5
53. Mistral Small 3.2 24B Instruct: 80.5
54. DeepSeek-V2.5: 80.4
55. Llama 3.1 Nemotron 70B Instruct: 80.2
56. GPT-4.1 Nano: 80.1
57. Qwen2.5 14B Instruct: 79.7
58. Llama 4 Scout: 79.6
59. Claude 3 Sonnet: 79
60. Phi-3.5-MoE-instruct: 78.9
61. Gemini 1.5 Flash: 78.9
62. Claude 2: 78.5
63. Qwen2.5 VL 32B Instruct: 78.4
64. Mixtral 8x22B: 77.8
65. Nova Micro: 77.6
66. Command R+: 75.7
67. Claude 3 Haiku: 75.2
68. Gemma 2 27B: 75.2
69. Qwen2.5 Coder 32B Instruct: 75.1
70. Llama 3.2 11B Instruct: 73
71. Gemini 1.0 Pro: 71.8
72. Gemma 2 9B: 71.3
73. Mixtral 8x7B: 70.6
74. Qwen2 7B Instruct: 70.5
75. GPT-3.5 Turbo: 70
76. Jamba 1.5 Mini: 69.7
77. Llama 3.1 8B Instruct: 69.4
78. Pixtral-12B: 69.2
79. Phi-3.5-mini-instruct: 69
80. Mistral NeMo Instruct: 68
81. Qwen2.5-Coder 7B Instruct: 67.6
82. Phi 4 Mini: 67.3
83. Granite 3.3 8B Instruct: 65.5
84. Ministral 8B Instruct: 65
85. Gemma 3n E4B Instructed LiteRT Preview: 64.9
86. Gemma 3n E4B Instructed: 64.9
87. Granite 3.3 8B Base: 63.9
88. Llama 3.2 3B Instruct: 63.4
89. IBM Granite 4.0 Tiny Preview: 60.4
90. Gemma 3n E2B Instructed LiteRT (Preview): 60.1
91. Gemma 3n E2B Instructed: 60.1
92. Mistral 7B: 60.1