
MMMU

MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate multimodal models on college-level subject knowledge and deliberate reasoning.

Models: 52
Top score: 84.2
Median: 65.9

The benchmark contains 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks. The questions cover six core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering), spanning 30 subjects and 183 subfields.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart: MMMU score (y-axis, 0 to 90) by model release date (x-axis, 2023 to 2025).

Points on the best-score line:
GPT-3.5 Turbo: 0 (2023-03-01)
Gemini 1.5 Pro: 65.9 (2024-02-15)
GPT-4o: 72.2 (2024-05-13)
o1: 77.6 (2024-12-05)
Grok-3: 78 (2025-02-17)
Gemini 2.5 Pro: 79.6 (2025-03-25)
o3: 82.9 (2025-04-16)
GPT-5: 84.2 (2025-08-07)

Other points:
Gemini 1.0 Pro: 47.9 (2024-02-15)
Gemini 1.5 Flash 8B: 53.7 (2024-03-15)
Grok-1.5: 53.6 (2024-03-28)
Grok-1.5V: 53.6 (2024-04-12)
Gemini 1.5 Flash: 62.3 (2024-05-01)
Claude 3.5 Sonnet: 68.3 (2024-06-20)
GPT-4o-mini: 59.4 (2024-07-18)
Grok-2: 66.1 (2024-08-13)
Grok-2 mini: 63.2 (2024-08-13)
Phi-3.5-vision-instruct: 43 (2024-08-23)
Pixtral-12B: 52.5 (2024-09-17)
Llama 3.2 90B Instruct: 60.3 (2024-09-25)
Llama 3.2 11B Instruct: 50.7 (2024-09-25)
Pixtral Large: 64 (2024-11-19)
Nova Pro: 61.7 (2024-11-20)
Nova Lite: 56.2 (2024-11-20)
Gemini 2.0 Flash: 70.7 (2024-12-11)
DeepSeek VL2: 51.1 (2024-12-13)
DeepSeek VL2 Small: 48 (2024-12-13)
DeepSeek VL2 Tiny: 40.7 (2024-12-13)
QvQ-72B-Preview: 70.3 (2024-12-25)
Kimi-k1.5: 70 (2025-01-20)
Gemini 2.0 Flash Thinking: 75.4 (2025-01-21)
Qwen2.5 VL 7B Instruct: 58.6 (2025-01-26)
Qwen2.5 VL 72B Instruct: 70.2 (2025-02-01)
Phi-4-multimodal-instruct: 55.1 (2025-02-01)
Claude 3.7 Sonnet: 75 (2025-02-24)
Gemini 2.0 Flash Lite: 68 (2025-02-25)
GPT-4.5: 75.2 (2025-02-27)
Qwen2.5 VL 32B Instruct: 70 (2025-02-28)
Mistral Small 3.1 24B Instruct: 59.3 (2025-03-17)
Mistral Small 3.1 24B Base: 59.3 (2025-03-17)
Qwen2.5-Omni-7B: 59.2 (2025-03-27)
Llama 4 Maverick: 73.4 (2025-04-05)
Llama 4 Scout: 69.4 (2025-04-05)
GPT-4.1: 74.8 (2025-04-14)
GPT-4.1 Mini: 72.7 (2025-04-14)
GPT-4.1 Nano: 55.4 (2025-04-14)
o4-mini: 81.6 (2025-04-16)
Gemini 2.5 Flash: 79.7 (2025-04-17)
Claude Sonnet 4: 74.4 (2025-05-22)
Gemini 2.5 Pro Preview 06-05: 82 (2025-06-05)
Mistral Small 3.2 24B Instruct: 62.5 (2025-06-20)
Gemini 2.5 Flash Lite: 72.9 (2025-07-22)
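The "best score to date" line can be reproduced as a running maximum over the points sorted by release date. The following is a minimal sketch, not the site's actual code: the function name `running_best` and the tie-breaking rule (on a shared release date, the higher score is considered first) are assumptions made for illustration, and only a handful of the chart's points are included.

```python
from datetime import date


def running_best(points):
    """Return the points that set a new best score, in release order.

    `points` is a list of (model_name, release_date, score) tuples.
    """
    # Sort by date; on the same date, put the higher score first so that
    # only the day's best model can set a new record (assumed tie rule).
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    best = float("-inf")
    frontier = []
    for name, released, score in ordered:
        if score > best:
            best = score
            frontier.append((name, released, score))
    return frontier


# A subset of the points shown in the chart.
points = [
    ("GPT-3.5 Turbo", date(2023, 3, 1), 0.0),
    ("Gemini 1.0 Pro", date(2024, 2, 15), 47.9),
    ("Gemini 1.5 Pro", date(2024, 2, 15), 65.9),
    ("GPT-4o", date(2024, 5, 13), 72.2),
    ("Claude 3.5 Sonnet", date(2024, 6, 20), 68.3),
    ("o1", date(2024, 12, 5), 77.6),
    ("Gemini 2.0 Flash", date(2024, 12, 11), 70.7),
]

for name, released, score in running_best(points):
    print(f"{released.isoformat()}  {name}: {score}")
```

Models released after a stronger one, such as Claude 3.5 Sonnet and Gemini 2.0 Flash in this subset, appear as points on the chart but do not move the line.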

Ranking

1. GPT-5: 84.2
2. o3: 82.9
3. Gemini 2.5 Pro Preview 06-05: 82
4. o4-mini: 81.6
5. Gemini 2.5 Flash: 79.7
6. Gemini 2.5 Pro: 79.6
7. Grok-3: 78
8. o1: 77.6
9. Gemini 2.0 Flash Thinking: 75.4
10. GPT-4.5: 75.2
11. Claude 3.7 Sonnet: 75
12. GPT-4.1: 74.8
13. Claude Sonnet 4: 74.4
14. Llama 4 Maverick: 73.4
15. Gemini 2.5 Flash Lite: 72.9
16. GPT-4.1 Mini: 72.7
17. GPT-4o: 72.2
18. Gemini 2.0 Flash: 70.7
19. QvQ-72B-Preview: 70.3
20. Qwen2.5 VL 72B Instruct: 70.2
21. Qwen2.5 VL 32B Instruct: 70
22. Kimi-k1.5: 70
23. Llama 4 Scout: 69.4
24. Claude 3.5 Sonnet: 68.3
25. Gemini 2.0 Flash Lite: 68
26. Grok-2: 66.1
27. Gemini 1.5 Pro: 65.9
28. Pixtral Large: 64
29. Grok-2 mini: 63.2
30. Mistral Small 3.2 24B Instruct: 62.5
31. Gemini 1.5 Flash: 62.3
32. Nova Pro: 61.7
33. Llama 3.2 90B Instruct: 60.3
34. GPT-4o-mini: 59.4
35. Mistral Small 3.1 24B Instruct: 59.3
36. Mistral Small 3.1 24B Base: 59.3
37. Qwen2.5-Omni-7B: 59.2
38. Qwen2.5 VL 7B Instruct: 58.6
39. Nova Lite: 56.2
40. GPT-4.1 Nano: 55.4
41. Phi-4-multimodal-instruct: 55.1
42. Gemini 1.5 Flash 8B: 53.7
43. Grok-1.5V: 53.6
44. Grok-1.5: 53.6
45. Pixtral-12B: 52.5
46. DeepSeek VL2: 51.1
47. Llama 3.2 11B Instruct: 50.7
48. DeepSeek VL2 Small: 48
49. Gemini 1.0 Pro: 47.9
50. Phi-3.5-vision-instruct: 43
51. DeepSeek VL2 Tiny: 40.7
52. GPT-3.5 Turbo: 0
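
The summary figures at the top of the page (model count, top score, median) can be checked against the ranking. The sketch below uses Python's standard `statistics` module on the 52 scores listed above. How the site computes its median is not stated; treating it as a lower median (the lower of the two middle values when the count is even) is an assumption, made only because that reading matches the reported 65.9.

```python
import statistics

# The 52 scores from the ranking, in rank order.
scores = [
    84.2, 82.9, 82, 81.6, 79.7, 79.6, 78, 77.6, 75.4, 75.2,
    75, 74.8, 74.4, 73.4, 72.9, 72.7, 72.2, 70.7, 70.3, 70.2,
    70, 70, 69.4, 68.3, 68, 66.1, 65.9, 64, 63.2, 62.5,
    62.3, 61.7, 60.3, 59.4, 59.3, 59.3, 59.2, 58.6, 56.2, 55.4,
    55.1, 53.7, 53.6, 53.6, 52.5, 51.1, 50.7, 48, 47.9, 43,
    40.7, 0,
]

count = len(scores)
top = max(scores)
# Conventional median: mean of the two middle values for an even count.
median_mid = statistics.median(scores)
# Lower median: the smaller of the two middle values (assumed convention).
median_low = statistics.median_low(scores)

print(f"models={count} top={top} "
      f"median={median_mid:.1f} lower_median={median_low}")
```

With the scores as listed, the two middle values are 65.9 and 66.1, so the lower median is 65.9 and the conventional median is 66.0.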
