MMMU-Pro

Models: 13
Top score: 78.4
Median: 49.5

MMMU-Pro is a more robust version of MMMU, the multi-discipline multimodal understanding benchmark. It strengthens MMMU through a three-step process: filtering out questions that can be answered from text alone, augmenting the candidate options, and introducing a vision-only input setting. Models score significantly lower on MMMU-Pro than on the original MMMU (by 16.8-26.9%), which makes it a more rigorous evaluation that more closely mimics real-world scenarios.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: MMMU-Pro score (y-axis, about 20 to 90) against release date (x-axis, 2024 to 2026). Labeled points: GPT-4o, o3, GPT-5.]

Data points, in release order:

GPT-4o: 59.9 (2024-05-13)
Qwen2-VL-72B-Instruct: 46.2 (2024-08-29)
Llama 3.2 90B Instruct: 45.2 (2024-09-25)
Llama 3.2 11B Instruct: 33 (2024-09-25)
Qwen2.5 VL 7B Instruct: 38.3 (2025-01-26)
Qwen2.5 VL 72B Instruct: 51.1 (2025-02-01)
Phi-4-multimodal-instruct: 38.5 (2025-02-01)
Qwen2.5 VL 32B Instruct: 49.5 (2025-02-28)
Qwen2.5-Omni-7B: 36.6 (2025-03-27)
Llama 4 Maverick: 59.6 (2025-04-05)
o3: 76.4 (2025-04-16)
GPT-5: 78.4 (2025-08-07)
Claude Opus 4.6: 77.3 (2026-02-05)
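
The "best score to date" line described above is a running maximum over the points sorted by release date. A minimal sketch, assuming the scores and dates listed for the chart on this page; the `points` list and the `best_to_date` helper are illustrative names, not part of the site:

```python
from datetime import date

# (model, score, release date), transcribed by hand from the chart data on this page.
points = [
    ("Qwen2-VL-72B-Instruct", 46.2, date(2024, 8, 29)),
    ("Llama 3.2 90B Instruct", 45.2, date(2024, 9, 25)),
    ("Llama 3.2 11B Instruct", 33.0, date(2024, 9, 25)),
    ("Qwen2.5 VL 7B Instruct", 38.3, date(2025, 1, 26)),
    ("Qwen2.5 VL 72B Instruct", 51.1, date(2025, 2, 1)),
    ("Phi-4-multimodal-instruct", 38.5, date(2025, 2, 1)),
    ("Qwen2.5 VL 32B Instruct", 49.5, date(2025, 2, 28)),
    ("Qwen2.5-Omni-7B", 36.6, date(2025, 3, 27)),
    ("Llama 4 Maverick", 59.6, date(2025, 4, 5)),
    ("Claude Opus 4.6", 77.3, date(2026, 2, 5)),
    ("GPT-4o", 59.9, date(2024, 5, 13)),
    ("o3", 76.4, date(2025, 4, 16)),
    ("GPT-5", 78.4, date(2025, 8, 7)),
]


def best_to_date(points):
    """Return the (date, model, score) steps at which the running best improves."""
    frontier = []
    best = float("-inf")
    # Walk the models in release order and keep only strict improvements.
    for model, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            frontier.append((released, model, score))
    return frontier


for released, model, score in best_to_date(points):
    print(f"{released.isoformat()}  {model}: {score}")
```

Under these assumptions the line steps up only at GPT-4o, o3, and GPT-5. Llama 4 Maverick (59.6) falls just short of GPT-4o's earlier 59.9, and Claude Opus 4.6 (77.3) was released after GPT-5 had already set 78.4.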

Ranking

1. GPT-5: 78.4
2. Claude Opus 4.6: 77.3
3. o3: 76.4
4. GPT-4o: 59.9
5. Llama 4 Maverick: 59.6
6. Qwen2.5 VL 72B Instruct: 51.1
7. Qwen2.5 VL 32B Instruct: 49.5
8. Qwen2-VL-72B-Instruct: 46.2
9. Llama 3.2 90B Instruct: 45.2
10. Phi-4-multimodal-instruct: 38.5
11. Qwen2.5 VL 7B Instruct: 38.3
12. Qwen2.5-Omni-7B: 36.6
13. Llama 3.2 11B Instruct: 33
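
The summary figures at the top of the page (13 models, top score 78.4, median 49.5) can be checked against the ranking. A minimal sketch using only Python's standard library; the `scores` dictionary is the ranking transcribed by hand and is an illustrative name, not an export from the site:

```python
from statistics import median

# Ranking scores transcribed from this page (model -> MMMU-Pro score).
scores = {
    "GPT-5": 78.4,
    "Claude Opus 4.6": 77.3,
    "o3": 76.4,
    "GPT-4o": 59.9,
    "Llama 4 Maverick": 59.6,
    "Qwen2.5 VL 72B Instruct": 51.1,
    "Qwen2.5 VL 32B Instruct": 49.5,
    "Qwen2-VL-72B-Instruct": 46.2,
    "Llama 3.2 90B Instruct": 45.2,
    "Phi-4-multimodal-instruct": 38.5,
    "Qwen2.5 VL 7B Instruct": 38.3,
    "Qwen2.5-Omni-7B": 36.6,
    "Llama 3.2 11B Instruct": 33.0,
}

n_models = len(scores)
top_model = max(scores, key=scores.get)  # model with the highest score
top_score = scores[top_model]
median_score = median(scores.values())   # middle value of the 13 scores

print(f"Models: {n_models}")
print(f"Top score: {top_score} ({top_model})")
print(f"Median: {median_score}")
```

With an odd number of models the median is simply the seventh-ranked score, which here belongs to Qwen2.5 VL 32B Instruct.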