MMMU-Pro
Models: 13
Top score: 78.4
Median: 49.5
A more robust multi-discipline multimodal understanding benchmark that strengthens MMMU through a three-step process: filtering out questions that can be answered from text alone, augmenting the candidate options, and introducing a vision-only input setting. Models score substantially lower on MMMU-Pro than on the original MMMU (16.8% to 26.9% lower, depending on the model), which makes it a more rigorous evaluation that more closely mimics real-world scenarios.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
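The "best score to date" line is a running maximum over models ordered by release date. A minimal sketch of that calculation follows. The release dates and scores here are hypothetical placeholders, not entries from this leaderboard, and `best_to_date` is an illustrative helper name.

```python
from itertools import accumulate

# Hypothetical (release_date, score) pairs used only for illustration;
# they are NOT real leaderboard entries.
points = [
    ("2024-03-01", 40.0),
    ("2024-01-15", 35.0),
    ("2024-06-10", 38.0),
    ("2024-09-20", 52.5),
]

def best_to_date(points):
    """Return (date, best score so far) pairs in chronological order."""
    ordered = sorted(points)  # ISO-8601 date strings sort chronologically
    dates = [date for date, _ in ordered]
    # accumulate with max yields the running maximum of the scores
    running_best = accumulate((score for _, score in ordered), max)
    return list(zip(dates, running_best))

frontier = best_to_date(points)
```

A model that scores below the current best (the third point by date in this toy data) leaves the line flat rather than pulling it down.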
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | GPT-5 | 78.4 |
| 2 | Claude Opus 4.6 | 77.3 |
| 3 | o3 | 76.4 |
| 4 | GPT-4o | 59.9 |
| 5 | Llama 4 Maverick | 59.6 |
| 6 | Qwen2.5 VL 72B Instruct | 51.1 |
| 7 | Qwen2.5 VL 32B Instruct | 49.5 |
| 8 | Qwen2-VL-72B-Instruct | 46.2 |
| 9 | Llama 3.2 90B Instruct | 45.2 |
| 10 | Phi-4-multimodal-instruct | 38.5 |
| 11 | Qwen2.5 VL 7B Instruct | 38.3 |
| 12 | Qwen2.5-Omni-7B | 36.6 |
| 13 | Llama 3.2 11B Instruct | 33.0 |
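The summary figures (model count, top score, median) can be recomputed from the ranking table. The sketch below copies the scores from the table; the variable names are illustrative, and it assumes the median is the plain statistical median of the listed scores.

```python
from statistics import median

# Scores copied from the ranking table above.
scores = {
    "GPT-5": 78.4,
    "Claude Opus 4.6": 77.3,
    "o3": 76.4,
    "GPT-4o": 59.9,
    "Llama 4 Maverick": 59.6,
    "Qwen2.5 VL 72B Instruct": 51.1,
    "Qwen2.5 VL 32B Instruct": 49.5,
    "Qwen2-VL-72B-Instruct": 46.2,
    "Llama 3.2 90B Instruct": 45.2,
    "Phi-4-multimodal-instruct": 38.5,
    "Qwen2.5 VL 7B Instruct": 38.3,
    "Qwen2.5-Omni-7B": 36.6,
    "Llama 3.2 11B Instruct": 33.0,
}

n_models = len(scores)
top_model = max(scores, key=scores.get)  # model with the highest score
top_score = scores[top_model]
median_score = median(scores.values())   # middle value of the 13 scores

print(f"{n_models} models, top {top_score} ({top_model}), median {median_score}")
```

With 13 entries the median is the seventh-ranked score, which is why it coincides with Qwen2.5 VL 32B Instruct.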