AI2D

AI2D is a dataset of 4,903 illustrative diagrams from grade-school natural science (such as food webs, human physiology, and life cycles) with over 15,000 multiple-choice questions and answers.

Models: 17
Top score: 94.7
Median: 84.5

The benchmark evaluates diagram understanding and visual reasoning: models must interpret diagrammatic elements, their relationships, and the overall structure to answer questions about the scientific concepts depicted.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart data (model: score, release date), in chronological order:

Grok-1.5V: 88.3 (2024-04-12)
GPT-4o: 94.2 (2024-05-13)
Claude 3.5 Sonnet: 94.7 (2024-06-20)
Phi-3.5-vision-instruct: 78.1 (2024-08-23)
Llama 3.2 90B Instruct: 92.3 (2024-09-25)
Llama 3.2 11B Instruct: 91.1 (2024-09-25)
Pixtral Large: 93.8 (2024-11-19)
DeepSeek VL2: 81.4 (2024-12-13)
DeepSeek VL2 Small: 80 (2024-12-13)
DeepSeek VL2 Tiny: 71.6 (2024-12-13)
Qwen2.5 VL 72B Instruct: 88.4 (2025-02-01)
Phi-4-multimodal-instruct: 82.3 (2025-02-01)
Gemma 3 27B: 84.5 (2025-03-12)
Gemma 3 12B: 84.2 (2025-03-13)
Gemma 3 4B: 74.8 (2025-03-13)
Qwen2.5-Omni-7B: 83.2 (2025-03-27)
Mistral Small 3.2 24B Instruct: 92.9 (2025-06-20)
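
The "best score to date" line can be read as a running maximum over the models sorted by release date. The following is a minimal sketch of that idea, not the site's actual plotting code; the names `RESULTS` and `best_to_date` are illustrative, and the scores and dates are the ones listed on this page.

```python
from datetime import date

# (model, score, release date) as listed on this page
RESULTS = [
    ("Grok-1.5V", 88.3, date(2024, 4, 12)),
    ("GPT-4o", 94.2, date(2024, 5, 13)),
    ("Claude 3.5 Sonnet", 94.7, date(2024, 6, 20)),
    ("Phi-3.5-vision-instruct", 78.1, date(2024, 8, 23)),
    ("Llama 3.2 90B Instruct", 92.3, date(2024, 9, 25)),
    ("Llama 3.2 11B Instruct", 91.1, date(2024, 9, 25)),
    ("Pixtral Large", 93.8, date(2024, 11, 19)),
    ("DeepSeek VL2", 81.4, date(2024, 12, 13)),
    ("DeepSeek VL2 Small", 80.0, date(2024, 12, 13)),
    ("DeepSeek VL2 Tiny", 71.6, date(2024, 12, 13)),
    ("Qwen2.5 VL 72B Instruct", 88.4, date(2025, 2, 1)),
    ("Phi-4-multimodal-instruct", 82.3, date(2025, 2, 1)),
    ("Gemma 3 27B", 84.5, date(2025, 3, 12)),
    ("Gemma 3 12B", 84.2, date(2025, 3, 13)),
    ("Gemma 3 4B", 74.8, date(2025, 3, 13)),
    ("Qwen2.5-Omni-7B", 83.2, date(2025, 3, 27)),
    ("Mistral Small 3.2 24B Instruct", 92.9, date(2025, 6, 20)),
]


def best_to_date(results):
    """Return the (release date, model, score) entries that set a new best score."""
    frontier = []
    best = float("-inf")
    # Walk the models in release order and keep only those that beat the running best.
    for model, score, released in sorted(results, key=lambda r: r[2]):
        if score > best:
            best = score
            frontier.append((released, model, score))
    return frontier


for released, model, score in best_to_date(RESULTS):
    print(released.isoformat(), model, score)
```

On this data the line is set by Grok-1.5V, then GPT-4o, then Claude 3.5 Sonnet; no later release on the list exceeds 94.7.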

Ranking

1. Claude 3.5 Sonnet: 94.7
2. GPT-4o: 94.2
3. Pixtral Large: 93.8
4. Mistral Small 3.2 24B Instruct: 92.9
5. Llama 3.2 90B Instruct: 92.3
6. Llama 3.2 11B Instruct: 91.1
7. Qwen2.5 VL 72B Instruct: 88.4
8. Grok-1.5V: 88.3
9. Gemma 3 27B: 84.5
10. Gemma 3 12B: 84.2
11. Qwen2.5-Omni-7B: 83.2
12. Phi-4-multimodal-instruct: 82.3
13. DeepSeek VL2: 81.4
14. DeepSeek VL2 Small: 80
15. Phi-3.5-vision-instruct: 78.1
16. Gemma 3 4B: 74.8
17. DeepSeek VL2 Tiny: 71.6
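
The headline figures (17 models, top score 94.7, median 84.5) follow directly from the ranking. As a quick check, here is a small sketch that recomputes them; the dictionary name `SCORES` is illustrative, and the values are copied from the ranking on this page.

```python
from statistics import median

# Scores copied from the ranking on this page
SCORES = {
    "Claude 3.5 Sonnet": 94.7,
    "GPT-4o": 94.2,
    "Pixtral Large": 93.8,
    "Mistral Small 3.2 24B Instruct": 92.9,
    "Llama 3.2 90B Instruct": 92.3,
    "Llama 3.2 11B Instruct": 91.1,
    "Qwen2.5 VL 72B Instruct": 88.4,
    "Grok-1.5V": 88.3,
    "Gemma 3 27B": 84.5,
    "Gemma 3 12B": 84.2,
    "Qwen2.5-Omni-7B": 83.2,
    "Phi-4-multimodal-instruct": 82.3,
    "DeepSeek VL2": 81.4,
    "DeepSeek VL2 Small": 80.0,
    "Phi-3.5-vision-instruct": 78.1,
    "Gemma 3 4B": 74.8,
    "DeepSeek VL2 Tiny": 71.6,
}

n_models = len(SCORES)
top_model = max(SCORES, key=SCORES.get)   # model with the highest score
top_score = SCORES[top_model]
median_score = median(SCORES.values())    # middle value of the 17 sorted scores

print(f"Models: {n_models}")
print(f"Top score: {top_score} ({top_model})")
print(f"Median: {median_score}")
```

With an odd number of entries, the median is simply the ninth-ranked score, here Gemma 3 27B.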

Related Multimodal benchmarks