LiveBench

Models: 13
Top score: 84.6
Median: 73.1

LiveBench is a challenging LLM benchmark designed to limit test set contamination: it releases new questions monthly, drawn from recently released datasets, arXiv papers, news articles, and IMDb movie synopses. It comprises tasks across math, coding, reasoning, language, instruction following, and data analysis, each with a verifiable, objective ground-truth answer.

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

Chart data (score axis from 20 to 90; release dates from 2024 to 2025), listed as model: score (release date):

o1-preview: 52.3 (2024-09-12)
Qwen2.5 72B Instruct: 52.3 (2024-09-19)
Qwen2.5 7B Instruct: 35.9 (2024-10-16)
o1: 67 (2024-12-05)
Phi 4: 47.6 (2025-01-10)
o3-mini: 84.6 (2025-01-31)
QwQ-32B: 73.1 (2025-03-05)
Qwen2.5-Omni-7B: 29.6 (2025-03-27)
Qwen3 235B A22B: 77.1 (2025-04-28)
Qwen3 32B: 74.9 (2025-04-28)
Qwen3 30B A3B: 74.3 (2025-04-28)
Kimi K2 Instruct: 76.4 (2025-07-11)
Kimi K2-Instruct-0905: 76.4 (2025-09-05)

Points labeled on the best-score line: o1-preview, o1, o3-mini.
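The "best score to date" line can be reproduced from the chart points as a running maximum over release dates. The sketch below is illustrative only: the function name `sota_over_time` is hypothetical, and it assumes that only a strict improvement counts as a new record, so a later model that merely ties the current best does not move the line.

```python
from datetime import date

# (model, score, release date), copied from the chart points above.
points = [
    ("Qwen2.5 72B Instruct", 52.3, date(2024, 9, 19)),
    ("Qwen2.5 7B Instruct", 35.9, date(2024, 10, 16)),
    ("Phi 4", 47.6, date(2025, 1, 10)),
    ("QwQ-32B", 73.1, date(2025, 3, 5)),
    ("Qwen2.5-Omni-7B", 29.6, date(2025, 3, 27)),
    ("Qwen3 235B A22B", 77.1, date(2025, 4, 28)),
    ("Qwen3 32B", 74.9, date(2025, 4, 28)),
    ("Qwen3 30B A3B", 74.3, date(2025, 4, 28)),
    ("Kimi K2 Instruct", 76.4, date(2025, 7, 11)),
    ("Kimi K2-Instruct-0905", 76.4, date(2025, 9, 5)),
    ("o1-preview", 52.3, date(2024, 9, 12)),
    ("o1", 67.0, date(2024, 12, 5)),
    ("o3-mini", 84.6, date(2025, 1, 31)),
]


def sota_over_time(points):
    """Return the points that strictly raised the best score, in date order.

    Hypothetical helper: ties with the current best are NOT treated as a
    new record (an assumption, not something the page states).
    """
    best = float("-inf")
    frontier = []
    for name, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:
            best = score
            frontier.append((name, score, released))
    return frontier


frontier = sota_over_time(points)
for name, score, released in frontier:
    print(f"{released.isoformat()}  {name}: {score}")
```

Under that assumption the record holders are o1-preview, o1, and o3-mini, which matches the three labels on the chart; no model released after 2025-01-31 exceeds 84.6.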

Ranking

1. o3-mini: 84.6
2. Qwen3 235B A22B: 77.1
3. Kimi K2-Instruct-0905: 76.4
4. Kimi K2 Instruct: 76.4
5. Qwen3 32B: 74.9
6. Qwen3 30B A3B: 74.3
7. QwQ-32B: 73.1
8. o1: 67
9. o1-preview: 52.3
10. Qwen2.5 72B Instruct: 52.3
11. Phi 4: 47.6
12. Qwen2.5 7B Instruct: 35.9
13. Qwen2.5-Omni-7B: 29.6
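
The summary figures reported for this benchmark (13 models, top score 84.6, median 73.1) can be checked against the ranking. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and the scores are copied from the list above.

```python
from statistics import median

# Scores copied from the ranking, keyed by model name.
scores = {
    "o3-mini": 84.6,
    "Qwen3 235B A22B": 77.1,
    "Kimi K2-Instruct-0905": 76.4,
    "Kimi K2 Instruct": 76.4,
    "Qwen3 32B": 74.9,
    "Qwen3 30B A3B": 74.3,
    "QwQ-32B": 73.1,
    "o1": 67.0,
    "o1-preview": 52.3,
    "Qwen2.5 72B Instruct": 52.3,
    "Phi 4": 47.6,
    "Qwen2.5 7B Instruct": 35.9,
    "Qwen2.5-Omni-7B": 29.6,
}

n_models = len(scores)
top_model = max(scores, key=scores.get)   # model with the highest score
top_score = scores[top_model]
median_score = median(scores.values())    # middle value of the 13 scores

print(f"Models: {n_models}")
print(f"Top score: {top_score} ({top_model})")
print(f"Median: {median_score}")
```

With an odd number of entries the median is simply the seventh-ranked score, here QwQ-32B at 73.1.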

Related General benchmarks