LiveBench
Models: 13
Top score: 84.6
Median: 73.1
LiveBench is a challenging, contamination-limited LLM benchmark. It limits test set contamination by releasing new questions monthly, drawn from recently released datasets, arXiv papers, news articles, and IMDb movie synopses. Its tasks span math, coding, reasoning, language, instruction following, and data analysis, each with a verifiable, objective ground-truth answer.
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
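The "best score to date" line described here is a running maximum over models ordered by release date. The sketch below shows one way to compute it. The `points` values are hypothetical placeholders for illustration, not real models or results, and `best_to_date` is an illustrative helper name, not part of any LiveBench tooling.

```python
from datetime import date
from itertools import accumulate

# Hypothetical (release_date, score) pairs for illustration only;
# these are NOT real models or real LiveBench results.
points = [
    (date(2024, 6, 1), 50.0),
    (date(2024, 1, 1), 40.0),
    (date(2024, 9, 1), 70.0),
    (date(2024, 3, 1), 55.0),
]

def best_to_date(points):
    """Return (release_date, best_score_so_far) pairs in chronological order."""
    ordered = sorted(points, key=lambda p: p[0])
    # accumulate with max yields the running maximum of the scores.
    running_best = accumulate((score for _, score in ordered), max)
    return [(d, best) for (d, _), best in zip(ordered, running_best)]

frontier = best_to_date(points)
for release_date, best in frontier:
    print(release_date.isoformat(), best)
```

A model that scores below the current best leaves the line flat at its release date, and only a new top score raises it.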
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | o3-mini | 84.6 |
| 2 | Qwen3 235B A22B | 77.1 |
| 3 | Kimi K2-Instruct-0905 | 76.4 |
| 4 | Kimi K2 Instruct | 76.4 |
| 5 | Qwen3 32B | 74.9 |
| 6 | Qwen3 30B A3B | 74.3 |
| 7 | QwQ-32B | 73.1 |
| 8 | o1 | 67 |
| 9 | o1-preview | 52.3 |
| 10 | Qwen2.5 72B Instruct | 52.3 |
| 11 | Phi 4 | 47.6 |
| 12 | Qwen2.5 7B Instruct | 35.9 |
| 13 | Qwen2.5-Omni-7B | 29.6 |
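The summary figures at the top of the page (model count, top score, median) can be reproduced from the scores in the table above. The following is a minimal sketch that assumes the median is the plain statistical median of the listed scores. The variable names are illustrative.

```python
from statistics import median

# Scores copied from the ranking table above, in rank order.
scores = [84.6, 77.1, 76.4, 76.4, 74.9, 74.3, 73.1,
          67, 52.3, 52.3, 47.6, 35.9, 29.6]

summary = {
    "models": len(scores),      # number of ranked models
    "top_score": max(scores),   # best score in the table
    "median": median(scores),   # middle value of the sorted scores
}
print(summary)
```

With 13 entries the median is the seventh-ranked score, which matches the 73.1 reported in the summary.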