Test-Time Compute Scaling
Improving accuracy by spending more computation at inference time, through longer reasoning or multiple samples, rather than only during training.
Instead of scaling parameters or training data, this approach trades extra inference compute for accuracy in one of three ways: long chain-of-thought “thinking,” parallel sampling with verification or voting, or iterative self-refinement.
It became a defining paradigm of 2025 after OpenAI's o1 and DeepSeek-R1, with controllable “thinking budgets” and “Deep Think” modes operationalizing it as a second scaling axis alongside pretraining compute.
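The parallel-sampling-with-voting variant can be illustrated with a minimal sketch. It is not taken from any of the papers listed below. The names `majority_vote`, `self_consistency`, and `make_noisy_sampler` are hypothetical, and the noisy sampler is a toy stand-in for a real model call. The sample count `n_samples` plays the role of the inference compute budget.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def self_consistency(
    sample_answer: Callable[[str], str],  # hypothetical: one model call -> final answer
    prompt: str,
    n_samples: int,
) -> str:
    """Draw n_samples independent answers and return the majority answer."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return majority_vote(answers)


def make_noisy_sampler(correct: str, p_correct: float, seed: int) -> Callable[[str], str]:
    """Toy stand-in for a model (assumption): right with probability p_correct."""
    rng = random.Random(seed)

    def sample(prompt: str) -> str:
        if rng.random() < p_correct:
            return correct
        # Wrong answers are spread over several values, so they rarely agree.
        return str(rng.randint(0, 9))

    return sample


if __name__ == "__main__":
    sampler = make_noisy_sampler(correct="42", p_correct=0.4, seed=0)
    for n in (1, 5, 25, 125):
        print(n, self_consistency(sampler, "What is 6 * 7?", n))
```

In this toy setup, voting helps because wrong answers are scattered while correct ones coincide. With a real model, the gain depends on how often errors agree with each other, which is why verification is sometimes used in place of plain voting.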
Related papers
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
- Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
- Qwen3 Technical Report
- Olympiad-level Formal Mathematical Reasoning with Reinforcement Learning (AlphaProof)