Test-Time Compute Scaling

Improving accuracy by spending more computation at inference — longer reasoning or multiple samples — rather than only at training.

Instead of scaling parameters or training data, the model trades extra inference compute for accuracy via long chain-of-thought “thinking,” parallel sampling with verification or voting, or iterative self-refinement.
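The parallel-sampling variants can be illustrated with a minimal Python sketch. It assumes a toy stand-in for a model: `noisy_solver` is a hypothetical sampler that returns the right answer only some of the time, and `majority_vote`, `best_of_n`, and `sample_and_vote` are illustrative names, not any library's API. The point is only that drawing more samples and aggregating them can raise accuracy without changing the underlying model.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Voting: return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Verification: return the candidate that a scoring function rates highest."""
    return max(candidates, key=score)


def noisy_solver(rng: random.Random, correct: str = "42", p_correct: float = 0.6) -> str:
    """Hypothetical stand-in for one sampled model answer.

    Returns `correct` with probability `p_correct`, otherwise a random digit.
    """
    if rng.random() < p_correct:
        return correct
    return str(rng.randint(0, 9))


def sample_and_vote(sample: Callable[[], str], n: int) -> str:
    """Spend n samples of inference compute, then aggregate by majority vote."""
    return majority_vote([sample() for _ in range(n)])


if __name__ == "__main__":
    rng = random.Random(0)
    # More samples means more inference compute and a more reliable vote.
    for n in (1, 5, 25, 101):
        print(n, sample_and_vote(lambda: noisy_solver(rng), n))
```

Voting needs no extra component but only helps when the correct answer is the most common one; best-of-N instead relies on a verifier or reward function (here the assumed `score` callable) being good enough to pick out a correct sample.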

It became a defining paradigm of 2025 after OpenAI's o1 and DeepSeek-R1, with controllable “thinking budgets” and “Deep Think” modes operationalizing it as a second scaling axis alongside pretraining compute.

Related papers