Training

Training Compute-Optimal Large Language Models

DeepMind · March 29, 2022

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al.

TL;DR

Shows that most large language models of the time were undertrained: for a fixed compute budget, model size and training tokens should be scaled in equal proportion, at roughly 20 training tokens per parameter.
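
As a rough illustration of the arithmetic behind this rule of thumb, the sketch below splits a compute budget between parameters and tokens. It assumes the commonly used approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D) and treats the 20 tokens-per-parameter ratio as a fixed constant; the function name and constants are illustrative and are not taken from the paper's code.

```python
import math

# Assumption: rule-of-thumb ratio quoted in the TL;DR, treated as a constant.
TOKENS_PER_PARAM = 20.0
# Assumption: common approximation of training cost, C ≈ 6 * N * D FLOPs.
FLOPS_PER_PARAM_TOKEN = 6.0


def compute_optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Return (n_params, n_tokens) for a given training budget in FLOPs.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2,
    so N = sqrt(C / 120) and D = 20 * N.
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Hypothetical budget chosen only to make the numbers easy to read.
    params, tokens = compute_optimal_allocation(1.2e20)
    print(f"parameters: {params:.3e}, tokens: {tokens:.3e}")
```

Under these assumptions both quantities grow with the square root of compute, so quadrupling the budget doubles the model size and the token count alike.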

Why it matters

The "Chinchilla" scaling result changed how labs allocate compute, shifting emphasis from ever-larger models toward more training data and influencing the design of many of the models that followed.

Related terms

Scaling Laws