Training
Training Compute-Optimal Large Language Models
DeepMind·March 29, 2022
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al.
TL;DR
Shows that most large language models of the time were undertrained: for a fixed compute budget, model size and the number of training tokens should be scaled up in equal proportion, which works out to roughly 20 training tokens per parameter.
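As a rough illustration of the 20-tokens-per-parameter rule, the sketch below splits a compute budget between parameters and tokens. It assumes the common approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6·N·D), which is not stated in this summary, and treats the ratio of 20 as a fixed rule of thumb rather than a fitted value. The function and constant names are hypothetical.

```python
import math

# Assumption: training compute C ≈ 6 * N * D FLOPs
# (N = parameters, D = training tokens). This is a common approximation.
FLOPS_PER_PARAM_TOKEN = 6.0

# Rule of thumb from the summary: roughly 20 training tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (n_params, n_tokens) for a given FLOP budget.

    Solves C = 6 * N * D together with D = tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Hypothetical budget chosen so the result is easy to read.
    budget = 6 * 70e9 * 1.4e12
    n, d = compute_optimal_allocation(budget)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

With the budget used above, the sketch returns about 70 billion parameters and 1.4 trillion tokens, matching the size and token count reported for Chinchilla. Because both quantities grow as the square root of compute under these assumptions, quadrupling the budget doubles each of them.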
Why it matters
The "Chinchilla" scaling result changed how labs allocate compute, shifting emphasis from ever-larger parameter counts toward more training data, and it influenced the design of many of the models that followed.