Architecture · Training

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Google·January 11, 2021

William Fedus, Barret Zoph, Noam Shazeer


TL;DR

Simplifies mixture-of-experts routing by sending each token to a single expert (top-1 routing), scaling Transformers to over a trillion parameters while keeping compute per token roughly constant.
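
As a rough illustration of single-expert routing, the sketch below scores a token against each expert, picks the highest-probability expert, and scales that expert's output by its router probability. It is a minimal toy in plain Python, not the paper's implementation: the names `switch_route` and `switch_layer`, the scalar-multiplier "experts", and the sizes are assumptions made for the example, and it leaves out expert capacity limits and the auxiliary load-balancing loss.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(token, router_weights):
    """Pick one expert for a token (top-1 routing).

    router_weights holds one weight vector per expert; the router logit for
    an expert is the dot product of the token with that expert's vector.
    Returns (expert_index, gate_probability).
    """
    logits = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def switch_layer(tokens, router_weights, experts):
    """Run exactly one expert per token and scale its output by the gate."""
    outputs = []
    for token in tokens:
        expert, gate = switch_route(token, router_weights)
        expert_out = experts[expert](token)
        outputs.append([gate * y for y in expert_out])
    return outputs


if __name__ == "__main__":
    random.seed(0)
    d_model, n_experts = 4, 8  # hypothetical toy sizes
    router_weights = [
        [random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)
    ]
    # Toy experts: each multiplies its input by a different constant,
    # standing in for the per-expert feed-forward networks.
    experts = [lambda x, k=k: [(k + 1) * v for v in x] for k in range(n_experts)]
    tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]
    for token in tokens:
        print(switch_route(token, router_weights))
```

In this sketch, raising `n_experts` adds parameters, but each token still runs through only one expert, which is why the expert computation per token stays fixed.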

Why it matters

Made sparse mixture-of-experts scaling practical and influenced the architecture of many later frontier models.
