Architecture · Training
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Google · January 11, 2021
William Fedus, Barret Zoph, Noam Shazeer
TL;DR
Simplifies mixture-of-experts routing by sending each token to a single expert (top-1 routing), scaling Transformers to trillion-parameter size at constant compute per token.
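A minimal sketch of single-expert routing, in plain Python, may make the idea concrete. It is not the paper's implementation: the names `switch_route` and `switch_layer`, the list-of-lists weight layout, and the toy experts are assumptions made for illustration, and the sketch leaves out expert capacity limits and the load-balancing loss. It shows only that a router scores every expert for each token, the token is processed by the highest-scoring expert alone, and that expert's output is scaled by the router probability.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(tokens, router_weights):
    """Pick one expert per token (top-1 routing).

    tokens: list of d-dimensional vectors (lists of floats).
    router_weights: one d-dimensional vector per expert (hypothetical layout).
    Returns a list of (expert_index, gate_probability), one pair per token.
    """
    assignments = []
    for x in tokens:
        # One router logit per expert: dot product of the token with that expert's row.
        logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
        probs = softmax(logits)
        # Keep only the single most probable expert.
        expert = max(range(len(probs)), key=probs.__getitem__)
        assignments.append((expert, probs[expert]))
    return assignments


def switch_layer(tokens, router_weights, experts):
    """Run each token through its one chosen expert, scaled by the gate probability."""
    outputs = []
    for x, (expert, gate) in zip(tokens, switch_route(tokens, router_weights)):
        y = experts[expert](x)  # only one expert is evaluated per token
        outputs.append([gate * v for v in y])
    return outputs


# Toy setup (assumed for illustration): two experts that scale their input.
router_weights = [[1.0, 0.0], [0.0, 1.0]]
experts = [lambda x: [1.0 * v for v in x], lambda x: [10.0 * v for v in x]]
tokens = [[2.0, 0.0], [0.0, 3.0]]

assignments = switch_route(tokens, router_weights)
outputs = switch_layer(tokens, router_weights, experts)
```

Because each token evaluates exactly one expert, adding more experts increases the parameter count without increasing the work done per token, which is the sense in which compute per token stays constant.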
Why it matters
Made sparse, mixture-of-experts scaling practical and influenced the architecture of many later frontier models.