ArchitectureTraining

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Google·January 11, 2021

William Fedus, Barret Zoph, Noam Shazeer

TL;DR

Simplifies mixture-of-experts routing to a single expert per token, scaling Transformers to trillions of parameters at constant compute per token.

Made sparse, mixture-of-experts scaling practical and influenced the architecture of many later frontier models.