Mixture of Experts
A sparse architecture that routes each token to a few specialized sub-networks.
In a Mixture-of-Experts (MoE) model, each layer contains many "expert" sub-networks, but a router activates only a small subset of them for each token. This lets the total parameter count grow with the number of experts while the compute per token stays roughly fixed, because that cost depends on how many experts are active, not on how many exist.
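The routing step can be illustrated with a minimal sketch, using only the Python standard library. Everything named here is an assumption for illustration and not taken from any particular model: the helper names (`make_expert`, `moe_layer`), the sizes (`DIM`, `NUM_EXPERTS`, `TOP_K`), experts reduced to a single linear map, and gate weights computed as a softmax over the selected experts' router scores, which is one common choice among several.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Hypothetical expert: one dim x dim linear map (real experts are small MLPs)."""
    w = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

    return expert


def moe_layer(token, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    # Router: one score (logit) per expert, computed from the token itself.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep only the top_k highest-scoring experts; the rest are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Assumption: gate weights are a softmax over the chosen experts' logits.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, i in zip(gates, chosen):
        y = experts[i](token)
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out, chosen, gates


rng = random.Random(0)
DIM, NUM_EXPERTS, TOP_K = 4, 8, 2  # illustrative sizes only

experts = [make_expert(DIM, rng) for _ in range(NUM_EXPERTS)]
router_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(DIM)]

output, chosen, gates = moe_layer(token, router_weights, experts, TOP_K)
print("experts used:", chosen)
print("gate weights:", gates)
```

In this sketch only `TOP_K` of the `NUM_EXPERTS` experts run for the token, so raising `NUM_EXPERTS` adds parameters without adding expert computation per token; only the router's scoring step grows with the expert count.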
MoE is used in several capable and cost-efficient recent models, including Mixtral, Gemini 1.5, and DeepSeek-V3.