Mixture of Experts

A sparse architecture that routes each token to a few specialized sub-networks.

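In a Mixture-of-Experts (MoE) model, MoE layers contain many "expert" sub-networks (in Transformers, these typically take the place of the feed-forward block), and a learned router activates only a small subset of them for each token. Total parameter count can therefore grow with the number of experts while compute per token stays roughly constant, because that cost depends on how many experts are active rather than how many exist.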
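
The routing step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any particular model: it assumes top-k routing with softmax gates renormalised over the selected experts, two-layer ReLU experts, and no load balancing or capacity limits. The names `moe_layer`, `router_w`, `expert_w1`, and `expert_w2` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moe_layer(tokens, router_w, expert_w1, expert_w2, top_k=2):
    """Sparse MoE forward pass (illustrative sketch).

    tokens:    (n_tokens, d_model)
    router_w:  (d_model, n_experts)
    expert_w1: (n_experts, d_model, d_hidden)
    expert_w2: (n_experts, d_hidden, d_model)
    """
    n_experts = router_w.shape[1]
    logits = tokens @ router_w                          # (n_tokens, n_experts)

    # Indices of the top_k highest-scoring experts for each token.
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]   # (n_tokens, top_k)
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Assumption: gates are a softmax over the selected experts only.
    gates = softmax(top_logits, axis=-1)                # (n_tokens, top_k)

    out = np.zeros_like(tokens)
    for e in range(n_experts):
        # Tokens that selected expert e, and the slot it occupies for each.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # this expert does no work for this batch
        hidden = np.maximum(tokens[token_ids] @ expert_w1[e], 0.0)  # ReLU
        out[token_ids] += gates[token_ids, slot][:, None] * (hidden @ expert_w2[e])
    return out, top_idx


rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden, n_experts = 4, 8, 16, 8
tokens = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))
expert_w1 = rng.normal(size=(n_experts, d_model, d_hidden))
expert_w2 = rng.normal(size=(n_experts, d_hidden, d_model))

out, chosen = moe_layer(tokens, router_w, expert_w1, expert_w2, top_k=2)
print(out.shape, chosen.shape)  # (4, 8) (4, 2)
```

In this sketch each token passes through only 2 of the 8 experts, so adding more experts increases the parameter count without increasing the number of expert matrix multiplications per token.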

MoE is used in some of the most capable and cost-efficient recent models, including Mixtral, Gemini 1.5, and DeepSeek-V3.

Related papers