Transformer

The neural-network architecture behind virtually all modern language models.

The Transformer is a neural network architecture built around self-attention rather than recurrence or convolution. By processing all tokens in a sequence in parallel and letting every token attend to every other, it trains far more efficiently on modern hardware and captures long-range dependencies well.

Introduced in 2017, it is the foundation of essentially every large language model in use today — GPT, Claude, Gemini, and Llama are all Transformer variants.

Related papers