Transformer
The neural-network architecture behind virtually all modern language models.
The Transformer is a neural network architecture built around self-attention rather than recurrence or convolution. By processing all tokens in a sequence in parallel and letting every token attend to every other, it trains far more efficiently on modern hardware and captures long-range dependencies well.
Introduced in 2017, it is the foundation of essentially every large language model in use today — GPT, Claude, Gemini, and Llama are all Transformer variants.
Related papers
- Attention Is All You Need
- The Llama 3 Herd of Models
- GPT-4 Technical Report
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Improving Language Understanding by Generative Pre-Training