Attention

A mechanism that lets a model weigh how much every other token matters when processing a given token.

Attention computes, for each token, a weighted blend of information from other tokens. The weights ("attention scores") reflect how relevant each other token is to the token being processed. Self-attention applies this within a single sequence, so each token attends to the tokens of that same sequence.
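
The weighted blend described above can be illustrated with a minimal sketch in plain Python. It makes several assumptions that are not part of this entry: each token's vector serves as its own query, key, and value (real Transformer layers apply learned projections first), relevance is measured with a scaled dot product, and the scores are normalized with softmax. The function names and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Simplified self-attention over a list of token vectors.

    Assumption: queries, keys, and values are the token vectors
    themselves (no learned projections), purely for illustration.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Relevance of every token in the sequence to the current one.
        scores = [dot(query, key) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        # Weighted blend of all token vectors, one component at a time.
        blended = [
            sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "tokens" represented as 2-dimensional vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

In this toy run each row of `weights` sums to 1, and the first token gives more weight to itself and to the third token, whose vectors overlap with its own, than to the second token.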

It is the core building block of the Transformer and a key reason Transformer-based models handle context and long-range relationships so effectively.

Related papers