FlashAttention

An exact, IO-aware attention algorithm that runs faster and uses less memory than standard attention.

FlashAttention reorders the attention computation to minimize reads and writes to the GPU's large but comparatively slow high-bandwidth memory (HBM). It computes exact attention, not an approximation, without ever materializing the full attention matrix. The result is large speedups and memory savings, which in turn make longer context windows practical.
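To make "exact attention without materializing the full attention matrix" concrete, here is a minimal pure-Python sketch of the underlying idea: process keys and values one block at a time while keeping only a running maximum, a running softmax denominator, and a running output per query. This is an illustration under simplifying assumptions, not the actual GPU kernel; the names `naive_attention`, `tiled_attention`, and `block_size` are made up for this example, and the sketch shows only that the result stays exact, not the speedups that come from doing this work in fast on-chip memory.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: builds the full row of scores for each query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Blockwise version: never holds more than block_size scores per query.

    block_size is a hypothetical parameter for this sketch; a real kernel
    would pick tile sizes to fit on-chip memory and would also tile queries.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")  # running max of all scores seen so far
        l = 0.0            # running softmax denominator
        acc = [0.0] * dv   # running unnormalized output
        for start in range(0, len(K), block_size):
            Kb = K[start:start + block_size]
            Vb = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new max.
            # On the first block this is exp(-inf) == 0.0.
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, Vb))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.5, -1.0]]
    K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
    V = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 1.0, 1.0],
         [0.5, 0.5, 0.5], [2.0, 2.0, 2.0]]
    print(naive_attention(Q, K, V))
    print(tiled_attention(Q, K, V, block_size=2))
```

Both functions should agree up to floating-point rounding for any block size, because the rescaling step only factors the same softmax sum differently. The blockwise version trades a little extra arithmetic for never storing a full row of scores, which is the property the paragraph above describes.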

It is now a default building block for training and serving Transformer models.

Related papers