Architecture

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Stanford · May 27, 2022

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

TL;DR

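An IO-aware algorithm that computes exact attention in tiles, cutting reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM; this yields wall-clock speedups and memory use that grows linearly, rather than quadratically, with sequence length.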

A systems advance that quietly enabled longer contexts and cheaper training across the field; now a standard component.
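
To make the "exact attention, fewer memory operations" idea concrete, here is a minimal pure-Python sketch of the running-softmax trick that tiling relies on. It is an illustration under stated assumptions, not the authors' kernel: the function names (`naive_attention`, `tiled_attention`) and the `block_size` parameter are hypothetical, only keys and values are tiled, and nothing here models SRAM, HBM, or a fused GPU kernel. The point is only that processing keys and values block by block gives the same result as the full softmax, without ever holding a full row of scores.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: builds the full score row for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Illustrative tiled version (hypothetical helper, not the paper's code).

    For each query it keeps only a running max, a running softmax
    denominator, and an unnormalized output accumulator, and updates
    them one key/value block at a time.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new max.
            # On the first block this is exp(-inf) == 0.0.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


# Small made-up inputs: 2 queries, 5 keys/values (5 is not a multiple of 2,
# so the last block is shorter than block_size).
Q = [[1.0, 0.0], [0.5, -1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 1.0, 1.0],
     [0.5, 0.5, 0.5], [2.0, 2.0, -1.0]]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=2)
max_diff = max(abs(a - b)
               for row_r, row_t in zip(reference, tiled)
               for a, b in zip(row_r, row_t))
assert max_diff < 1e-9  # same answer up to floating-point rounding
```

The design choice worth noting is the `correction` factor: whenever a later block raises the running max, earlier partial sums are rescaled rather than recomputed, which is what lets the computation stay exact while touching each key/value block once per query.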