Architecture
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Stanford · May 27, 2022
Tri Dao, Daniel Y. Fu, Stefano Ermon
TL;DR
An IO-aware algorithm that computes exact attention with far fewer reads and writes to GPU high-bandwidth memory, making attention faster and reducing its memory use from quadratic to linear in sequence length.
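To make the idea concrete, here is a minimal pure-Python sketch of block-wise attention with a running (online) softmax, compared against a straightforward reference. It only illustrates how exact attention can be computed without ever holding a full row of scores. It is not the paper's GPU kernel and does not model the GPU memory hierarchy. The function names, the `block_size` parameter, and the toy inputs are assumptions made for this example.

```python
import math

def naive_attention(Q, K, V):
    """Reference: compute every score for a query, then softmax, then weight V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_size=2):
    """Visit K/V one block at a time, keeping only a running max,
    a running normalizer, and an unnormalized output per query."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new max (0.0 on the first block).
            scale = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * scale + sum(weights)
            acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out

# Toy inputs (hypothetical): 2 queries, 5 keys/values, head dimension 2.
Q = [[1.0, 0.0], [0.5, -1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=2)
```

The two results agree up to floating-point rounding, which is the sense in which the block-wise computation is exact rather than an approximation. In this sketch, the working state per query is a fixed-size set of values regardless of how many keys there are.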
Why it matters
A systems-level advance that quietly enabled longer contexts and cheaper training across the field; it is now a standard component.