FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Stanford · May 27, 2022

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

TL;DR

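An IO-aware algorithm that computes exact (not approximate) attention in tiles, so the full N×N attention matrix is never written to GPU high-bandwidth memory (HBM). Cutting reads and writes between HBM and on-chip SRAM gives large wall-clock speedups and memory use that grows linearly, rather than quadratically, with sequence length.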
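
The block-by-block idea can be illustrated with a small NumPy sketch. It is not the paper's fused CUDA kernel, and it runs on the CPU, so it shows none of the IO savings; it only shows why processing keys and values in blocks with a running softmax still gives the exact result. The function names and the block_size parameter are hypothetical, chosen for illustration.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference attention that materializes the full (n_q, n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, block_size=4):
    """Exact attention computed one key/value block at a time.

    Only per-row running statistics (max and sum) and the output
    accumulator are kept, so the full score matrix is never stored.
    block_size is an illustrative parameter, not a tuned value.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[-1]))
    row_max = np.full((n_q, 1), -np.inf)  # running max of scores per query row
    row_sum = np.zeros((n_q, 1))          # running softmax denominator per row

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale    # shape (n_q, rows in this block)

        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)  # rescales earlier accumulators
        p = np.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((5, 8))
    k = rng.standard_normal((10, 8))
    v = rng.standard_normal((10, 3))
    # The tiled result should match the reference up to floating-point error.
    print(np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v)))
```

The rescaling by correction is what keeps the result exact: each time a block raises a row's running maximum, the earlier partial sums are brought onto the same scale before the new block is added.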

Why it matters

A systems advance that quietly enabled longer contexts and cheaper training across the field; now a standard component.