Native Sparse Attention (NSA)

A hardware-aligned, end-to-end trainable sparse attention mechanism built on hierarchical token compression and selection.

NSA combines coarse-grained token compression with fine-grained token selection in a dynamic hierarchical strategy. It is designed to be trained from scratch rather than only applied at inference, and it is implemented with hardware-friendly kernels.
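
To make the coarse-then-fine idea concrete, here is a minimal sketch for a single query vector, written in plain Python. It is an illustration under stated assumptions, not the NSA implementation: mean pooling stands in for whatever compression the method actually learns, the two outputs are mixed with a fixed 50/50 average instead of a learned gate, causal masking is ignored, and nothing here resembles a hardware-aligned kernel. The names sparse_attend, block_size, and top_k are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Plain scaled dot-product attention for one query."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def mean_pool(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sparse_attend(query, keys, values, block_size=4, top_k=2):
    """Coarse compression plus fine selection (illustrative only)."""
    # Coarse stage: split the context into contiguous blocks and compress
    # each block into one key/value pair. ASSUMPTION: mean pooling.
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]
    comp_keys = [mean_pool([keys[i] for i in b]) for b in blocks]
    comp_values = [mean_pool([values[i] for i in b]) for b in blocks]
    coarse_out = attend(query, comp_keys, comp_values)

    # Fine stage: rank blocks by the query's score against the compressed
    # keys, keep the top_k blocks, and attend over their original tokens.
    block_scores = [dot(query, k) for k in comp_keys]
    chosen = sorted(range(len(blocks)),
                    key=lambda b: block_scores[b], reverse=True)[:top_k]
    chosen.sort()  # restore positional order of the selected blocks
    token_ids = [i for b in chosen for i in blocks[b]]
    fine_out = attend(query,
                      [keys[i] for i in token_ids],
                      [values[i] for i in token_ids])

    # ASSUMPTION: fixed equal-weight mix instead of a learned gate.
    out = [0.5 * c + 0.5 * f for c, f in zip(coarse_out, fine_out)]
    return out, chosen

if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[-1.0, 0.0]] * 2 + [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 2
    values = [[float(i)] for i in range(6)]
    out, chosen = sparse_attend(query, keys, values, block_size=2, top_k=1)
    print("selected blocks:", chosen, "output:", out)
```

The point of the sketch is the cost structure: the query scores one compressed key per block, then pays full attention cost only on the tokens inside the few blocks it selected, so the number of keys touched grows with the block count plus top_k * block_size rather than with the full context length.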

It matches or beats full attention on long-context tasks while running substantially faster, and it seeded the wave of trainable sparse-attention work that culminated in DeepSeek’s DSA (DeepSeek Sparse Attention).

Related papers