DeepSeek Sparse Attention (DSA)

A fine-grained, trainable sparse attention mechanism built on Multi-head Latent Attention (MLA), using a lightning indexer plus token selection.

DSA adds a lightweight “lightning indexer” that scores how relevant each past token is to the current query token, then runs attention over only the selected, fine-grained subset of top-scoring tokens. It is instantiated on top of MLA.
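The two-stage flow described above (score past tokens, keep a subset, attend only to that subset) can be illustrated with a minimal sketch for a single query token. It assumes the indexer score takes the form of a weighted sum over a few indexer heads of ReLU(query · key), and that selection is a plain top-k over those scores. All function and parameter names here (`indexer_scores`, `select_top_k`, `sparse_attention`, `q_idx`, `k_idx`, `w`) are hypothetical, and the sketch uses ordinary dense vectors rather than MLA's latent representation, so it should be read as an illustration of the idea, not as the model's implementation.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def indexer_scores(q_idx, w, k_idx):
    """Score every past token for one query token (assumed indexer form).

    q_idx: one small query vector per indexer head
    w:     one scalar weight per indexer head
    k_idx: one small key vector per past token
    Returns score[s] = sum_j w[j] * relu(dot(q_idx[j], k_idx[s])).
    """
    scores = []
    for key in k_idx:
        total = 0.0
        for q_head, w_head in zip(q_idx, w):
            total += w_head * max(0.0, dot(q_head, key))
        scores.append(total)
    return scores


def select_top_k(scores, k):
    """Return the positions of the k highest scores, in ascending position order."""
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(q, keys, values, selected):
    """Softmax attention for one query, restricted to the selected positions."""
    scale = math.sqrt(len(q))
    logits = [dot(q, keys[i]) / scale for i in selected]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    dim = len(values[0])
    out = [sum(p * values[i][c] for p, i in zip(probs, selected)) for c in range(dim)]
    return out, probs
```

For example, with two indexer heads `q_idx = [[1.0, 0.0], [0.0, 1.0]]`, weights `w = [1.0, 0.25]`, and four past tokens with indexer keys `[[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0], [2.0, 2.0]]`, the scores are `[1.0, 0.5, 0.0, 2.5]`, so `select_top_k(scores, 2)` keeps positions `[0, 3]` and the main attention only touches those two tokens. The cost of the main attention then scales with the number of selected tokens rather than the full context length, while the indexer itself still visits every past token with a much cheaper computation.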

It substantially cuts long-context training and inference cost while keeping output quality essentially unchanged. It shipped in a frontier open model (DeepSeek-V3.2), making it the production realization of trainable fine-grained sparse attention.

Related papers