Grouped-Query Attention

An attention variant in which groups of query heads share key/value heads to speed up inference.

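Grouped-query attention (GQA) divides the query heads into groups, and each group shares a single key head and value head. Because only keys and values are cached, the KV cache, and the memory bandwidth needed to read it at inference, shrinks by the ratio of query heads to key/value heads, typically with little quality loss compared with standard multi-head attention. It is widely used in modern LLMs designed for efficient inference.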
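
As a rough illustration of the memory saving, the sketch below compares KV cache size for a standard multi-head layout against a grouped-query layout. The function names, the model dimensions, and the 2-byte (16-bit) value size are all hypothetical and chosen only for the example. Assigning contiguous blocks of query heads to each key/value head is a common convention, but it is assumed here rather than taken from any particular implementation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for a batch of sequences."""
    # Factor of 2: one cached tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the key/value head that a given query head reads from.

    Assumes contiguous grouping: query heads 0..g-1 share KV head 0, and so on.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Hypothetical configuration, for illustration only.
NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096
NUM_KV_HEADS = 8  # each KV head is shared by 32 // 8 = 4 query heads

# Multi-head attention: one KV head per query head.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped-query attention: only NUM_KV_HEADS key/value heads are cached.
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(mha_bytes // gqa_bytes)  # 4
```

Under these assumed dimensions, the cache for one 4096-token sequence drops from 2 GiB to 512 MiB, matching the 4:1 ratio of query heads to key/value heads. Setting the number of key/value heads equal to the number of query heads recovers multi-head attention, and setting it to 1 gives multi-query attention.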