Grouped-Query Attention
An attention variant in which groups of query heads share key/value heads to speed up inference.
Grouped-query attention (GQA) partitions the query heads into groups, with each group sharing a single key head and value head. Because fewer key/value heads have to be stored and read, the KV cache and the memory bandwidth needed at inference shrink, with little quality loss. It is widely used in efficient modern LLMs.
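
The head sharing described above can be illustrated with a minimal pure-Python sketch. It is not taken from any particular library. The function names (`kv_head_for`, `grouped_query_attention`, `kv_cache_bytes`), the contiguous assignment of query heads to groups, the absence of masking and learned projections, and the 2-byte (fp16) element size are all assumptions made for illustration.

```python
import math

def kv_head_for(query_head, n_q_heads, n_kv_heads):
    """Index of the K/V head used by `query_head` (contiguous grouping is assumed)."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return query_head // group_size

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_query_attention(q, k, v):
    """Unmasked attention where several query heads read the same K/V head.

    q: [n_q_heads][seq][d]; k, v: [n_kv_heads][seq][d] (nested lists of floats).
    Returns a nested list shaped like q.
    """
    n_q_heads, n_kv_heads = len(q), len(k)
    d = len(q[0][0])
    out = []
    for h in range(n_q_heads):
        g = kv_head_for(h, n_q_heads, n_kv_heads)  # shared K/V head for this query head
        head_out = []
        for q_vec in q[h]:
            # Scaled dot product of this query against every key of the shared head.
            scores = [sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d)
                      for k_vec in k[g]]
            weights = softmax(scores)
            # Weighted sum of the shared head's value vectors.
            head_out.append([sum(w * v_vec[j] for w, v_vec in zip(weights, v[g]))
                             for j in range(d)])
        out.append(head_out)
    return out

def kv_cache_bytes(n_kv_heads, head_dim, seq_len, n_layers, bytes_per_elem=2):
    """Bytes for cached keys plus values (the factor 2); fp16 elements assumed by default."""
    return 2 * n_kv_heads * head_dim * seq_len * n_layers * bytes_per_elem
```

With 8 query heads and 2 key/value heads (hypothetical sizes), `[kv_head_for(h, 8, 2) for h in range(8)]` gives `[0, 0, 0, 0, 1, 1, 1, 1]`, and `kv_cache_bytes(8, 128, 4096, 32) // kv_cache_bytes(2, 128, 4096, 32)` gives `4`: under these assumptions the cache shrinks by the ratio of query heads to key/value heads. Setting the number of key/value heads equal to the number of query heads recovers ordinary multi-head attention, and setting it to 1 gives multi-query attention.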