KV Cache

Cached attention keys and values that let a model generate each new token without recomputing the whole sequence.

During generation, the attention keys/values for past tokens are stored and reused, turning generation from quadratic into incremental work. The KV cache grows with context length and often dominates inference memory, motivating techniques like grouped-query attention and quantized caches.