# AI KV Cache
An optimization that stores previously computed key and value tensors from the [[AI Attention|attention mechanism]] during autoregressive generation. Without it, the model would recompute the keys and values of every previous token at each generation step; with it, only the new token's key and value are computed and appended.
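A minimal sketch of the idea, assuming a single attention head and plain Python lists instead of tensors. The names `KVCache`, `decode_step_cached`, and `decode_step_uncached` are hypothetical and not taken from any library. It shows that appending one key/value pair per step gives the same attention output as re-projecting the whole prefix.

```python
import math

def project(x, W):
    """Return x @ W for a vector x and a matrix W given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over cached keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

class KVCache:
    """Hypothetical single-head cache: one key and one value vector per token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step_cached(x, Wq, Wk, Wv, cache):
    # Only the newest token is projected; older keys/values are reused.
    cache.append(project(x, Wk), project(x, Wv))
    return attend(project(x, Wq), cache.keys, cache.values)

def decode_step_uncached(xs, Wq, Wk, Wv):
    # Without a cache, every token in the prefix is re-projected each step.
    keys = [project(x, Wk) for x in xs]
    values = [project(x, Wv) for x in xs]
    return attend(project(xs[-1], Wq), keys, values)
```

Both functions return the same vector at every step; the cached version does one key/value projection per step, while the uncached version does one per token in the prefix.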
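KV cache trades memory for speed: the attention cost of generating each new token drops from O(n^2) to O(n) in the sequence length n, but the cache itself grows linearly with sequence length (and with batch size, layer count, and number of key-value heads). For long contexts, the KV cache can consume tens of gigabytes of GPU memory.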
KV cache size is a primary constraint on long-context inference. A model might support 128K tokens in theory, but the KV cache memory required to actually serve that context at reasonable batch sizes can be prohibitive.
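A rough sizing sketch makes the constraint concrete. It assumes a standard layout with one key and one value vector per token, per layer, per KV head. The function name `kv_cache_bytes` and the model shape in the example are illustrative assumptions, not the configuration of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 covers keys and values; bytes_per_elem=2
    corresponds to FP16/BF16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

GIB = 1024 ** 3

# Hypothetical shape: 32 layers, 32 KV heads of dimension 128, FP16,
# one sequence of 128K (131072) tokens.
full = kv_cache_bytes(32, 32, 128, 128 * 1024)
print(full / GIB)  # 64.0

# Same shape with 8 KV heads (grouped-query attention): 4x smaller.
gqa = kv_cache_bytes(32, 8, 128, 128 * 1024)
print(gqa / GIB)   # 16.0
```

Because every factor enters linearly, doubling the batch size or the context length doubles the cache, which is why serving long contexts at useful batch sizes gets expensive quickly.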
Techniques to manage KV cache size:
- **Paged attention** (vLLM): manages the KV cache in fixed-size blocks, like pages in virtual memory, which largely eliminates memory fragmentation
- **Sliding window attention**: only caches the most recent N tokens
- **KV cache quantization**: storing keys/values in lower precision (e.g., FP8)
- **Multi-query / grouped-query attention (MQA/GQA)**: shares key-value heads across query heads, reducing cache size
- **Sparse attention with token-wise compression**: as in DeepSeek Sparse Attention (DSA); see [[DeepSeek v4]], which cuts KV cache size to ~10% of DeepSeek V3.2 at the same context length
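As one illustration from the list above, sliding window caching can be sketched with a bounded queue that evicts the oldest entry once N tokens are stored. The class name `SlidingWindowKVCache` is hypothetical, and real implementations typically use preallocated ring buffers on the GPU rather than Python deques.

```python
from collections import deque

class SlidingWindowKVCache:
    """Hypothetical cache that keeps only the most recent `window` tokens."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest item when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = SlidingWindowKVCache(window=3)
for token_id in range(5):
    # Placeholder payloads stand in for real key/value vectors.
    cache.append(("k", token_id), ("v", token_id))

print(len(cache))        # 3
print(list(cache.keys))  # [('k', 2), ('k', 3), ('k', 4)]
```

Memory stays bounded by the window size regardless of sequence length, at the cost of the model no longer attending directly to tokens that have been evicted.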
Directly relevant to [[Context Window]] management and [[Context Compression]] strategies.
## Related
- [[Large Language Models (LLMs)]]
- [[Context Window]]
- [[Context Compression]]
- [[AI Attention]]
- [[Transformers]]