AI KV Cache - DeveloPassion

# AI KV Cache

An optimization that stores previously computed key and value tensors from the [[AI Attention|attention mechanism]] during autoregressive generation. Without it, the model would recompute the keys and values of every previous token each time it generates a new token.

KV cache trades memory for speed: it reduces the attention cost of each new token from O(n^2) to O(n) in the sequence length n, but the cache itself grows linearly with sequence length. For long contexts, the KV cache can consume tens of gigabytes of GPU memory.

KV cache size is a primary constraint on long-context inference. A model might support 128K tokens in theory, but the KV cache memory required to actually serve that context at reasonable batch sizes can be prohibitive.

Techniques to manage it:

- **Paged attention** (vLLM): treats KV cache like virtual memory, eliminating fragmentation
- **Sliding window attention**: only caches the most recent N tokens
- **KV cache quantization**: storing keys/values in lower precision (e.g., FP8)
- **Multi-query / grouped-query attention (MQA/GQA)**: shares key-value heads across query heads, reducing cache size
- **Sparse attention with token-wise compression**: as in DeepSeek Sparse Attention (DSA); see [[DeepSeek v4]], which cuts KV cache size to ~10% of DeepSeek V3.2 at the same context length

Directly relevant to [[Context Window]] management and [[Context Compression]] strategies.

## References

-

## Related

- [[Large Language Models (LLMs)]]
- [[Context Window]]
- [[Context Compression]]
- [[AI Attention]]
- [[Transformers]]
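
## Example: estimating KV cache size

To make the memory cost described in this note concrete, here is a minimal back-of-the-envelope sketch. It assumes a standard decoder-only transformer where every layer caches one key and one value vector per KV head per token. The function name `kv_cache_bytes` and the model shape (32 layers, 32 heads, head dimension 128) are hypothetical and chosen only for illustration; they do not describe any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes (illustrative estimate only).

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem is 2 for FP16/BF16 and 1 for FP8.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3
SEQ_LEN = 128 * 1024  # a 128K-token context

# Hypothetical model shape: 32 layers, head_dim 128.
mha_fp16 = kv_cache_bytes(32, 32, 128, SEQ_LEN)                   # one KV head per query head
gqa_fp16 = kv_cache_bytes(32, 8, 128, SEQ_LEN)                    # GQA: 8 shared KV heads
gqa_fp8 = kv_cache_bytes(32, 8, 128, SEQ_LEN, bytes_per_elem=1)   # GQA + FP8 quantization
window = kv_cache_bytes(32, 8, 128, min(SEQ_LEN, 4096))           # sliding window of 4096 tokens

for label, size in [("MHA, FP16", mha_fp16), ("GQA, FP16", gqa_fp16),
                    ("GQA, FP8", gqa_fp8), ("GQA, FP16, 4K window", window)]:
    print(f"{label}: {size / GIB:.1f} GiB per sequence")
```

Under these assumptions, a single 128K-token sequence needs 64 GiB with full multi-head attention in FP16, 16 GiB with 8 shared KV heads, and 8 GiB when those are also stored in FP8. Because the estimate scales linearly with `batch_size`, serving several long sequences at once multiplies these figures accordingly.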