AI Sparse Attention - DeveloPassion

# AI Sparse Attention A family of [[AI Attention|attention]] mechanisms that compute interactions between only a subset of token pairs instead of every pair, breaking the O(n²) cost of full self-attention. The dense attention used in vanilla [[Transformers]] forces every token to attend to every other token; sparse attention restricts that interaction graph by structure (fixed patterns) or by content (learned routing), trading a small drop in modelling power for very large gains in speed and memory. Sparse attention is the architectural move that makes million-token [[Context Window|context windows]] economically viable — both at training time (FLOPs) and at inference time ([[AI KV Cache|KV cache]] size). ## Why it matters Full attention scales O(n²) in compute and O(n) in KV cache per layer. At a 1M-token context, those costs dominate everything else. Sparse attention attacks both axes: - **Compute**: each token attends to k ≪ n neighbours, so per-layer cost drops from O(n²) toward O(n·k) or O(n·log n). - **Memory**: only the kept keys/values need to live in the cache; in some schemes the cache itself is compressed. The trade-off is information loss when the dropped pairs would have mattered. The whole design space is "which pairs can I safely skip?" ## Three families - **Pattern-based (fixed sparsity)**. Each token attends to a deterministic subset: sliding window (Longformer), local + global landmarks (BigBird), strided / dilated (Sparse Transformer), block-diagonal. Cheap and predictable but blind to content. - **Content-based (learned sparsity)**. The model learns which pairs matter. Routing transformers cluster tokens; Reformer uses LSH; Native Sparse Attention and DeepSeek Sparse Attention (DSA, see [[DeepSeek v4]]) compress and route at the token level. More expensive logic per layer, but far better quality at the same sparsity ratio. - **Approximated / linearised attention**. Performer, Linformer, and Linear Attention rewrite the softmax so attention becomes O(n) via kernel approximation or low-rank projection. Technically not "sparse" but pursues the same goal — drop the quadratic — through a different door. ## Relationship to MoE Sparse attention and [[AI Mixture of Experts (MoE)|MoE]] are both forms of [[Sparse AI Models|sparsity]] but operate on different axes. MoE is sparse in the **parameter** dimension: only a few experts fire per token. Sparse attention is sparse in the **sequence** dimension: only a few token pairs interact per layer. Modern frontier open models (DeepSeek v4, Native Sparse Attention models, Kimi K2) stack the two — sparse experts plus sparse attention — to compound the efficiency gains. ## What this unlocks - Million-token contexts that were technically supported but economically impractical. - Long-context [[AI Inference|inference]] on smaller deployments, because the KV cache stops dominating GPU memory. - Better cost / quality positioning for [[AI Open Weight Models|open-weights models]] vs closed frontier APIs — most of DeepSeek v4's price advantage traces back to DSA. ## Limitations - Quality on long-range, content-dependent tasks (multi-hop retrieval over a 500K-token context) can lag full attention; benchmarks like RULER and InfiniteBench expose this. - Pattern-based schemes underperform when the structure is wrong for the data. - Content-based schemes add training complexity and can be hard to load-balance, similar to MoE routing failure modes. ## References - ## Related - [[AI Attention]] - [[Transformers]] - [[AI KV Cache]] - [[Context Window]] - [[AI Mixture of Experts (MoE)]] - [[Sparse AI Models]] - [[Dense AI Models]] - [[AI Inference]] - [[Large Language Models (LLMs)]] - [[DeepSeek v4]]