# Sparse AI Models

Sparse models contain many parameters but activate only a subset of them for each input token. The most common implementation is [[AI Mixture of Experts (MoE)]], where a router selects which expert sub-networks process each token. This contrasts with [[Dense AI Models]], which use all parameters on every forward pass.

Key benefit: better performance per FLOP. As an illustration, a sparse model with 100B total parameters that activates 20B per token can match a 70B dense model while being faster at inference. This decouples model capacity (total knowledge) from inference cost (active compute).

Trade-offs:

- All parameters must still fit in memory, even though only a fraction are active per token.
- Sparse models are harder to quantize effectively.
- Routing adds complexity, and poor load balancing during training can leave some experts undertrained.

## Examples

| Model | Total Params | Active Params | Experts |
|-------|-------------|---------------|---------|
| [[Gemma 4]] 26B A4B | 25.2B | 3.8B | 128 total, 8 active |
| Mixtral 8x7B | 46.7B | 12.9B | 8 total, 2 active |
| Mixtral 8x22B | 141B | 39B | 8 total, 2 active |
| DeepSeek-V3 | 671B | 37B | 256 total, 8 active |
| Grok-1 | 314B | ~86B | 8 total, 2 active |
| [[GPT4\|GPT-4]] (rumored MoE) | ~1.8T | ~280B | 16 total, 2 active |

## References

- https://en.wikipedia.org/wiki/Mixture_of_experts

## Related

- [[Dense AI Models]]
- [[AI Mixture of Experts (MoE)]]
- [[Large Language Models (LLMs)]]
- [[Transformers]]
- [[Deep Learning]]
- [[AI Scaling Laws]]
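
## Sketch: top-k routing

The routing step described at the top of this note can be illustrated with a small example. This is a minimal sketch, assuming a softmax router with top-k selection and renormalized weights, which is a common MoE design but not necessarily what any model listed here uses. The names `route_token`, `moe_forward`, `router_logits`, and `k` are hypothetical, and the "experts" are toy scalar functions standing in for feed-forward sub-networks.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Select the top-k experts for one token.

    Returns a list of (expert_index, weight) pairs whose weights are
    renormalized to sum to 1 (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    # Indices of the k highest-probability experts; ties keep index order.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs.

    Experts that are not selected are never called, which is where the
    compute saving comes from.
    """
    return sum(w * experts[i](x) for i, w in route_token(router_logits, k))


# Toy setup: 4 experts, each just scales its input by a constant.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

selected = route_token(logits, k=2)
y = moe_forward(1.0, experts, logits, k=2)
```

Only `k` of the experts run for each token, so active compute scales with `k` while total capacity scales with the number of experts. All expert weights still have to be resident in memory, because a different token may route elsewhere.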