# Sparse AI Models
Sparse models contain many parameters but only activate a subset for each input token. The most common implementation is [[AI Mixture of Experts (MoE)]], where a router selects which expert sub-networks process each token. This contrasts with [[Dense AI Models]], which use all parameters on every forward pass.
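The routing step can be sketched in a few lines. This is a toy top-k router in plain Python, not any particular model's implementation: the names `top_k_route` and `moe_layer`, the hand-written router logits, and the "experts" that merely scale their input are all hypothetical stand-ins for learned components.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)  # unselected experts are never called
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out

# Hypothetical experts: each one just scales the token vector.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# In a real model the logits come from a learned linear layer on the token.
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
mixed = moe_layer([1.0, 1.0], logits, experts, k=2)
```

With these logits, experts 1 and 3 tie for the top score, so each gets weight 0.5 and the other two experts do no work for this token.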
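Key benefit: better quality per FLOP. As an illustration, a sparse model with 100B total parameters that activates 20B per token spends roughly the per-token compute of a 20B dense model, yet may match a dense model of around 70B while being faster at inference; the exact equivalence depends on architecture, training data, and routing quality. This decouples model capacity (total parameters, a rough proxy for stored knowledge) from inference cost (active compute).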
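The arithmetic behind the 100B / 20B / 70B illustration can be made explicit. The sketch below assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, which ignores attention cost over long contexts. The function name `compare_sparse_to_dense` is hypothetical.

```python
def compare_sparse_to_dense(total_params, active_params, dense_params):
    """Compare a sparse model with a dense baseline using rough rules of thumb."""
    # Assumption: forward-pass cost is ~2 FLOPs per *active* parameter per token.
    sparse_flops = 2 * active_params
    dense_flops = 2 * dense_params
    return {
        # Share of the sparse model's weights used for any one token.
        "active_fraction": active_params / total_params,
        # Per-token compute of the sparse model relative to the dense one.
        "compute_ratio": sparse_flops / dense_flops,
        # Weights that must be stored, relative to the dense one.
        "weight_ratio": total_params / dense_params,
    }

result = compare_sparse_to_dense(100e9, 20e9, 70e9)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the sparse model uses about 0.29x the per-token compute of the 70B dense model while storing about 1.43x as many weights.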
Trade-offs: all parameters must still fit in memory even though only a fraction are active per token, so memory footprint tracks total rather than active parameters. Sparse models are also harder to quantize effectively. Routing adds complexity, and poor load balancing during training can leave some experts undertrained.
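One common way to discourage poor load balancing is an auxiliary loss added during training. The sketch below follows the form used in the Switch Transformer paper (number of experts times the sum over experts of token fraction times mean router probability), shown here for top-1 routing. It is an illustration only, not something this note's models are confirmed to use, and the name `load_balance_loss` and the example inputs are hypothetical.

```python
from collections import Counter

def load_balance_loss(assignments, router_probs, num_experts):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    assignments:  expert index chosen for each token
    router_probs: per-token list of router probabilities over all experts
    Returns num_experts * sum_i(f_i * p_i), where f_i is the fraction of
    tokens sent to expert i and p_i is the mean router probability for i.
    """
    n_tokens = len(assignments)
    counts = Counter(assignments)
    f = [counts.get(i, 0) / n_tokens for i in range(num_experts)]
    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced case: four tokens spread evenly over four experts.
balanced = load_balance_loss(
    assignments=[0, 1, 2, 3],
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    num_experts=4,
)

# Collapsed case: every token goes to expert 0 with high confidence.
collapsed = load_balance_loss(
    assignments=[0, 0, 0, 0],
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    num_experts=4,
)
```

The loss is smallest when tokens and router probability are spread uniformly, and it grows as routing collapses onto a few experts, which is the situation that leaves the remaining experts undertrained.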
## Examples
| Model | Total Params | Active Params | Experts |
|-------|-------------|---------------|---------|
| [[Gemma 4]] 26B A4B | 25.2B | 3.8B | 128 total, 8 active |
| Mixtral 8x7B | 46.7B | 12.9B | 8 total, 2 active |
| Mixtral 8x22B | 141B | 39B | 8 total, 2 active |
| DeepSeek-V3 | 671B | 37B | 256 routed + 1 shared, 8 routed active |
| Grok-1 | 314B | ~86B | 8 total, 2 active |
| [[GPT4\|GPT-4]] (rumored MoE) | ~1.8T | ~280B | 16 total, 2 active |
## References
- https://en.wikipedia.org/wiki/Mixture_of_experts
## Related
- [[Dense AI Models]]
- [[AI Mixture of Experts (MoE)]]
- [[Large Language Models (LLMs)]]
- [[Transformers]]
- [[Deep Learning]]
- [[AI Scaling Laws]]