# AI Mixture of Experts (MoE)

An architecture in which a model contains multiple specialized sub-networks ("experts") and a routing mechanism that activates only a subset of them for each input. This is [[Sparse AI Models|sparse activation]]: the model has a very large total parameter count but uses only a fraction of those parameters per inference step, unlike [[Dense AI Models]], which use all parameters on every forward pass.

Benefits: better performance per compute dollar and faster inference than dense models of comparable quality. For example, a 100B-parameter MoE model that activates 20B per token can match or beat a 70B dense model while being cheaper to run.

MoE is used in Mixtral, DeepSeek (see [[DeepSeek v4]]: 1.6T / 49B active), Grok, [[Gemma 4]] (26B A4B variant: 128 experts, 8 active per token), and reportedly GPT-4.

Trade-offs: a larger total parameter count means more memory (all experts must be loaded even if only some are active), the model is harder to quantize effectively, and routing decisions add complexity. Load balancing across experts during training is a known challenge: poorly balanced routing leaves some experts undertrained.

The memory trade-off can be mitigated through [[AI Expert Offloading]]: streaming expert weights from SSD into RAM on demand, exploiting the fact that only a few experts are active per token. This makes it possible to run very large MoE models on consumer hardware.

MoE and [[AI Sparse Attention]] are complementary: MoE is sparse in the **parameter** dimension (few experts fire per token), while sparse attention is sparse in the **sequence** dimension (few token pairs interact per layer). Modern frontier open models stack both to compound the efficiency gains.

## Related

- [[Dense AI Models]]
- [[Sparse AI Models]]
- [[Large Language Models (LLMs)]]
- [[Transformers]]
- [[Deep Learning]]
- [[AI Expert Offloading]]
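
## Example: top-k routing sketch

The routing idea in this note (score every expert, run only a few, mix their outputs) can be illustrated with a small plain-Python sketch. Everything here is an assumption made for illustration: the function names `route_top_k`, `moe_forward`, and `make_scaling_expert`, the sizes (8 experts, 2 active, 4-dimensional tokens), and the choice to apply softmax over only the selected experts' logits, which is one common variant rather than the method of any model named in this note. Real MoE layers use learned feed-forward experts, batched tensor operations, and usually an auxiliary load-balancing loss during training.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, router, experts, k):
    """Run one token vector x through a toy MoE layer.

    router:  one weight vector per expert; logit = dot(router[e], x)
    experts: one callable per expert, mapping a vector to a vector
    Only the k selected experts are evaluated; the rest are skipped.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = route_top_k(logits, k)
    out = [0.0] * len(x)
    for idx, weight in chosen:
        y = experts[idx](x)  # the only expert computation performed
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in chosen]


def make_scaling_expert(scale):
    """Hypothetical stand-in for an expert FFN: multiplies the input by a constant."""
    return lambda v: [scale * vi for vi in v]


random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4  # illustrative sizes only
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_scaling_expert(e + 1) for e in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, active = moe_forward(token, router, experts, TOP_K)
print("active experts:", active)
print("output:", output)
```

In this toy setup all eight experts exist in memory, but each token pays the compute cost of only two of them, which is the total-versus-active parameter distinction described above.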