AI Expert Offloading - DeveloPassion

# AI Expert Offloading

Technique for running [[AI Mixture of Experts (MoE)]] models that are far larger than available RAM by streaming expert weights from SSD into memory on demand during [[AI Inference|inference]]. Since MoE models only activate a small subset of experts per token, the inactive experts can stay on disk. The SSD bandwidth is fast enough to load the needed experts before the next token is generated.

This exploits the fundamental property of [[Sparse AI Models|sparse architectures]]: if only 32B out of 1T parameters are active per token, you don't need 1T parameters in RAM. You just need the active experts plus the shared layers (attention, embedding).

Demonstrated examples (as of early 2026):

- Qwen3.5-397B (17B active) running on 48GB RAM
- [[Kimi K2.5]] (1T total, 32B active) running on a 96GB MacBook Pro at ~1.7 tok/s (the [[Kimi K2.6]] successor inherits the same MoE lineage)
- Qwen3.5-397B running on an iPhone at 0.6 tok/s

Expert offloading is complementary to [[AI Quantization]]. Quantization reduces the precision of weights to shrink their size; offloading moves weights between storage tiers. Combining both lets you run even larger models on constrained hardware.

This represents a significant shift for [[Running AI Models Locally|local AI inference]]: models previously requiring data-center hardware become accessible on consumer devices, trading speed for accessibility.

## References

- https://simonwillison.net/2026/Mar/24/streaming-experts/

## Related

- [[AI Mixture of Experts (MoE)]]
- [[AI Inference]]
- [[Running AI Models Locally]]
- [[Sparse AI Models]]
- [[AI Quantization]]
- [[Large Language Models (LLMs)]]
- [[Qwen]]
- [[AirLLM]]
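
## Illustrative sketch

The core idea of this note, keeping only the experts a token needs in RAM and fetching the rest from disk on demand, can be sketched as a small cache. This is a minimal illustration, not how any of the demonstrations listed in this note are implemented: `ExpertCache`, `load_from_disk`, and the least-recently-used eviction policy are all hypothetical names and choices made for this example. A real loader would read tensor slices from a model file rather than return placeholder values.

```python
from collections import OrderedDict
from typing import Callable, Hashable, List


class ExpertCache:
    """Keeps at most `capacity` experts in RAM; others are loaded on demand.

    Assumption: least-recently-used eviction. Real systems may use a
    different policy or rely on the operating system's page cache.
    """

    def __init__(self, capacity: int, load_from_disk: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_from_disk = load_from_disk  # hypothetical loader callback
        self._resident: "OrderedDict[Hashable, object]" = OrderedDict()
        self.disk_reads = 0  # counts cache misses, i.e. reads from storage

    def get(self, expert_id: Hashable) -> object:
        """Return the weights for one expert, reading from disk on a miss."""
        if expert_id in self._resident:
            # Hit: mark this expert as most recently used.
            self._resident.move_to_end(expert_id)
            return self._resident[expert_id]
        # Miss: stream the expert in from storage.
        weights = self.load_from_disk(expert_id)
        self.disk_reads += 1
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            # Evict the least recently used expert to stay within budget.
            self._resident.popitem(last=False)
        return weights

    def resident_ids(self) -> List[Hashable]:
        """Expert ids currently held in RAM, least recently used first."""
        return list(self._resident)


if __name__ == "__main__":
    # Placeholder loader: stands in for reading expert weights from SSD.
    cache = ExpertCache(capacity=2, load_from_disk=lambda i: f"weights-{i}")
    for routed_expert in [0, 1, 0, 2]:  # experts picked by the router per token
        cache.get(routed_expert)
    print(cache.disk_reads, cache.resident_ids())
```

With a capacity of two experts, the access pattern above triggers three disk reads and leaves experts 0 and 2 resident. The shared layers (attention, embedding) are not modeled here; they would stay in RAM permanently.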