AirLLM - DeveloPassion

# AirLLM

AirLLM is an open-source Python library that runs 70B-parameter [[Large Language Models (LLMs)]] on a single consumer GPU with as little as 4GB of VRAM, by executing the transformer one layer at a time. It targets the same constraint as [[AI Expert Offloading]] (fitting models that exceed available memory) but uses a different technique: **layer-wise inference** instead of expert streaming.

## Mechanism

A standard 70B model needs 140GB+ of VRAM to keep all weights resident at 16-bit precision. AirLLM exploits the fact that, during a forward pass, each transformer layer's output is the next layer's only input, so no other layer's weights are needed at that moment. So the library:

1. Loads layer N from disk into the GPU
2. Computes layer N's output
3. Frees layer N's weights from VRAM
4. Loads layer N+1
5. Repeats until the final layer

Peak VRAM usage for weights drops to roughly the size of one layer: about 1/80th of the full model for an 80-layer network, ~1.6GB for a 70B model.

## Key Features

- **AutoModel**: auto-detects model architecture, no need to specify a model class
- **Prefetching**: overlaps disk loading with compute (~10% speedup)
- **Compression**: optional weight compression for ~3x speedup
- **Multi-architecture support**: Llama, Mistral, ChatGLM, Qwen, Baichuan, InternLM
- **Hugging Face integration**: drop-in for `transformers`-style inference

## Trade-offs

- **Slow.** Per-token latency is dominated by SSD-to-VRAM transfer for every layer at every step. Acceptable for batch / offline use, painful for interactive chat.
- **Disk-bandwidth bound**: every token reads the full model from disk, which also puts a heavy sustained read load on the SSD.
- **Not a replacement for proper inference servers**: for production or interactive use, [[AI Quantization]] or [[AI Expert Offloading]] (for MoE models) are usually faster.

## Positioning vs Expert Offloading

[[AI Expert Offloading]] streams *some* weights (only active experts) per token and only works for [[AI Mixture of Experts (MoE)|MoE]] architectures. AirLLM streams *all* weights one layer at a time and works for any dense transformer. They're orthogonal techniques addressing the same memory ceiling from opposite directions: sparsity-aware vs architecture-agnostic.

## Use Cases

- Hobby / research access to 70B-class dense models on a single consumer GPU
- Offline batch inference where latency doesn't matter
- Teaching and demos of layer-wise execution mechanics

## References

- GitHub: https://github.com/lyogavin/airllm
- Hugging Face blog post: https://huggingface.co/blog/lyogavin/airllm

## Related

- [[AI Expert Offloading]]
- [[AI Quantization]]
- [[AI Inference]]
- [[Running AI Models Locally]]
- [[Sparse AI Models]]
- [[AI Mixture of Experts (MoE)]]
- [[Large Language Models (LLMs)]]
- [[AI Open Weight Models]]
- [[Ollama]]
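
The memory figures in the Mechanism section can be checked with a few lines of arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter) and 80 equally sized layers, and it ignores embeddings, activations, and the KV cache. `weight_footprint` is a hypothetical helper, not part of AirLLM.

```python
def weight_footprint(params: float, bytes_per_param: float, num_layers: int):
    """Return (total_bytes, per_layer_bytes) for a model with equally sized layers."""
    total_bytes = params * bytes_per_param
    per_layer_bytes = total_bytes / num_layers
    return total_bytes, per_layer_bytes


GB = 1e9       # decimal gigabyte
GIB = 2 ** 30  # binary gibibyte

# Assumptions: 70B parameters, 2 bytes each (FP16/BF16), 80 layers.
total, per_layer = weight_footprint(70e9, 2, 80)

print(f"full model: {total / GB:.0f} GB ({total / GIB:.1f} GiB)")
print(f"one layer:  {per_layer / GB:.2f} GB ({per_layer / GIB:.2f} GiB)")
```

Under these assumptions the full model comes to 140 GB (about 130 GiB) and a single layer to 1.75 GB, or about 1.63 GiB, which is in line with the ~1.6GB per-layer figure.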
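
The load, compute, free loop described under Mechanism can be illustrated with a toy model. This is not AirLLM's implementation: the "layers" are tiny matrices stored as JSON files, the "GPU" is ordinary Python memory, and every function name here is hypothetical. It only shows that a per-layer loop produces the same output as keeping all layers resident, while holding one layer's weights at a time.

```python
import json
import tempfile
from pathlib import Path


def save_layers(layers, directory):
    """Write each layer's weights to its own file (a stand-in for a per-layer checkpoint split)."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(directory) / f"layer_{i}.json"
        path.write_text(json.dumps(weights))
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy layer: y = ReLU(W x). Stands in for a full transformer block."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def layerwise_forward(paths, x):
    """Run the forward pass while keeping at most one layer's weights loaded."""
    resident = 0  # number of layers currently held in memory
    peak = 0      # highest value `resident` ever reached
    for path in paths:
        weights = json.loads(path.read_text())  # load layer N from disk
        resident += 1
        peak = max(peak, resident)
        x = apply_layer(weights, x)             # compute layer N's output
        del weights                             # free layer N before loading N+1
        resident -= 1
    return x, peak


layers = [
    [[1, 0], [0, 1]],
    [[2, 0], [0, 2]],
    [[1, 1], [1, -1]],
]
x0 = [1.0, 2.0]

# Reference result with every layer resident at once.
reference = x0
for layer_weights in layers:
    reference = apply_layer(layer_weights, reference)

with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(layers, tmp)
    out, peak = layerwise_forward(paths, x0)

print(out, peak)
```

The cost of this pattern is visible even in the toy version: every call to `layerwise_forward` re-reads every layer file, which is why per-token latency in the real setting is dominated by disk-to-VRAM transfer.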