# AirLLM
AirLLM is an open-source Python library that runs 70B-parameter [[Large Language Models (LLMs)]] on a single consumer GPU with as little as 4GB of VRAM, by executing the transformer one layer at a time. It targets the same constraint as [[AI Expert Offloading]] — fitting models that exceed available memory — but uses a different technique: **layer-wise inference** instead of expert streaming.
## Mechanism
A 70B model stored at 16-bit precision (2 bytes per parameter) needs about 140GB of VRAM just to keep all weights resident. AirLLM exploits the fact that, during a forward pass, each transformer layer's output is the next layer's only input, so no other layer's weights are needed at that moment. So the library:
1. Loads layer N from disk into the GPU
2. Computes layer N's output
3. Frees layer N's weights from VRAM
4. Loads layer N+1
5. Repeats until the final layer
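
The loop above can be sketched in plain Python. This is a toy illustration under stated assumptions, not AirLLM's implementation: the per-layer JSON files, the `apply_layer` function, and the `resident` counter are all hypothetical stand-ins for on-disk layer shards, a real transformer layer, and VRAM accounting.

```python
import json
import tempfile
from pathlib import Path

def save_layers(layers, directory):
    """Write each layer's weights to its own file, one shard per layer."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(directory) / f"layer_{i}.json"
        path.write_text(json.dumps(weights))
        paths.append(path)
    return paths

def apply_layer(weights, x):
    """Toy 'layer': a matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def layerwise_forward(paths, x):
    """Run the forward pass while holding only one layer's weights at a time."""
    resident = 0       # layers currently held in memory (illustrative counter)
    peak_resident = 0
    for path in paths:
        weights = json.loads(path.read_text())  # load layer N from disk
        resident += 1
        peak_resident = max(peak_resident, resident)
        x = apply_layer(weights, x)             # compute layer N's output
        del weights                             # free layer N before loading N+1
        resident -= 1
    return x, peak_resident

# Three tiny hypothetical layers, each a 2x2 weight matrix.
layers = [
    [[1.0, 0.5], [-1.0, 2.0]],
    [[0.5, 0.5], [1.0, -1.0]],
    [[2.0, 0.0], [0.0, 1.0]],
]

with tempfile.TemporaryDirectory() as tmp:
    shard_paths = save_layers(layers, tmp)
    output, peak = layerwise_forward(shard_paths, [1.0, 2.0])
```

The result matches a forward pass with every layer resident, but `peak` never exceeds one layer. The price is one disk read per layer per forward pass.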
Peak VRAM usage for weights drops to roughly the size of one layer: about 1/80th of the full model for an 80-layer 70B model, or ~1.6GB. Activations and the KV cache come on top of that.
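
The per-layer figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes 16-bit weights, 80 equally sized layers (as in Llama-2-70B-style models), and ignores embeddings. Reading the ~1.6GB figure as binary gigabytes (GiB) is an assumption made here, not something the source states.

```python
def per_layer_bytes(n_params, bytes_per_param, n_layers):
    """Return (total weight bytes, bytes per layer), assuming equal-size layers."""
    total = n_params * bytes_per_param
    return total, total / n_layers

# Assumed values: 70B parameters, 2 bytes each (fp16/bf16), 80 layers.
total, per_layer = per_layer_bytes(70_000_000_000, 2, 80)

total_gb = total / 1e9             # decimal gigabytes for the whole model
per_layer_gb = per_layer / 1e9     # decimal gigabytes per layer
per_layer_gib = per_layer / 2**30  # binary gigabytes per layer
```

Under these assumptions the full model is 140GB and one layer is 1.75GB, or about 1.63GiB.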
## Key Features
- **AutoModel**: detects the model architecture from the checkpoint, so no model class needs to be specified
- **Prefetching**: overlaps disk loading with compute (~10% speedup)
- **Compression**: optional 4-bit or 8-bit block-wise weight quantization, which shrinks the data read per layer (up to ~3x speedup)
- **Multi-architecture support**: Llama, Mistral, ChatGLM, Qwen, Baichuan, InternLM
- **Hugging Face integration**: a near drop-in for `transformers`-style inference (`from_pretrained`, `generate`)
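
Prefetching can be illustrated with a background thread that loads the next layer while the current one is being computed. This is a minimal sketch of the general idea, not AirLLM's code: `load_layer` and `compute` are hypothetical stand-ins that sleep to mimic disk I/O and GPU work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_layer(i):
    """Hypothetical stand-in for reading layer i from disk."""
    time.sleep(0.02)
    return i + 1  # toy 'weights': a single scale factor

def compute(weights, x):
    """Hypothetical stand-in for the layer's compute."""
    time.sleep(0.02)
    return x * weights

def forward_sequential(n_layers, x):
    """Load, then compute, one layer after another with no overlap."""
    for i in range(n_layers):
        x = compute(load_layer(i), x)
    return x

def forward_prefetch(n_layers, x):
    """Start loading layer i+1 in the background before computing layer i."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(n_layers):
            weights = pending.result()  # wait until layer i is loaded
            if i + 1 < n_layers:
                pending = pool.submit(load_layer, i + 1)  # begin loading i+1
            x = compute(weights, x)     # runs while the next load is in flight
    return x

sequential_result = forward_sequential(5, 1)
prefetch_result = forward_prefetch(5, 1)
```

Both functions return the same value; only the scheduling differs. Two layers are briefly resident at once, so prefetching trades a little memory for time. When loading takes far longer than compute, overlap can hide only the compute portion, which is consistent with the modest speedup quoted above.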
## Trade-offs
- **Slow.** Per-token latency is dominated by SSD-to-VRAM transfer, because every layer is reloaded at every decoding step. Acceptable for batch or offline use, painful for interactive chat.
- **Disk-bandwidth bound, with sustained SSD load.** Every generated token reads the full model from disk. The traffic is almost entirely reads, so the binding limit is storage read throughput rather than flash wear, which is driven mainly by writes.
- **Not a replacement for proper inference servers.** For production or interactive use, [[AI Quantization]] or, for MoE models, [[AI Expert Offloading]] is usually faster.
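
A rough lower bound on per-token latency follows from dividing the bytes read per token by the disk's read throughput. The sketch below is arithmetic only. The throughput figures (3.5GB/s for an NVMe drive, 0.5GB/s for a SATA SSD) and the assumption that 4-bit weights are a quarter the size of 16-bit weights are illustrative, not measurements.

```python
def min_seconds_per_token(model_bytes, read_bytes_per_second):
    """Lower bound on latency if every token requires reading the whole model."""
    return model_bytes / read_bytes_per_second

FP16_70B = 140e9          # bytes, 70B parameters at 2 bytes each
INT4_70B = FP16_70B / 4   # assumed size with 4-bit compression, ignoring metadata

NVME = 3.5e9              # assumed sequential read speed, bytes/s
SATA = 0.5e9              # assumed sequential read speed, bytes/s

nvme_fp16 = min_seconds_per_token(FP16_70B, NVME)
sata_fp16 = min_seconds_per_token(FP16_70B, SATA)
nvme_int4 = min_seconds_per_token(INT4_70B, NVME)
```

Under these assumptions the bound is 40 seconds per token on NVMe at 16-bit, 280 seconds on SATA, and 10 seconds on NVMe with 4-bit weights. Real runs also pay for compute and transfer overhead, and may benefit from the operating system's file cache, so this is only an order-of-magnitude guide.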
## Positioning vs Expert Offloading
[[AI Expert Offloading]] streams *some* weights (only active experts) per token and only works for [[AI Mixture of Experts (MoE)|MoE]] architectures. AirLLM streams *all* weights one layer at a time and works for any dense transformer. The two techniques attack the same memory ceiling in different ways: one exploits sparsity, the other is architecture-agnostic.
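
The difference in data moved per token can be made concrete with a small calculation. This is a sketch with hypothetical numbers: the MoE configuration (8 experts per layer, 2 active, experts holding 90% of the weights) is invented for illustration, and the expert-offloading figure is a worst case that assumes no active expert is already cached and all non-expert weights stay resident.

```python
def layerwise_streamed_fraction():
    """Layer-wise inference reloads every weight for every token."""
    return 1.0

def expert_offload_streamed_fraction(active_experts, total_experts, expert_share):
    """Worst-case fraction of the model streamed per token with expert offloading.

    expert_share is the fraction of all weights that live in expert blocks;
    the remaining weights are assumed to stay resident in VRAM.
    """
    return expert_share * active_experts / total_experts

# Hypothetical MoE: 2 of 8 experts active, experts hold 90% of the weights.
moe_fraction = expert_offload_streamed_fraction(2, 8, 0.9)
dense_fraction = layerwise_streamed_fraction()
```

With these made-up values, expert offloading streams at most about 22.5% of the model per token, while layer-wise inference streams 100%.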
## Use Cases
- Hobby / research access to 70B-class dense models on a single consumer GPU
- Offline batch inference where latency doesn't matter
- Teaching and demos of layer-wise execution mechanics
## References
- GitHub: https://github.com/lyogavin/airllm
- Hugging Face blog post: https://huggingface.co/blog/lyogavin/airllm
## Related
- [[AI Expert Offloading]]
- [[AI Quantization]]
- [[AI Inference]]
- [[Running AI Models Locally]]
- [[Sparse AI Models]]
- [[AI Mixture of Experts (MoE)]]
- [[Large Language Models (LLMs)]]
- [[AI Open Weight Models]]
- [[Ollama]]