# Knowledge Distillation
Knowledge distillation is a model compression technique in which a smaller "student" model is trained to replicate the behavior of a larger "teacher" model. The student learns from the teacher's output probability distributions (soft labels) rather than from hard ground-truth labels alone.
## How It Works
1. The teacher model (a large, high-capability [[AI Foundation Models|foundation model]]) generates predictions with full probability distributions
2. These "soft targets" encode richer information than hard (one-hot) labels; they capture inter-class similarities the teacher has learned, such as which wrong answers are nearly right
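3. The student model (often a [[Small Language Models (SLMs)|SLM]]) is trained to match these soft distributions, typically using a temperature-scaled softmax and a KL-divergence loss; a temperature above 1 softens both distributions so that low-probability classes still contribute to the training signal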
4. The result: a compact model that retains much of the teacher's knowledge while being cheaper to run at [[AI Inference]] time
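The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the `alpha` weighting between the soft and hard terms, and the example logits are assumptions made for this note. The `temperature ** 2` scaling of the soft term follows the suggestion in Hinton et al. (2015).

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    alpha (hypothetical knob) sets how much weight the teacher's soft
    targets get relative to the ground-truth label.
    """
    # Soft term: match the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Hard term: ordinary cross-entropy against the true label at T = 1.
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * soft + (1 - alpha) * hard

# Illustrative logits for a 3-class problem (made-up numbers).
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.0, 1.5, 0.2]
loss = distillation_loss(student_logits, teacher_logits, true_label=0)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the same loss would be computed over batches with an autodiff framework and minimized with respect to the student's parameters; the teacher's logits are treated as fixed.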
## Why It Matters
- Enables deployment of capable models on resource-constrained hardware
- Reduces [[AI Cost Management|inference costs]] while preserving much of the teacher's quality
- Complements other efficiency techniques: [[AI Quantization]], [[Low Rank Adapter (LoRA)|LoRA]], pruning
- Key enabler for [[Small Language Models (SLMs)]] and on-device AI
- Can be combined with [[Synthetic Data]] generation where the teacher produces training examples
## Relation to Other Techniques
| Technique | What it compresses | Tradeoff |
|-----------|-------------------|----------|
| Knowledge Distillation | Model knowledge into smaller architecture | Extra training cost up front for inference savings |
| [[AI Quantization]] | Numerical precision (e.g. 32-bit to 4-bit) | Slight quality loss for major speed/size gains |
| [[Low Rank Adapter (LoRA)]] | Fine-tuning parameter count | Keeps base model frozen; adapter is small |
| Pruning | Weight count (removes low-importance weights) | Can lose capability if too aggressive |
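Because distillation shrinks the parameter count and quantization shrinks the bits per parameter, the two savings multiply. The back-of-the-envelope sketch below shows the arithmetic; the 7B teacher and 1B student sizes are hypothetical, and the estimate covers weights only (no activations, caches, or quantization metadata).

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

TEACHER_PARAMS = 7_000_000_000  # hypothetical teacher size
STUDENT_PARAMS = 1_000_000_000  # hypothetical distilled student size

teacher_fp32 = weight_memory_gb(TEACHER_PARAMS, 32)  # no compression
student_fp32 = weight_memory_gb(STUDENT_PARAMS, 32)  # distillation only
student_4bit = weight_memory_gb(STUDENT_PARAMS, 4)   # distillation + 4-bit quantization

for name, gb in [("teacher, 32-bit", teacher_fp32),
                 ("student, 32-bit", student_fp32),
                 ("student, 4-bit", student_4bit)]:
    print(f"{name}: {gb:.1f} GB")
```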
## The PKM Parallel
The concept has a striking parallel in [[Personal Knowledge Management (PKM)]]. The journey from raw notes to refined knowledge follows the same teacher-to-student dynamic:
- **Raw capture** (the "teacher"): books, articles, conversations, experiences contain vast, unstructured knowledge
- **[[Progressive summarization]]**: each pass through your notes extracts and condenses the most essential information, just as distillation compresses a large model's knowledge
- **[[Atomic notes]]**: the final distilled form. Each note captures one idea in its most compressed, reusable form, analogous to a small model that captures one capability efficiently
- **[[Knowledge Graph (KG)]]**: the connections between distilled notes create a network that's greater than the sum of its parts, similar to how a well-distilled model captures relationships, not just facts
The [[Zettelkasten method]] is essentially a manual knowledge distillation pipeline: read widely (large teacher), extract ideas (distill), write atomic notes (student model), and connect them (knowledge graph).
The risk in both domains is the same: distillation loses nuance. A compressed model can't do everything the original could. A summarized note may lose the context that made the original insight valuable. The [[Natural tension between compression and context]] applies equally to neural networks and notebooks.
## References
- Hinton, G., Vinyals, O., Dean, J. (2015). "Distilling the Knowledge in a Neural Network". arXiv:1503.02531
## Related
- [[AI Foundation Models]]
- [[Small Language Models (SLMs)]]
- [[AI Quantization]]
- [[AI Fine-Tuning]]
- [[AI Scaling Laws]]
- [[Dense AI Models]]
- [[Sparse AI Models]]
- [[Large Language Models (LLMs)]]
- [[Synthetic Data]]
- [[Machine Learning (ML)]]
- [[Deep Learning]]
- [[Personal Knowledge Management (PKM)]]
- [[Progressive summarization]]
- [[Atomic notes]]
- [[Zettelkasten method]]
- [[Knowledge Graph (KG)]]
- [[Natural tension between compression and context]]
- [[Context Compression]]
- [[Knowledge Decay]]
- [[AI Open Weight Models]]
- [[AI Speculative Decoding]]
- [[Progressive Distillation]]