# AI Quantization
Quantization is the technique of reducing the numerical precision of model weights (e.g., from 32-bit floating point to 8-bit or 4-bit integers) to decrease model size, memory usage, and inference cost while preserving most of the model's capability. It makes it possible to run large models on consumer hardware.
Common formats and methods:
- **GGUF**: popular format for CPU/hybrid inference (used by llama.cpp)
- **GPTQ**: post-training quantization optimized for GPU inference
- **AWQ**: activation-aware weight quantization that preserves important weights
- **INT8/INT4**: standard integer precision levels
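
To make the integer precision levels concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustrative assumption, not the algorithm used by GGUF, GPTQ, or AWQ, which add refinements such as per-group scales and calibration data. The function names `quantize_symmetric` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.53, 0.98, -1.27, 0.04]

q8, s8 = quantize_symmetric(weights, bits=8)   # q8 -> [12, -53, 98, -127, 4]
q4, s4 = quantize_symmetric(weights, bits=4)   # q4 -> [1, -3, 5, -7, 0]

print("INT8:", q8, "restored:", dequantize(q8, s8))
print("INT4:", q4, "restored:", dequantize(q4, s4))
```

With only 15 usable levels, the INT4 codes collapse the smallest weight to zero, while INT8 keeps it distinct. This is the kind of information loss that methods such as AWQ try to limit for the weights that matter most.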
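Trade-off: lower precision gives a smaller and usually faster model, at the cost of some quality degradation that becomes pronounced at extreme quantization (2-3 bit). The sweet spot for most use cases is 4-bit quantization (Q4), which is often cited as retaining roughly 95% of full-precision quality while needing about a quarter of the weight memory of 16-bit (an eighth of 32-bit). The exact figure depends on the model, the quantization method, and the benchmark.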
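
The precision side of this trade-off can be sketched by measuring how far weights move when rounded to fewer bits. The example below is a rough illustration under several assumptions: it uses simple symmetric per-tensor rounding, synthetic Gaussian weights from a fixed seed, and mean absolute reconstruction error as a stand-in for quality. Reconstruction error is only a proxy; real model quality is measured on benchmarks. The helper name `quantization_error` is hypothetical.

```python
import random


def quantization_error(weights, bits):
    """Mean absolute error after a symmetric quantize/dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    restored = [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Synthetic weights (an assumption; real weight distributions differ).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]

errors = {bits: quantization_error(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.4f}")
```

The error grows as the bit width shrinks and rises most sharply at 2-3 bits, where only a handful of levels are left to represent every weight.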
Quantization is one of the most impactful optimization techniques for AI inference. It often decides whether a [[Large Language Models (LLMs)|large language model]] fits in the memory of a given piece of hardware.
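
A back-of-the-envelope check shows why. The sketch below estimates weight memory as parameters × bits / 8 and compares it with available memory. The 7-billion-parameter model, the 8 GB device, and the 20% `overhead_fraction` (a placeholder for activations, KV cache, and quantization metadata) are all assumptions for illustration, and the function names are hypothetical.

```python
def weight_memory_gb(n_params, bits):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


def fits_in_memory(n_params, bits, available_gb, overhead_fraction=0.2):
    """Rough fit check; overhead_fraction is an assumed allowance, not a measured value."""
    required = weight_memory_gb(n_params, bits) * (1 + overhead_fraction)
    return required <= available_gb


n_params = 7_000_000_000  # hypothetical 7B-parameter model

for bits in (32, 16, 8, 4):
    size = weight_memory_gb(n_params, bits)
    ok = fits_in_memory(n_params, bits, available_gb=8)
    print(f"{bits:>2}-bit: {size:5.1f} GB of weights, fits in 8 GB: {ok}")

results = {bits: fits_in_memory(n_params, bits, available_gb=8) for bits in (32, 16, 8, 4)}
```

Under these assumptions only the 4-bit variant fits: the 8-bit weights take 7 GB, which leaves too little headroom once the overhead allowance is added.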
## References
## Related
- [[Large Language Models (LLMs)]]
- [[AI Inference]]
- [[Knowledge Distillation]]
- [[Edge AI]]
- [[On-Device Machine Learning]]
- [[Neural Processing Unit (NPU)]]
- [[Gemini Nano]]
- [[ONNX]]
- [[ONNX Runtime Web]]