# AI Quantization

Quantization is the technique of reducing the numerical precision of model weights (e.g., from 32-bit float to 8-bit or 4-bit integers) to decrease model size, memory usage, and inference cost while preserving most of the model's capability. It enables running large models on consumer hardware.

Common formats and methods:

- **GGUF**: popular file format for CPU and hybrid CPU/GPU inference (used by llama.cpp)
- **GPTQ**: post-training quantization method optimized for GPU inference
- **AWQ**: activation-aware weight quantization, which preserves the weights that matter most to the activations
- **INT8/INT4**: standard integer precision levels (8-bit and 4-bit)

Trade-off: lower precision gives a smaller and faster model at the cost of some quality degradation, which becomes pronounced at extreme quantization (2-3 bit). The sweet spot for most use cases is 4-bit quantization (Q4), which is often cited as retaining roughly 95% of full-precision quality at a fraction of the memory; the exact figure varies by model, quantization method, and benchmark.

Quantization is one of the most impactful optimization techniques for AI inference. It often determines whether a [[Large Language Models (LLMs)|large language model]] can run on a given piece of hardware.

## Related

- [[Large Language Models (LLMs)]]
- [[AI Inference]]
- [[Knowledge Distillation]]
- [[Edge AI]]
- [[On-Device Machine Learning]]
- [[Neural Processing Unit (NPU)]]
- [[Gemini Nano]]
- [[ONNX]]
- [[ONNX Runtime Web]]
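## Example: quantization sketch

To make the precision/size trade-off described in this note concrete, here is a minimal sketch of symmetric "absmax" quantization in plain Python. It is an illustration only and is not the algorithm used by GGUF, GPTQ, or AWQ, which generally use per-block or per-group scales and more careful weight selection. The function names, the sample weights, and the 7-billion-parameter model size are all assumptions chosen for the example.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric (absmax) quantization of a list of floats to signed integers.

    Assumption for illustration: one scale for the whole tensor.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [v * scale for v in q]


def weight_memory_gib(n_params, bits):
    """Approximate weight storage only (ignores scales, activations, KV cache)."""
    return n_params * bits / 8 / 2**30


# Hypothetical sample weights, not taken from any real model.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]

q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)

# Worst-case reconstruction error at each precision.
err8 = max(abs(a - b) for a, b in zip(weights, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(weights, dequantize(q4, s4)))

print("int8 levels:", q8, "max error:", err8)
print("int4 levels:", q4, "max error:", err4)

# Weight memory for a hypothetical 7-billion-parameter model.
for bits in (32, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(7_000_000_000, bits):.2f} GiB")
```

In this sketch the rounding error is bounded by half the scale, so the 4-bit version reconstructs the weights less accurately than the 8-bit one while needing half the storage. Weight memory scales linearly with bit width, which is why a model that does not fit in memory at 32-bit may fit at 4-bit.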