GPT-Generated Unified Format (GGUF)

# GPT-Generated Unified Format (GGUF)

GGUF is the binary file format used to store [[Large Language Models (LLMs)]] for local inference, introduced by the [[llama.cpp]] project as the successor to the older [[Georgi Gerganov Machine Learning (GGML)|GGML]] format. A single `.gguf` file bundles everything needed to run a model: the (usually quantized) weights, the tensor layout, rich key-value metadata, and the chat template. That self-contained design is why it became the de facto distribution format for [[AI Open Weight Models|open-weight]] models you run yourself.

## Why it matters

- **One file, no surprises**: metadata and template travel with the weights, so a runtime can load a model with minimal configuration
- **Quantization-first**: GGUF is where low-bit quants live (e.g. Q4_K_M, and the 2-bit dynamic quants used to run big models like [[GLM-5.2]] on a Mac)
- **Ecosystem default**: produced by tools like [[Unsloth]], consumed by [[llama.cpp]], [[Ollama]], [[Mistral.rs]], and [[Docker Model Runner]]

## Related

- [[Georgi Gerganov Machine Learning (GGML)]]
- [[Safetensors]]
- [[ONNX]]
- [[Georgi Gerganov]]
- [[llama.cpp]]
- [[Large Language Models (LLMs)]]
- [[AI Open Weight Models]]
- [[Ollama]]
- [[Unsloth]]
- [[Mistral.rs]]
- [[Docker Model Runner]]
- [[GLM-5.2]]
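
## Sketch: reading the header

To make the "one file" idea concrete, here is a minimal sketch of reading the fixed-size header at the start of a `.gguf` file. It assumes the layout used by GGUF versions 2 and 3 as commonly documented: a 4-byte `GGUF` magic, a `uint32` version, then two `uint64` counts (tensors and metadata key-value pairs), all little-endian. Version 1 files and big-endian files are not handled. The function name `read_gguf_header` and the dictionary keys are illustrative choices, not part of any library.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream):
    """Read the fixed-size GGUF header from a binary stream.

    Assumes GGUF v2/v3 layout, little-endian:
    4-byte magic, uint32 version, uint64 tensor count, uint64 KV count.
    """
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # "<" disables padding, so I + Q + Q is exactly 4 + 8 + 8 = 20 bytes.
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration; the counts here are made up.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
print(read_gguf_header(fake_file))
```

For a real model, the same function can be called on a file opened with `open(path, "rb")`. The metadata key-value pairs (including the chat template) and the tensor descriptors follow this header, which is what lets a runtime configure itself from the file alone.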