# AI Inference
The process of running a trained AI model to generate predictions or outputs from new inputs. Distinct from training, which builds the model. Inference is where cost, latency, and throughput matter for production use.
Key optimization techniques:
- **Quantization**: reducing the numerical precision of weights (for example, from 16-bit floats to 8-bit or 4-bit integers) to cut memory use and speed up computation
- **KV caching**: storing previously computed key-value pairs to avoid redundant computation during autoregressive generation
- **[[Speculative Decoding]]**: using a smaller draft model to propose tokens, then verifying them in parallel with the larger model. Co-designed variants like [[AI Multi-Token Prediction Drafters]] (e.g. [[Gemma 4]]) push speedups to as much as 3× by sharing the target's KV cache and activations.
- **Batching**: processing multiple requests simultaneously to maximize GPU utilization
- **[[AI Expert Offloading|Expert offloading]]**: streaming MoE expert weights from SSD on demand, enabling models larger than available RAM
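To make the quantization entry in the list above concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names (`quantize_int8`, `dequantize`) and the sample weights are hypothetical, chosen for illustration only; production runtimes typically use per-channel or per-group scales and fused low-precision kernels rather than anything this simple.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Illustrative only: a single scale is shared by the whole tensor.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, not taken from any real model.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of two or four, at the price of a rounding error of at most half a quantization step (`scale / 2`).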
Inference cost = tokens processed × cost per token. For [[Large Language Models (LLMs)]], the amount of the [[Context Window]] a request fills directly affects inference cost, since every token in the context must be processed. Managing the [[Token Budget]] is essential for cost-effective deployment.
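As a rough illustration of the cost formula above, the sketch below applies it to a multi-turn conversation. It assumes that every request resends the full history, that nothing is cached between requests, and that all tokens are billed at one flat rate. The price constant and turn sizes are hypothetical, and real pricing often distinguishes input from output tokens.

```python
def tokens_processed(turn_tokens):
    """Total tokens processed when every request resends the full history."""
    total = 0
    history = 0
    for new_tokens in turn_tokens:
        history += new_tokens  # the context grows by this turn's tokens
        total += history       # the whole context is processed again
    return total


def inference_cost(tokens, price_per_token):
    """Inference cost = tokens processed x cost per token."""
    return tokens * price_per_token


# Hypothetical flat price, in micro-dollars per token (integers avoid float noise).
PRICE_MICRO_USD_PER_TOKEN = 2

turns = [100, 100, 100]  # new tokens added in each turn (assumed values)
processed = tokens_processed(turns)
cost = inference_cost(processed, PRICE_MICRO_USD_PER_TOKEN)
print(processed, cost)
```

Under these assumptions, three turns that add 300 new tokens in total cause 600 tokens to be processed, which is why trimming or summarizing history is a common way to manage a token budget.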
The balance between training and inference cost has shifted toward inference. For frontier models, inference now dominates total expenditure as deployment scales.
## References
## Related
- [[Large Language Models (LLMs)]]
- [[Token Budget]]
- [[Context Window]]
- [[On-Device Machine Learning]]
- [[WebNN API]]
- [[Browser-Provided Language Models]]
- [[Edge AI]]
- [[Edge Computing]]
- [[Neural Processing Unit (NPU)]]
- [[ONNX]]
- [[ONNX Runtime Web]]
- [[Transformers.js]]
- [[LLM Streaming]]
- [[Speculative Decoding]]
- [[AI Multi-Token Prediction Drafters]]
- [[MLX]]
- [[SGLang]]
- [[vLLM]]