# Running AI Models Locally
Running AI models on your own hardware instead of relying on cloud APIs. This gives you full control over your data, eliminates per-token costs, and removes dependency on external services.
Local inference became practical with [[AI Quantization]] techniques that compress models to run on consumer hardware, and with [[AI Open Weight Models]] that provide freely downloadable weights.
## Why run locally
- **Privacy**: data never leaves your machine
- **Cost**: no per-token API fees after hardware investment
- **Latency**: no network round-trips
- **Availability**: works offline, no rate limits
- **Experimentation**: swap models freely, test fine-tunes
## Key tools
- **[[Ollama]]**: CLI tool that makes downloading and running open models trivial. Pull a model, run it. Exposes an OpenAI-compatible API locally
- **[[LM Studio]]**: GUI application for browsing, downloading, and running models with a chat interface and local API server
## Expanding what runs locally
[[AI Expert Offloading]] enables running MoE models far larger than available RAM by streaming expert weights from SSD. Combined with [[AI Quantization]], this makes models with hundreds of billions of parameters accessible on consumer hardware, trading speed for accessibility.
## Trade-offs
Local models are typically smaller and less capable than frontier cloud models. [[Small Language Models (SLMs)]] are catching up fast, but for the most complex reasoning tasks, cloud APIs still lead. The sweet spot is using local models for privacy-sensitive tasks, high-volume workloads, and experimentation, while using cloud APIs for tasks requiring maximum capability.
## References
-
## Related
- [[Ollama]]
- [[LM Studio]]
- [[AI Open Weight Models]]
- [[AI Quantization]]
- [[Small Language Models (SLMs)]]
- [[Bring Your Own Key (BYOK)]]
- [[AI Inference]]
- [[AI Expert Offloading]]
- [[AI Foundation Models]]