# Running AI Models Locally Running AI models on your own hardware instead of relying on cloud APIs. This gives you full control over your data, eliminates per-token costs, and removes dependency on external services. Local inference became practical with [[AI Quantization]] techniques that compress models to run on consumer hardware, and with [[AI Open Weight Models]] that provide freely downloadable weights. ## Why run locally - **Privacy**: data never leaves your machine - **Cost**: no per-token API fees after hardware investment - **Latency**: no network round-trips - **Availability**: works offline, no rate limits - **Experimentation**: swap models freely, test fine-tunes ## Key tools - **[[Ollama]]**: CLI tool that makes downloading and running open models trivial. Pull a model, run it. Exposes an OpenAI-compatible API locally - **[[LM Studio]]**: GUI application for browsing, downloading, and running models with a chat interface and local API server ## Expanding what runs locally [[AI Expert Offloading]] enables running MoE models far larger than available RAM by streaming expert weights from SSD. Combined with [[AI Quantization]], this makes models with hundreds of billions of parameters accessible on consumer hardware, trading speed for accessibility. ## Trade-offs Local models are typically smaller and less capable than frontier cloud models. [[Small Language Models (SLMs)]] are catching up fast, but for the most complex reasoning tasks, cloud APIs still lead. The sweet spot is using local models for privacy-sensitive tasks, high-volume workloads, and experimentation, while using cloud APIs for tasks requiring maximum capability. ## References - ## Related - [[Ollama]] - [[LM Studio]] - [[AI Open Weight Models]] - [[AI Quantization]] - [[Small Language Models (SLMs)]] - [[Bring Your Own Key (BYOK)]] - [[AI Inference]] - [[AI Expert Offloading]] - [[AI Foundation Models]]