# Ollama A free and open source engine that can run [[Large Language Models (LLMs)]] locally. No data leaves your machine. Written in [[Go]], licensed under MIT. ## Key Features - **OpenAI-compatible API** — drop-in replacement for the OpenAI API (`/v1/chat/completions`, `/v1/embeddings`), so any OpenAI-compatible tool works out of the box - **Vision models** — supports multimodal models that can process images (e.g., LLaVA, Qwen 2.5 VL) - **Embedding models** — generate embeddings for RAG and vector search (e.g., `nomic-embed-text`, `mxbai-embed-large`) - **Tool/function calling** — structured output and function calling support for compatible models - **GPU acceleration** — NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal) support; multi-GPU spreading when a model doesn't fit on one card - **Concurrent model loading** — run multiple models simultaneously (default: 3 per GPU or 3 for CPU) - **Parallel request processing** — per-model parallel requests via `OLLAMA_NUM_PARALLEL` (scales context × parallel count) - **Flash Attention** — enable with `OLLAMA_FLASH_ATTENTION=1` to reduce memory at larger contexts - **KV cache quantization** — `OLLAMA_KV_CACHE_TYPE` supports `f16` (default), `q8_0` (~½ memory), `q4_0` (~¼ memory) - **Modelfile** — create custom models with system prompts, parameters, and adapter layers (similar to a Dockerfile) - **Configurable context window** — adjust via `num_ctx` parameter or `OLLAMA_CONTEXT_LENGTH` env var - **[[OpenClaw]] integration** — built-in launcher (`ollama launch openclaw`) that bridges messaging apps (WhatsApp, Telegram, Slack, Discord, iMessage) to local/cloud models ## Installation Download from the official Website: https://ollama.com/download. For example, on Linux, you just need to run the following command to get up and running: `curl -fsSL https://ollama.com/install.sh | sh` Once installed, Ollama will run a server on `http://localhost:11434`, and will serve any model you have installed over that API endpoint, making it a breeze for compatible tools to interact with those (e.g., the [[Companion plugin for Obsidian]], [[Transcriber plugin for Obsidian]]). TIP: If you want to install/use Ollama from WSL on Windows, you'll need to enable systemd by modifying the `/etc/wsl.conf` file to add `systemd=true` under the `[boot]` section, as explained here: https://learn.microsoft.com/en-us/windows/wsl/systemd#how-to-enable-systemd ## CLI Commands ```bash ollama pull <model> # Download a model ollama run <model> # Run (and pull if needed), supports multiline with """ ollama list # List installed models ollama show <model> # Show model details ollama rm <model> # Remove a model ollama cp <src> <dst> # Copy a model ollama ps # List running models (shows GPU/CPU split) ollama stop <model> # Unload a model from memory ollama create <name> # Create custom model from Modelfile ollama serve # Start the server (run `ollama serve --help` for env var list) ollama launch <app> # Launch integrations (openclaw, opencode, claude-code, codex, droid) ``` ## REST API - `POST /api/generate` — text generation (streaming) - `POST /api/chat` — chat completions (streaming) - `POST /api/embed` — generate embeddings - `GET /api/tags` — list installed models - `POST /api/pull` — pull a model - `DELETE /api/delete` — delete a model The OpenAI-compatible endpoints are available under `/v1/`. ## Environment Variables | Variable | Default | Purpose | |----------|---------|---------| | `OLLAMA_HOST` | `127.0.0.1:11434` | Bind address. Set to `0.0.0.0:11434` to expose on network | | `OLLAMA_MODELS` | OS-specific | Custom model storage directory | | `OLLAMA_KEEP_ALIVE` | `5m` | How long models stay loaded after last request. Duration strings (`10m`, `1h`), seconds, `-1` for forever, `0` for immediate unload | | `OLLAMA_CONTEXT_LENGTH` | `4096` | Default context window size for all models | | `OLLAMA_NUM_PARALLEL` | `1` | Max parallel requests per model (RAM scales by parallel × context) | | `OLLAMA_MAX_LOADED_MODELS` | `3 × GPUs` (or `3`) | Max models loaded concurrently | | `OLLAMA_MAX_QUEUE` | `512` | Max queued requests before rejecting with 503 | | `OLLAMA_FLASH_ATTENTION` | `0` | Set to `1` to enable Flash Attention (reduces memory at large contexts) | | `OLLAMA_KV_CACHE_TYPE` | `f16` | KV cache quantization: `f16`, `q8_0` (½ memory), `q4_0` (¼ memory) | | `OLLAMA_NO_CLOUD` | not set | Set to `1` to disable cloud features (cloud models, web search). Also settable via `disable_ollama_cloud` in `~/.ollama/server.json` | | `OLLAMA_ORIGINS` | `127.0.0.1`, `0.0.0.0` | Allowed CORS origins. Add `chrome-extension://*` etc. for browser extensions | | `HTTPS_PROXY` | not set | Proxy for model downloads (do NOT set `HTTP_PROXY`) | ### Setting environment variables - **macOS**: `launchctl setenv OLLAMA_HOST "0.0.0.0:11434"` then restart Ollama - **Linux** (systemd): `systemctl edit ollama.service`, add `Environment="OLLAMA_HOST=0.0.0.0:11434"` under `[Service]`, then `systemctl daemon-reload && systemctl restart ollama` - **Windows**: System Settings > Environment Variables, then restart Ollama ## Tips and Tricks - **Preload a model** for faster first response: `ollama run <model> ""` (sends empty prompt, loads into memory) - **Keep a model loaded longer**: set `OLLAMA_KEEP_ALIVE=1h` or use the `keep_alive` API parameter per request. Use `-1` to keep loaded indefinitely - **Check GPU usage**: `ollama ps` shows the GPU/CPU memory split per loaded model - **Disable cloud features**: set `OLLAMA_NO_CLOUD=1` to run fully local (loses cloud models and web search) - **WSL networking slow**: disable "Large Send Offload Version 2" (IPv4 and IPv6) on the vEthernet (WSL) adapter - **Docker GPU**: requires `nvidia-container-toolkit`; not available on macOS Docker Desktop (no GPU passthrough) - **Model storage locations**: macOS `~/.ollama/models`, Linux `/usr/share/ollama/.ollama/models`, Windows `C:\Users\%username%\.ollama\models` - **Proxy**: use `HTTPS_PROXY` only (Ollama doesn't use HTTP for model pulls) - **Disable auto-start**: Windows: Task Manager > Startup apps > disable Ollama. macOS: Settings > Login Items > disable Ollama ## References - Official Website: https://ollama.com/ - Documentation: https://docs.ollama.com - CLI reference: https://docs.ollama.com/cli - FAQ: https://docs.ollama.com/faq - List of models: https://ollama.com/search - Blog: https://ollama.com/blog - Source code: https://github.com/ollama/ollama - API documentation: https://github.com/ollama/ollama/blob/main/docs/api.md - Modelfile reference: https://github.com/ollama/ollama/blob/main/docs/modelfile.md - OpenClaw integration: https://docs.ollama.com/integrations/openclaw - Discord community: https://discord.com/invite/ollama ## Related - [[Large Language Models (LLMs)]] - [[OpenClaw]] - [[GLM OCR]] - [[Transcriber plugin for Obsidian]] - [[Companion plugin for Obsidian]]