# Ollama
A free and open source engine that can run [[Large Language Models (LLMs)]] locally. No data leaves your machine. Written in [[Go]], licensed under MIT.
## Key Features
- **OpenAI-compatible API** — drop-in replacement for the OpenAI API (`/v1/chat/completions`, `/v1/embeddings`), so any OpenAI-compatible tool works out of the box
- **Vision models** — supports multimodal models that can process images (e.g., LLaVA, Qwen 2.5 VL)
- **Embedding models** — generate embeddings for RAG and vector search (e.g., `nomic-embed-text`, `mxbai-embed-large`)
- **Tool/function calling** — structured output and function calling support for compatible models
- **GPU acceleration** — NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal) support; multi-GPU spreading when a model doesn't fit on one card
- **Concurrent model loading** — run multiple models simultaneously (default: 3 per GPU or 3 for CPU)
- **Parallel request processing** — per-model parallel requests via `OLLAMA_NUM_PARALLEL` (scales context × parallel count)
- **Flash Attention** — enable with `OLLAMA_FLASH_ATTENTION=1` to reduce memory at larger contexts
- **KV cache quantization** — `OLLAMA_KV_CACHE_TYPE` supports `f16` (default), `q8_0` (~½ memory), `q4_0` (~¼ memory)
- **Modelfile** — create custom models with system prompts, parameters, and adapter layers (similar to a Dockerfile)
- **Configurable context window** — adjust via `num_ctx` parameter or `OLLAMA_CONTEXT_LENGTH` env var
- **[[OpenClaw]] integration** — built-in launcher (`ollama launch openclaw`) that bridges messaging apps (WhatsApp, Telegram, Slack, Discord, iMessage) to local/cloud models
## Installation
Download from the official Website: https://ollama.com/download. For example, on Linux, you just need to run the following command to get up and running: `curl -fsSL https://ollama.com/install.sh | sh`
Once installed, Ollama will run a server on `http://localhost:11434`, and will serve any model you have installed over that API endpoint, making it a breeze for compatible tools to interact with those (e.g., the [[Companion plugin for Obsidian]], [[Transcriber plugin for Obsidian]]).
TIP: If you want to install/use Ollama from WSL on Windows, you'll need to enable systemd by modifying the `/etc/wsl.conf` file to add `systemd=true` under the `[boot]` section, as explained here: https://learn.microsoft.com/en-us/windows/wsl/systemd#how-to-enable-systemd
## CLI Commands
```bash
ollama pull <model> # Download a model
ollama run <model> # Run (and pull if needed), supports multiline with """
ollama list # List installed models
ollama show <model> # Show model details
ollama rm <model> # Remove a model
ollama cp <src> <dst> # Copy a model
ollama ps # List running models (shows GPU/CPU split)
ollama stop <model> # Unload a model from memory
ollama create <name> # Create custom model from Modelfile
ollama serve # Start the server (run `ollama serve --help` for env var list)
ollama launch <app> # Launch integrations (openclaw, opencode, claude-code, codex, droid)
```
## REST API
- `POST /api/generate` — text generation (streaming)
- `POST /api/chat` — chat completions (streaming)
- `POST /api/embed` — generate embeddings
- `GET /api/tags` — list installed models
- `POST /api/pull` — pull a model
- `DELETE /api/delete` — delete a model
The OpenAI-compatible endpoints are available under `/v1/`.
## Environment Variables
| Variable | Default | Purpose |
|----------|---------|---------|
| `OLLAMA_HOST` | `127.0.0.1:11434` | Bind address. Set to `0.0.0.0:11434` to expose on network |
| `OLLAMA_MODELS` | OS-specific | Custom model storage directory |
| `OLLAMA_KEEP_ALIVE` | `5m` | How long models stay loaded after last request. Duration strings (`10m`, `1h`), seconds, `-1` for forever, `0` for immediate unload |
| `OLLAMA_CONTEXT_LENGTH` | `4096` | Default context window size for all models |
| `OLLAMA_NUM_PARALLEL` | `1` | Max parallel requests per model (RAM scales by parallel × context) |
| `OLLAMA_MAX_LOADED_MODELS` | `3 × GPUs` (or `3`) | Max models loaded concurrently |
| `OLLAMA_MAX_QUEUE` | `512` | Max queued requests before rejecting with 503 |
| `OLLAMA_FLASH_ATTENTION` | `0` | Set to `1` to enable Flash Attention (reduces memory at large contexts) |
| `OLLAMA_KV_CACHE_TYPE` | `f16` | KV cache quantization: `f16`, `q8_0` (½ memory), `q4_0` (¼ memory) |
| `OLLAMA_NO_CLOUD` | not set | Set to `1` to disable cloud features (cloud models, web search). Also settable via `disable_ollama_cloud` in `~/.ollama/server.json` |
| `OLLAMA_ORIGINS` | `127.0.0.1`, `0.0.0.0` | Allowed CORS origins. Add `chrome-extension://*` etc. for browser extensions |
| `HTTPS_PROXY` | not set | Proxy for model downloads (do NOT set `HTTP_PROXY`) |
### Setting environment variables
- **macOS**: `launchctl setenv OLLAMA_HOST "0.0.0.0:11434"` then restart Ollama
- **Linux** (systemd): `systemctl edit ollama.service`, add `Environment="OLLAMA_HOST=0.0.0.0:11434"` under `[Service]`, then `systemctl daemon-reload && systemctl restart ollama`
- **Windows**: System Settings > Environment Variables, then restart Ollama
## Tips and Tricks
- **Preload a model** for faster first response: `ollama run <model> ""` (sends empty prompt, loads into memory)
- **Keep a model loaded longer**: set `OLLAMA_KEEP_ALIVE=1h` or use the `keep_alive` API parameter per request. Use `-1` to keep loaded indefinitely
- **Check GPU usage**: `ollama ps` shows the GPU/CPU memory split per loaded model
- **Disable cloud features**: set `OLLAMA_NO_CLOUD=1` to run fully local (loses cloud models and web search)
- **WSL networking slow**: disable "Large Send Offload Version 2" (IPv4 and IPv6) on the vEthernet (WSL) adapter
- **Docker GPU**: requires `nvidia-container-toolkit`; not available on macOS Docker Desktop (no GPU passthrough)
- **Model storage locations**: macOS `~/.ollama/models`, Linux `/usr/share/ollama/.ollama/models`, Windows `C:\Users\%username%\.ollama\models`
- **Proxy**: use `HTTPS_PROXY` only (Ollama doesn't use HTTP for model pulls)
- **Disable auto-start**: Windows: Task Manager > Startup apps > disable Ollama. macOS: Settings > Login Items > disable Ollama
## References
- Official Website: https://ollama.com/
- Documentation: https://docs.ollama.com
- CLI reference: https://docs.ollama.com/cli
- FAQ: https://docs.ollama.com/faq
- List of models: https://ollama.com/search
- Blog: https://ollama.com/blog
- Source code: https://github.com/ollama/ollama
- API documentation: https://github.com/ollama/ollama/blob/main/docs/api.md
- Modelfile reference: https://github.com/ollama/ollama/blob/main/docs/modelfile.md
- OpenClaw integration: https://docs.ollama.com/integrations/openclaw
- Discord community: https://discord.com/invite/ollama
## Related
- [[Large Language Models (LLMs)]]
- [[OpenClaw]]
- [[GLM OCR]]
- [[Transcriber plugin for Obsidian]]
- [[Companion plugin for Obsidian]]