# SGLang
SGLang is an open-source serving framework for [[Large Language Models (LLMs)]] and vision-language models, optimized for **structured generation** and **high-throughput batched inference**. Born from a 2024 Berkeley/Stanford effort, by 2026 it is one of the two dominant production-scale LLM runtimes alongside [[vLLM]] — adopted by xAI for Grok serving, by NVIDIA's reference stack, and by a long list of inference providers.
The core idea: most LLM workloads are not "one prompt → one response." They are **programs** — multi-turn agents, RAG pipelines, JSON-constrained extraction, tool-using loops, parallel branches. SGLang gives those programs a runtime that *understands* their structure (shared prefixes, branches, JSON schemas) and exploits it for speed.
Apache 2.0 license.
## What sets it apart
- **RadixAttention** — automatic prefix caching across requests via a radix tree of KV caches. Many parallel requests with shared system prompts or few-shot examples become near-free after the first one. This is the headline feature, and the reason SGLang often beats vLLM on agent/RAG workloads.
- **Structured output as a first-class primitive** — JSON, regex, grammar-constrained generation with negligible overhead via compressed FSMs. Critical for tool-calling agents that need valid output 100% of the time, not 95%.
- **Speculative decoding support** — paired draft/target models, including co-designed [[AI Multi-Token Prediction Drafters]] like the [[Gemma 4]] release. See [[Speculative Decoding]].
- **Tensor / pipeline / data parallelism** — multi-GPU and multi-node serving for frontier-scale models.
- **Continuous batching** with token-level scheduling — same baseline trick as vLLM, but composed differently.
- **OpenAI-compatible API** — drop-in for code already targeting OpenAI's HTTP shape.
## SGLang vs vLLM
The two compete head-on. Rough breakdown circa 2026:
| Axis | SGLang | vLLM |
|---|---|---|
| Prefix caching | RadixAttention (automatic, cross-request) | Prefix caching (configurable, less aggressive) |
| Structured output | First-class, compressed FSM | Available but layered on |
| Agent/RAG workloads | Often faster (prefix sharing dominates) | Competitive on simple chat |
| Single-prompt latency | Competitive | Competitive |
| Datacenter throughput | Excellent | Excellent |
| Ecosystem maturity | Younger, growing fast | Wider adoption, more battle-tested |
For agent-shaped workloads with lots of shared context, SGLang usually wins. For simple chat at extreme scale, both are within a few percent; pick by ergonomics and team familiarity.
## Where it doesn't fit
- **Local inference on consumer hardware.** Use [[Ollama]] or [[MLX]] (Mac) instead. SGLang assumes a GPU server or cluster.
- **Non-CUDA hardware.** CUDA-first; ROCm support exists but trails. Apple Silicon is not the target.
## Quickstart
```sh
pip install --upgrade pip
pip install "sglang[all]"
python -m sglang.launch_server \
--model-path google/gemma-4-31b-it \
--port 30000
```
Then call it via the OpenAI-compatible HTTP endpoint at `http://localhost:30000/v1`.
## References
- Source: <https://github.com/sgl-project/sglang>
- Documentation: <https://docs.sglang.ai/>
- RadixAttention paper: <https://arxiv.org/abs/2312.07104>
- License: Apache 2.0
## Related
- [[Large Language Models (LLMs)]]
- [[AI Inference]]
- [[vLLM]]
- [[MLX]]
- [[Ollama]]
- [[Speculative Decoding]]
- [[AI Multi-Token Prediction Drafters]]
- [[Gemma 4]]
- [[Transformers]]