# SGLang

SGLang is an open-source serving framework for [[Large Language Models (LLMs)]] and vision-language models, optimized for **structured generation** and **high-throughput batched inference**. Born from a 2024 Berkeley/Stanford effort, by 2026 it is one of the two dominant production-scale LLM runtimes alongside [[vLLM]]: adopted by xAI for Grok serving, by NVIDIA's reference stack, and by a long list of inference providers.

The core idea: most LLM workloads are not "one prompt → one response." They are **programs**: multi-turn agents, RAG pipelines, JSON-constrained extraction, tool-using loops, parallel branches. SGLang gives those programs a runtime that *understands* their structure (shared prefixes, branches, JSON schemas) and exploits it for speed.

Apache 2.0 license.

## What sets it apart

- **RadixAttention**: automatic prefix caching across requests via a radix tree of KV caches. Many parallel requests with shared system prompts or few-shot examples become near-free after the first one. This is the headline feature, and the reason SGLang often beats vLLM on agent/RAG workloads.
- **Structured output as a first-class primitive**: JSON, regex, grammar-constrained generation with negligible overhead via compressed FSMs. Critical for tool-calling agents that need valid output 100% of the time, not 95%.
- **Speculative decoding support**: paired draft/target models, including co-designed [[AI Multi-Token Prediction Drafters]] like the [[Gemma 4]] release. See [[Speculative Decoding]].
- **Tensor / pipeline / data parallelism**: multi-GPU and multi-node serving for frontier-scale models.
- **Continuous batching** with token-level scheduling: same baseline trick as vLLM, but composed differently.
- **OpenAI-compatible API**: drop-in for code already targeting OpenAI's HTTP shape.

## SGLang vs vLLM

The two compete head-on.
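Prefix sharing is the axis on which the two runtimes differ most, so a toy illustration of the idea may help. The sketch below is not SGLang's implementation: `ToyPrefixCache`, `Node`, and `serve` are hypothetical names, the "tokens" are plain integers, and an uncompressed trie stands in for the radix tree. It only shows why requests that share a long system prompt need very little fresh prefill work after the first one.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Children keyed by token id. A real radix tree would compress
    # single-child chains into one edge; this toy trie does not.
    children: dict[int, "Node"] = field(default_factory=dict)


class ToyPrefixCache:
    """Hypothetical stand-in for a KV-cache index keyed by token prefixes."""

    def __init__(self) -> None:
        self.root = Node()

    def match_len(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record the full token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())

    def serve(self, tokens: list[int]) -> int:
        """Return the number of tokens that still need prefill for this request."""
        reused = self.match_len(tokens)
        self.insert(tokens)
        return len(tokens) - reused


# 100 shared "system prompt" token ids, followed by 2 request-specific ones.
system_prompt = list(range(100))
cache = ToyPrefixCache()
costs = [cache.serve(system_prompt + [1000 + i, 2000 + i]) for i in range(3)]
print(costs)  # [102, 2, 2]
```

The first request pays for all 102 tokens; the next two only pay for their 2 unique tokens. A production cache also has to handle eviction, memory limits, and scheduling, none of which this sketch attempts.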
Rough breakdown circa 2026:

| Axis | SGLang | vLLM |
|---|---|---|
| Prefix caching | RadixAttention (automatic, cross-request) | Prefix caching (configurable, less aggressive) |
| Structured output | First-class, compressed FSM | Available but layered on |
| Agent/RAG workloads | Often faster (prefix sharing dominates) | Competitive on simple chat |
| Single-prompt latency | Competitive | Competitive |
| Datacenter throughput | Excellent | Excellent |
| Ecosystem maturity | Younger, growing fast | Wider adoption, more battle-tested |

For agent-shaped workloads with lots of shared context, SGLang usually wins. For simple chat at extreme scale, both are within a few percent; pick by ergonomics and team familiarity.

## Where it doesn't fit

- **Local inference on consumer hardware.** Use [[Ollama]] or [[MLX]] (Mac) instead. SGLang assumes a GPU server or cluster.
- **Non-CUDA hardware.** CUDA-first; ROCm support exists but trails. Apple Silicon is not the target.

## Quickstart

```sh
pip install --upgrade pip
pip install "sglang[all]"

python -m sglang.launch_server \
  --model-path google/gemma-4-31b-it \
  --port 30000
```

Then call it via the OpenAI-compatible HTTP endpoint at `http://localhost:30000/v1`.

## References

- Source: <https://github.com/sgl-project/sglang>
- Documentation: <https://docs.sglang.ai/>
- RadixAttention paper: <https://arxiv.org/abs/2312.07104>
- License: Apache 2.0

## Related

- [[Large Language Models (LLMs)]]
- [[AI Inference]]
- [[vLLM]]
- [[MLX]]
- [[Ollama]]
- [[Speculative Decoding]]
- [[AI Multi-Token Prediction Drafters]]
- [[Gemma 4]]
- [[Transformers]]