
# Constrained Decoding

The technique of restricting an LLM's token sampling step so its output must conform to a given schema, regex, or formal grammar. By construction, the model cannot emit invalid output. The mechanism behind [[LLM Structured Outputs]] and most [[LLM Tool Calling]] implementations.

## How It Works

At each generation step:

1. The model produces logits over the entire vocabulary
2. A schema-aware mask sets logits to -∞ for tokens that would violate the constraint
3. Sampling proceeds normally over the remaining valid tokens
4. The chosen token advances the constraint-tracking state machine
5. Repeat until the constraint is fully satisfied (e.g., closing brace for JSON)

## Approaches

| Method | Best For | Trade-off |
|---|---|---|
| Regex-based | Phone numbers, dates, simple patterns | Limited expressiveness |
| JSON Schema | Structured data outputs | Most common; well-supported |
| Context-free grammar (BNF/EBNF) | Languages like SQL, custom DSLs | Slowest, most expressive |
| Choice / enum | Pick one from a fixed list | Trivial to implement |
| Lark / GBNF (llama.cpp) | Hand-written grammars | Powerful, niche tooling |

## Popular Implementations

- **Outlines**: Python library with regex, JSON Schema, CFG support
- **Guidance**: programmatic templates with constraints
- **lm-format-enforcer**: JSON Schema, regex, choice
- **OpenAI Structured Outputs**: server-side JSON Schema enforcement
- **llama.cpp grammars**: GBNF format for local inference
- **vLLM**: guided decoding via outlines/lm-format-enforcer
- **xgrammar**: fast CFG-based engine

## Why It Matters

Without constrained decoding, applications need:

- Defensive parsing
- Retry loops on malformed output
- Output validators with fallback logic

With constrained decoding, applications can rely on the output shape: the LLM becomes a reliable system component, not a source of randomness.
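
The five-step loop under "How It Works" can be sketched in plain Python. This is a toy illustration, not how any of the libraries above are implemented: the vocabulary, the `fake_logits` stand-in for a model, and the `\d\d-\d\d` constraint are all hypothetical, chosen so that each token is one character and the state machine is a simple list.

```python
import math
import random

DIGITS = set("0123456789")
# Hypothetical toy vocabulary of single-character tokens plus an end marker.
VOCAB = sorted(DIGITS) + ["-", "a", "b", "<eos>"]
# Hypothetical constraint: the pattern \d\d-\d\d, as a linear state machine.
# State i permits exactly the tokens in ALLOWED_BY_STATE[i].
ALLOWED_BY_STATE = [DIGITS, DIGITS, {"-"}, DIGITS, DIGITS, {"<eos>"}]


def fake_logits(rng):
    """Stand-in for a model forward pass: one logit per vocabulary entry."""
    return [rng.uniform(-2.0, 2.0) for _ in VOCAB]


def mask_logits(logits, allowed):
    """Step 2: set the logit of every constraint-violating token to -inf."""
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def sample(logits, rng):
    """Step 3: softmax sampling; masked tokens get weight exp(-inf) == 0."""
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    return rng.choices(VOCAB, weights=weights, k=1)[0]


def constrained_generate(rng):
    state, out = 0, []
    while state < len(ALLOWED_BY_STATE):  # step 5: repeat until satisfied
        logits = fake_logits(rng)  # step 1: logits over the whole vocabulary
        masked = mask_logits(logits, ALLOWED_BY_STATE[state])
        token = sample(masked, rng)
        if token != "<eos>":
            out.append(token)
        state += 1  # step 4: the chosen token advances the state machine
    return "".join(out)


if __name__ == "__main__":
    print(constrained_generate(random.Random(0)))
```

Even though the fake model assigns arbitrary logits to `a`, `b`, and misplaced dashes, those tokens can never be sampled. Real engines do the same thing against a tokenizer whose tokens span multiple characters, which is where most of the implementation complexity lies.
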
## Trade-offs

**Strengths:** guaranteed valid output, eliminates a class of bugs, simpler app code

**Limitations:**

- Slight latency overhead from token-level masking
- Can degrade reasoning quality if constraints are too tight ("schema-pushed" outputs that satisfy the schema but lose semantic content)
- Streaming + constraints can be tricky (some schemas can't be validated incrementally)
- Not all runtimes implement it equivalently: schema feature support varies

## Where It Shows Up in This Vault

- [[LLM Structured Outputs]]: the user-facing API surface
- [[LLM Tool Calling]]: special case (output must match a tool signature)
- W3C [[Prompt API]] `responseConstraint` field

## References

- https://github.com/dottxt-ai/outlines
- https://platform.openai.com/docs/guides/structured-outputs

## Related

- [[LLM Structured Outputs]]
- [[LLM Tool Calling]]
- [[Large Language Models (LLMs)]]
- [[AI Inference]]
- [[Prompt API]]
- [[Browser-Provided Language Models]]