# Constrained Decoding
Constrained decoding restricts an LLM's token sampling step so that its output must conform to a given schema, regex, or formal grammar. Because tokens that would break the constraint are masked out before sampling, the model cannot emit structurally invalid output. It is the mechanism behind [[LLM Structured Outputs]] and many [[LLM Tool Calling]] implementations.
## How It Works
At each generation step:
1. The model produces logits over the entire vocabulary
2. A schema-aware mask sets logits to -∞ for tokens that would violate the constraint
3. Sampling proceeds normally over the remaining valid tokens
4. The chosen token advances the constraint-tracking state machine
5. Repeat until the constraint is fully satisfied (e.g., the closing brace of the top-level JSON object) and generation stops
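
The loop above can be sketched in a few lines of Python. This is a toy illustration, not any library's implementation: `VOCAB`, `ALLOWED`, `fake_logits`, and `constrained_generate` are hypothetical names, the "model" is a random number generator, and the constraint is a hard-coded list of permitted token sequences standing in for a real schema or grammar engine.

```python
import math
import random

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["{", "}", '"ok"', ":", "true", "false", "hello", "<eos>"]

# Hypothetical constraint: the output must be one of these token sequences.
ALLOWED = [
    ["{", '"ok"', ":", "true", "}", "<eos>"],
    ["{", '"ok"', ":", "false", "}", "<eos>"],
]

def valid_next_tokens(prefix):
    """Tokens that keep the prefix extendable to an allowed sequence."""
    n = len(prefix)
    return {seq[n] for seq in ALLOWED if seq[:n] == prefix and len(seq) > n}

def fake_logits(rng):
    """Stand-in for a model forward pass: random scores over the vocabulary."""
    return [rng.uniform(-1.0, 1.0) for _ in VOCAB]

def sample(logits, rng):
    """Softmax sampling; tokens with -inf logits get probability zero."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def constrained_generate(seed=0):
    rng = random.Random(seed)
    prefix = []  # the constraint-tracking state is just the tokens so far
    while True:
        logits = fake_logits(rng)                       # step 1: raw logits
        allowed = valid_next_tokens(prefix)
        masked = [x if tok in allowed else -math.inf    # step 2: mask
                  for tok, x in zip(VOCAB, logits)]
        token = VOCAB[sample(masked, rng)]              # step 3: sample
        prefix.append(token)                            # step 4: advance state
        if token == "<eos>":                            # step 5: constraint done
            return "".join(prefix[:-1])

print(constrained_generate(seed=0))
```

Whatever the random logits are, the result is either `{"ok":true}` or `{"ok":false}`; the token `hello` can never be sampled because its logit is always masked. Production engines replace the `ALLOWED` list with a compiled automaton or parser so the valid-token set can be computed efficiently at each step.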
## Approaches
| Method | Best For | Trade-off |
|---|---|---|
| Regex-based | Phone numbers, dates, simple patterns | Limited expressiveness |
| JSON Schema | Structured data outputs | Most common and well-supported; supported schema features vary by runtime |
| Context-free grammar (BNF/EBNF) | Languages like SQL, custom DSLs | Slowest, most expressive |
| Choice / enum | Pick one from a fixed list | Trivial to implement |
| Lark / GBNF (llama.cpp) | Hand-written grammars | Powerful, niche tooling |
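
To make the regex row concrete, here is a minimal sketch of the state machine a regex-based constraint might track for the date pattern `\d{4}-\d{2}-\d{2}`. It assumes character-level tokens for simplicity (real engines map multi-character tokens onto automaton transitions), and `PATTERN`, `allowed_chars`, and `accepts` are hypothetical names, not part of any library.

```python
import string

# Hand-built automaton for \d{4}-\d{2}-\d{2}.
# The state is simply the number of characters consumed so far.
PATTERN = "dddd-dd-dd"  # 'd' = any digit, '-' = literal hyphen

def allowed_chars(state):
    """Characters the mask would leave unblocked in the given state."""
    if state == len(PATTERN):
        return set()  # accepting state: only end-of-sequence may follow
    if PATTERN[state] == "d":
        return set(string.digits)
    return {PATTERN[state]}

def accepts(text):
    """Replay a string through the automaton, as the decoder would token by token."""
    state = 0
    for ch in text:
        if ch not in allowed_chars(state):
            return False
        state += 1
    return state == len(PATTERN)

print(sorted(allowed_chars(4)))   # only the hyphen is allowed after four digits
print(accepts("2024-01-31"))
print(accepts("2024-99-99"))      # accepted: the pattern checks shape, not calendar validity
```

The last call illustrates the "limited expressiveness" trade-off: the pattern enforces the shape of a date but cannot rule out month 99.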
## Popular Implementations
- **Outlines** — Python library with regex, JSON Schema, CFG support
- **Guidance** — programmatic templates with constraints
- **lm-format-enforcer** — JSON Schema, regex, choice
- **OpenAI Structured Outputs** — server-side JSON Schema enforcement
- **llama.cpp grammars** — GBNF format for local inference
- **vLLM** — guided decoding via outlines/lm-format-enforcer
- **xgrammar** — fast CFG-based engine
## Why It Matters
Without constrained decoding, applications need:
- Defensive parsing
- Retry loops on malformed output
- Output validators with fallback logic
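With constrained decoding, applications can rely on the output's shape: the LLM behaves like a component with a fixed output contract rather than one whose format varies from call to call. The guarantee covers structure only; the values inside a well-formed response can still be wrong and may need semantic validation.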
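
For contrast, the unconstrained workflow listed above often ends up looking like the following sketch. `parse_with_retries` and its `generate` argument are hypothetical: `generate` stands for any function that sends a prompt to a model and returns raw text.

```python
import json

def parse_with_retries(generate, prompt, required_keys, max_attempts=3):
    """Call `generate` until it returns a JSON object containing required_keys."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            data = json.loads(raw)                    # defensive parsing
        except json.JSONDecodeError as exc:
            last_error = exc
            continue                                  # retry on malformed output
        if isinstance(data, dict) and required_keys.issubset(data):
            return data                               # validator passed
        last_error = ValueError(f"missing keys in {raw!r}")
    # Fallback: give up and surface the last failure to the caller.
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error

# Usage with a stubbed model that fails twice before producing valid output.
responses = iter(["not json", '{"name": "x"}', '{"name": "x", "age": 3}'])
result = parse_with_retries(lambda p: next(responses), "describe x", {"name", "age"})
print(result)
```

With a constrained backend, the parse and key checks cannot fail for structural reasons, so this wrapper shrinks to a single call plus whatever semantic checks the application still needs.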
## Trade-offs
**Strengths:** output is guaranteed to match the constraint, a whole class of parsing bugs disappears, and application code gets simpler
**Limitations:**
- Slight per-token latency overhead from masking, plus a one-time cost to compile the schema or grammar
- Can degrade reasoning quality if constraints are too tight ("schema-pushed" outputs that satisfy the schema but lose semantic content)
- Combining streaming with constraints can be tricky (some schemas can't be validated incrementally, so a partial response may not be checkable until it is complete)
- Not all runtimes implement it equivalently — schema feature support varies
## Where It Shows Up in This Vault
- [[LLM Structured Outputs]] — the user-facing API surface
- [[LLM Tool Calling]] — special case (output must match a tool signature)
- W3C [[Prompt API]] `responseConstraint` field
## References
- https://github.com/dottxt-ai/outlines
- https://platform.openai.com/docs/guides/structured-outputs
## Related
- [[LLM Structured Outputs]]
- [[LLM Tool Calling]]
- [[Large Language Models (LLMs)]]
- [[AI Inference]]
- [[Prompt API]]
- [[Browser-Provided Language Models]]