# LLM Structured Outputs
Constraining a [[Large Language Models (LLMs)|LLM]]'s output to conform to a specific format — typically a JSON Schema, regular expression, or grammar. Eliminates parsing ambiguity, prevents malformed responses, and makes LLMs reliable as components in larger systems.
## How It Works
The runtime intercepts the model's token sampling step. At each step:
1. The model proposes a probability distribution over the next token
2. The constrained-decoding layer masks tokens that would violate the schema
3. The highest-probability allowed token is chosen
4. Process repeats until the constraint is fully satisfied
This is sometimes called constrained decoding or guided generation. Common implementations: `outlines`, `guidance`, `lm-format-enforcer`, OpenAI's structured outputs mode.
## What You Can Constrain
- **JSON Schema**: most common; full type system, required fields, enums, regex patterns
- **Regex**: simple format constraints (phone numbers, dates, IDs)
- **Grammar (BNF/EBNF)**: arbitrary formal languages (SQL, custom DSLs)
- **Choice**: pick from a fixed set of strings
## Why It Matters
Without structured outputs:
- LLM may emit Markdown around the JSON
- Property names may be misspelled
- Required fields may be omitted
- Numeric values may be quoted
- Apps need defensive parsing and fallbacks
With structured outputs: parse with confidence. The LLM cannot produce invalid output — by construction.
## Where It Shows Up
| API | Mechanism |
|---|---|
| OpenAI API | `response_format: { type: "json_schema", schema }` |
| Anthropic Claude API | Tool use with strict schemas |
| W3C [[Prompt API]] | `responseConstraint` (JSON Schema or regex) |
| [[Gemini Nano]] | Supported via Prompt API |
| Local runtimes (llama.cpp, vLLM) | Grammar-based constrained decoding |
## Trade-offs
- **Latency**: slight overhead from token-level masking
- **Quality**: very tight constraints can degrade reasoning ("schema-pushed" outputs)
- **Schema complexity**: deeply nested or recursive schemas may not be supported by all runtimes
## Relationship to Tool Calling
[[LLM Tool Calling]] is a special case of structured outputs: the schema is the union of registered tool signatures plus their parameters. Many runtimes share the same constrained-decoding implementation for both.
## References
- https://github.com/webmachinelearning/prompt-api
- https://platform.openai.com/docs/guides/structured-outputs
## Related
- [[Large Language Models (LLMs)]]
- [[LLM Tool Calling]]
- [[LLM Streaming]]
- [[Prompt API]]
- [[Gemini Nano]]
- [[Browser-Provided Language Models]]
- [[Constrained Decoding]]