# AI Guardrails
AI guardrails are constraints applied to AI systems to prevent harmful, unintended, or low-quality outputs. They operate at multiple levels: training-time alignment, runtime filtering, and system-level restrictions.
## Types
**Input guardrails** filter what reaches the model:
- [[Prompt injection]] detection and blocking
- Content moderation on user inputs
- Input validation and sanitization
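
As a rough illustration of the input checks above, the following sketch combines sanitization, a length limit, and a keyword heuristic for prompt injection. Everything named here (`MAX_INPUT_CHARS`, `SUSPICIOUS_PATTERNS`, `check_input`) is hypothetical and chosen for the example. Keyword matching is easy to evade, so a real deployment would likely pair it with a dedicated classifier or moderation service.

```python
import re
from dataclasses import dataclass

MAX_INPUT_CHARS = 4000  # hypothetical limit, tune per application

# Illustrative phrases only; not a complete or robust injection detector.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your |the )?system prompt", re.IGNORECASE),
]


@dataclass
class InputCheck:
    allowed: bool
    text: str
    reason: str = ""


def sanitize(text: str) -> str:
    """Drop non-printable control characters, keeping newlines and tabs."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def check_input(raw: str) -> InputCheck:
    """Return whether the input may be forwarded to the model, and why not."""
    text = sanitize(raw).strip()
    if not text:
        return InputCheck(False, text, "empty input")
    if len(text) > MAX_INPUT_CHARS:
        return InputCheck(False, text, "input too long")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            return InputCheck(False, text, "possible prompt injection")
    return InputCheck(True, text)
```

A caller would forward `check_input(user_text).text` to the model only when `allowed` is true, and log `reason` otherwise.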
**Output guardrails** filter what leaves the model:
- Toxicity and bias detection
- Hallucination detection (cross-referencing claims against known sources)
- Format validation (ensuring structured outputs match expected schemas)
- Confidence thresholds (flagging low-confidence responses)
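
Two of the output checks above, format validation and confidence thresholds, can be sketched in a few lines. The schema (`REQUIRED_FIELDS`), the threshold (`MIN_CONFIDENCE`), and the assumption that the caller already has a numeric confidence score are all hypothetical; many models do not expose such a score directly, and toxicity or hallucination checks would need separate tooling.

```python
import json
from typing import Any

REQUIRED_FIELDS = {"answer": str, "sources": list}  # hypothetical schema
MIN_CONFIDENCE = 0.7  # hypothetical threshold


def validate_output(raw: str, confidence: float) -> dict[str, Any]:
    """Check that a model response is a JSON object with the expected fields.

    Returns a dict with `ok`, a list of `flags`, and the parsed `data`.
    """
    result: dict[str, Any] = {"ok": False, "flags": [], "data": None}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        result["flags"].append("invalid_json")
        return result
    if not isinstance(data, dict):
        result["flags"].append("not_an_object")
        return result
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            result["flags"].append(f"missing:{name}")
        elif not isinstance(data[name], expected_type):
            result["flags"].append(f"wrong_type:{name}")
    # Flag rather than block: the caller decides how to handle low confidence.
    if confidence < MIN_CONFIDENCE:
        result["flags"].append("low_confidence")
    result["ok"] = not result["flags"]
    result["data"] = data
    return result
```

Returning flags instead of raising lets the surrounding system choose between retrying, routing to a human, or showing the response with a warning.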
**Action guardrails** constrain what agents can do:
- Permission systems (which tools an agent can use, which files it can modify)
- Human-in-the-loop approval for high-risk actions (sending messages, deleting data, deploying code)
- Rate limiting and cost caps
- Sandbox environments for code execution
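
A minimal sketch of how permissions, human approval, and a call cap might fit together is shown below. It is a generic illustration, not the implementation of any particular agent product: `ActionGuard`, the policy values, and the `approve` callback are hypothetical names, and sandboxing is out of scope here.

```python
from dataclasses import dataclass, field
from typing import Callable

ALLOW, ASK, DENY = "allow", "ask", "deny"


@dataclass
class ActionGuard:
    policy: dict[str, str]                # tool name -> ALLOW / ASK / DENY
    approve: Callable[[str, dict], bool]  # human-in-the-loop callback
    max_calls: int = 20                   # simple cap on authorized calls
    calls_made: int = field(default=0, init=False)

    def authorize(self, tool: str, args: dict) -> bool:
        """Return True only if the agent may run this tool call."""
        if self.calls_made >= self.max_calls:
            return False
        # Tools missing from the policy are denied by default.
        decision = self.policy.get(tool, DENY)
        if decision == DENY:
            return False
        if decision == ASK and not self.approve(tool, args):
            return False
        self.calls_made += 1
        return True


guard = ActionGuard(
    policy={"read_file": ALLOW, "delete_file": ASK},
    approve=lambda tool, args: False,  # stand-in for a real approval prompt
    max_calls=2,
)
```

The default-deny lookup means a newly added tool stays unavailable until someone decides how risky it is, which is usually the safer failure mode.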
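In [[Agentic Engineering]], action guardrails are critical because the [[Agentic loops|agentic loop]] lets the model choose and execute actions without a human reviewing each step. [[Claude Code]] implements action guardrails through its permission system: some tool calls require explicit user approval. The [[Lethal Trifecta for AI Agents]] describes the combination of capabilities (access to private data, exposure to untrusted content, and the ability to communicate externally) that makes guardrail failures in agentic systems especially dangerous.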
Guardrails complement but don't replace [[AI Alignment]] (training the model to want the right things) and [[Responsible AI]] practices (organizational policies for safe deployment).
## Related
- [[AI Safety]]
- [[AI Alignment]]
- [[AI Hallucination]]
- [[Prompt injection]]
- [[Responsible AI]]
- [[Agentic Engineering]]
- [[Agentic loops]]
- [[AI Agent Harness]]
- [[Claude Code]]
- [[Lethal Trifecta for AI Agents]]
- [[EU AI Act]]