AI Guardrails - DeveloPassion

# AI Guardrails

AI guardrails are constraints applied to AI systems to prevent harmful, unintended, or low-quality outputs. They operate at multiple levels: training-time alignment, runtime filtering, and system-level restrictions.

## Types

**Input guardrails** filter what reaches the model:

- [[Prompt injection]] detection and blocking
- Content moderation on user inputs
- Input validation and sanitization

**Output guardrails** filter what leaves the model:

- Toxicity and bias detection
- Hallucination detection (cross-referencing claims against known sources)
- Format validation (ensuring structured outputs match expected schemas)
- Confidence thresholds (flagging low-confidence responses)

**Action guardrails** constrain what agents can do:

- Permission systems (which tools an agent can use, which files it can modify)
- Human-in-the-loop approval for high-risk actions (sending messages, deleting data, deploying code)
- Rate limiting and cost caps
- Sandbox environments for code execution

In [[Agentic Engineering]], action guardrails are critical because the [[Agentic loops|agentic loop]] gives the model autonomous power. [[Claude Code]] implements this through its permission system: some tool calls require explicit user approval. The [[Lethal Trifecta for AI Agents]] describes what happens when guardrails fail in agentic systems.

Guardrails complement but don't replace [[AI Alignment]] (training the model to want the right things) and [[Responsible AI]] practices (organizational policies for safe deployment).

## Related

- [[AI Safety]]
- [[AI Alignment]]
- [[AI Hallucination]]
- [[Prompt injection]]
- [[Responsible AI]]
- [[Agentic Engineering]]
- [[Agentic loops]]
- [[AI Agent Harness]]
- [[Claude Code]]
- [[Lethal Trifecta for AI Agents]]
- [[EU AI Act]]
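## Example: action guardrail sketch

The action guardrails listed under Types (a permission system, human-in-the-loop approval for high-risk actions, and a cap on calls) can be sketched as a single check that runs before every tool call. This is a minimal illustration under stated assumptions, not how [[Claude Code]] or any other harness actually implements its permissions: the class `ActionGuardrail`, the tool names, and the `approve` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical policy values for illustration; not taken from any real agent harness.
ALLOWED_TOOLS = {"read_file", "write_file", "send_message"}
HIGH_RISK_TOOLS = {"send_message", "delete_data", "deploy_code"}


@dataclass
class ActionGuardrail:
    approve: Callable[[str, dict], bool]  # human-in-the-loop callback (assumed interface)
    max_calls: int = 20                   # simple cap on approved tool calls per session
    calls_made: int = field(default=0, init=False)

    def check(self, tool: str, args: dict) -> tuple[bool, str]:
        """Return (allowed, reason) for a proposed tool call."""
        if tool not in ALLOWED_TOOLS:
            return False, "tool not permitted"
        if self.calls_made >= self.max_calls:
            return False, "call budget exhausted"
        if tool in HIGH_RISK_TOOLS and not self.approve(tool, args):
            return False, "approval denied"
        self.calls_made += 1  # only approved calls count against the budget
        return True, "ok"


# Usage: a stand-in approver that rejects every high-risk request.
guard = ActionGuardrail(approve=lambda tool, args: False, max_calls=2)
print(guard.check("read_file", {"path": "notes.md"}))    # (True, 'ok')
print(guard.check("send_message", {"to": "team"}))       # (False, 'approval denied')
print(guard.check("delete_data", {"table": "users"}))    # (False, 'tool not permitted')
```

The checks run from cheapest to most expensive, so a human is only asked to approve a call that has already passed the allowlist and the budget. A real system would likely also scope permissions to specific files or arguments and run code in a sandbox, which this sketch leaves out.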