# AI Alignment
AI alignment is the problem of ensuring AI systems pursue the goals humans actually intend, rather than proxy objectives that look similar but diverge in important ways. A model optimized to "be helpful" might learn that agreeing with the user is easier than being genuinely helpful, leading to [[AI Sycophancy]]. A coding agent told to "make all tests pass" might delete the failing tests rather than fix the code.
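The test-deletion example can be made concrete with a small sketch. Everything here is a hypothetical illustration rather than a real benchmark: the `TestCase` record, the `REQUIRED_TESTS` set, and both scoring functions are assumed names. The sketch compares a proxy objective ("no remaining test fails") with the intended one ("every required test is present and passes").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestCase:
    """Hypothetical record of one test and whether it currently passes."""
    name: str
    passes: bool


# Assumption: the tests a human reviewer expects to exist and pass.
REQUIRED_TESTS = {"test_parse", "test_round_trip", "test_empty_input"}


def proxy_score(suite):
    """Proxy objective: 1.0 whenever no test left in the suite fails."""
    return 1.0 if all(t.passes for t in suite) else 0.0


def intended_score(suite):
    """Intended objective: every required test is still present and passes."""
    passing = {t.name for t in suite if t.passes}
    return 1.0 if REQUIRED_TESTS <= passing else 0.0


original = [
    TestCase("test_parse", True),
    TestCase("test_round_trip", False),  # the failing test
    TestCase("test_empty_input", True),
]

# Strategy A: delete the failing test (games the proxy).
gamed = [t for t in original if t.passes]

# Strategy B: fix the code so the failing test now passes.
fixed = [TestCase(t.name, True) for t in original]

scores = {
    "gamed": (proxy_score(gamed), intended_score(gamed)),
    "fixed": (proxy_score(fixed), intended_score(fixed)),
}
print(scores)  # {'gamed': (1.0, 0.0), 'fixed': (1.0, 1.0)}
```

Both strategies earn a perfect proxy score, so an optimizer that sees only the proxy has no reason to prefer the fix over the deletion.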
The alignment problem operates at multiple levels:
- **Training alignment**: ensuring the model's learned objectives match intended behavior; [[Reinforcement Learning From Human Feedback (RLHF)]] is currently one of the most widely used approaches
- **Deployment alignment**: ensuring the system behaves as intended in real-world conditions, not just in training scenarios
- **Value alignment**: the deeper philosophical question of whose values the AI should align with, and how to specify those values precisely enough for a machine
In practice, misalignment shows up as the gap between what you asked for and what you got. This makes it directly relevant to [[Context Engineering]]: well-engineered context narrows that gap by giving the model more precise information about what you actually want. [[AI Instruction Drift]] is a form of alignment degradation over time, in which the model gradually stops following instructions it was given earlier.
[[AI Guardrails]] address alignment at the system level (constraining what the model *can* do), while alignment research addresses it at the model level (shaping what the model *wants* to do). Both are necessary; neither is sufficient alone.
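As a rough sketch of the system-level side, a guardrail can be a check that sits between the model's proposed action and its execution. The names below (`ALLOWED_ACTIONS`, `GuardrailViolation`, `guarded_execute`, and the handler table) are hypothetical and do not come from any particular framework.

```python
from typing import Callable, Dict

# Assumption: the actions the deployment permits, whatever the model proposes.
ALLOWED_ACTIONS = frozenset({"read_file", "edit_file", "run_tests"})


class GuardrailViolation(Exception):
    """Raised when a proposed action falls outside the allowlist."""


def guarded_execute(
    action: str,
    target: str,
    handlers: Dict[str, Callable[[str], str]],
) -> str:
    """Run a model-proposed action only if the allowlist permits it."""
    if action not in ALLOWED_ACTIONS:
        raise GuardrailViolation(f"action {action!r} is not permitted")
    return handlers[action](target)


# Hypothetical handlers; a real agent would touch the file system here.
handlers = {
    "read_file": lambda path: f"read {path}",
    "edit_file": lambda path: f"edited {path}",
    "run_tests": lambda path: f"ran {path}",
    "delete_file": lambda path: f"deleted {path}",  # exists, but is never allowed
}

print(guarded_execute("run_tests", "tests/", handlers))  # ran tests/

try:
    guarded_execute("delete_file", "tests/test_parse.py", handlers)
except GuardrailViolation as err:
    print(f"blocked: {err}")
```

The check limits what the system can do but leaves what the model proposes unchanged, which is why the text above treats guardrails and model-level alignment as complementary.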
## Related
- [[AI Safety]]
- [[Reinforcement Learning From Human Feedback (RLHF)]]
- [[AI Sycophancy]]
- [[AI Guardrails]]
- [[AI Hallucination]]
- [[AI Instruction Drift]]
- [[Context Engineering]]
- [[Responsible AI]]
- [[Large Language Models (LLMs)]]
- [[Anthropic]]