# Reinforcement Learning From Human Feedback (RLHF)

RLHF is the training technique that turned raw [[Large Language Models (LLMs)]] from next-token predictors into useful assistants.

The process has three stages:

1. **Supervised fine-tuning**: the base model is trained on curated examples of desired behavior (high-quality conversations, helpful answers).
2. **Reward model training**: human evaluators rank multiple model outputs for the same prompt, and a separate model learns to predict which outputs humans prefer.
3. **Reinforcement learning**: the LLM is optimized to produce outputs that score highly with the reward model, using algorithms like PPO (Proximal Policy Optimization).

The result is a model that's not just predicting likely text, but producing text that humans judge as helpful, harmless, and honest. This is what makes [[Claude]], [[ChatGPT]], and other assistants feel cooperative rather than chaotic.

RLHF is one of the main mechanisms behind [[AI Alignment]] in practice. It's also imperfect: the reward model can have blind spots, leading to [[AI Sycophancy]] (the model learns that agreeable answers get higher human ratings) and mode collapse (the model converges on safe, generic responses rather than taking useful risks).

Alternatives and extensions include RLAIF (AI feedback instead of human), Constitutional AI ([[Anthropic]]'s approach where the model self-critiques against a set of principles), and DPO (Direct Preference Optimization, which skips the separate reward model and trains directly on preference pairs).

## Related

- [[AI Alignment]]
- [[AI Safety]]
- [[AI Sycophancy]]
- [[Large Language Models (LLMs)]]
- [[Machine Learning (ML)]]
- [[Deep Learning]]
- [[Anthropic]]
- [[Claude]]
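
## Example: preference losses

The reward model stage and DPO both learn from pairs of outputs where a human preferred one over the other. The sketch below shows the per-pair losses commonly used for each, on plain floats so it runs without an ML framework. The function names, the scalar inputs, and the `beta=0.1` default are illustrative assumptions, not taken from any particular library. A real setup would compute these over batches of model outputs.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss for one human comparison.

    The loss shrinks as the reward model scores the human-preferred
    output higher than the rejected one.
    """
    return -log_sigmoid(score_chosen - score_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair, with no separate reward model.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model (typically the SFT model).
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Reward model agrees with the human ranking: small loss.
    print(reward_model_loss(2.0, 0.0))
    # Reward model disagrees with the human ranking: large loss.
    print(reward_model_loss(0.0, 2.0))
    # Policy has shifted probability toward the chosen output.
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

When the two scores are equal, or when the policy is identical to the reference, both losses equal log 2, which corresponds to a 50/50 guess about which output the human preferred.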