Reinforcement Learning (RL)

# Reinforcement Learning (RL)

Reinforcement Learning is a branch of [[Machine Learning (ML)]] where an agent learns to make decisions by interacting with an environment. The agent takes actions, receives rewards or penalties, and adjusts its strategy (policy) to maximize cumulative reward over time. Unlike [[Supervised Learning (SL)]], which learns from labeled examples, RL learns by trial and error, often with delayed feedback.

RL is inspired by behavioral psychology (operant conditioning). The agent is never told the "right" answer; it discovers optimal behavior through exploration and exploitation.

## Core Concepts

| Concept | Description |
|---------|-------------|
| **Agent** | The learner/decision-maker |
| **Environment** | Everything the agent interacts with |
| **State** | Current situation of the agent |
| **Action** | Choice made by the agent |
| **Reward** | Scalar feedback signal after an action |
| **Policy** | Strategy mapping states to actions |
| **Value Function** | Expected cumulative reward from a state under a given policy |
| **Episode** | One complete sequence from start to terminal state |

## Key Algorithms

| Algorithm | Type | Key Idea |
|-----------|------|----------|
| **Q-Learning** | Value-based | Learn action-value function, pick best action |
| **SARSA** | Value-based | On-policy variant of Q-learning |
| **Policy Gradient** | Policy-based | Directly optimize the policy |
| **Actor-Critic** | Hybrid | Combine value estimation with policy optimization |
| **PPO** | Policy-based | Stable policy updates with clipping (used in [[Reinforcement Learning From Human Feedback (RLHF)]]) |
| **DQN** | Deep RL | Q-learning with [[Deep Learning]] (Atari games, DeepMind 2013) |
| **AlphaGo/AlphaZero** | Deep RL + MCTS | Self-play mastery of Go, chess, shogi |

## Exploration vs Exploitation

The fundamental dilemma: exploit known good actions for immediate reward, or explore unknown actions that might yield better long-term returns.
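
To make the trade-off concrete, here is a minimal sketch of tabular Q-learning with epsilon-greedy action selection. Everything in it is an illustrative assumption rather than part of any library: the five-state corridor environment, the function names (`step`, `epsilon_greedy`, `train`, `greedy_policy`), and the hyperparameter values. With probability `epsilon` the agent explores a random action; otherwise it exploits its current value estimates.

```python
import random

# Hypothetical toy environment: a corridor of states 0..4.
# The agent starts in state 0; reaching state 4 ends the episode.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right
GOAL = N_STATES - 1


def step(state, action):
    """Move one cell; reward is 1.0 only on the transition into the goal."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def epsilon_greedy(q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties randomly so an untrained agent is not biased toward action 0.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; returns the learned action-value table."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Target: reward plus discounted best value of the next state
            # (no bootstrap term once the episode has terminated).
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Best action in each non-terminal state according to the Q-table."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]


if __name__ == "__main__":
    q_table = train()
    print(greedy_policy(q_table))  # prints [1, 1, 1, 1]
```

Setting `epsilon=0.0` removes random exploration entirely, while `epsilon=1.0` makes the agent act at random; values in between trade reward now for information later. In this tiny environment the learned greedy policy is "always move right", but the same loop structure applies to larger state spaces.
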
Common strategies: epsilon-greedy, Upper Confidence Bound (UCB), Thompson sampling.

## Applications

- Game playing: Go, chess, Atari, StarCraft, Dota 2
- Robotics: locomotion, manipulation, sim-to-real transfer
- [[Large Language Models (LLMs)]]: [[Reinforcement Learning From Human Feedback (RLHF)]] aligns models to human preferences
- Autonomous systems: self-driving, drone navigation
- Resource optimization: data center cooling, chip design, traffic control
- Finance: portfolio management, trading strategies

## References

- Sutton & Barto, *Reinforcement Learning: An Introduction* (2018)
- https://en.wikipedia.org/wiki/Reinforcement_learning

## Related

- [[Machine Learning (ML)]]
- [[Deep Learning]]
- [[Supervised Learning (SL)]]
- [[Unsupervised Learning]]
- [[Reinforcement Learning From Human Feedback (RLHF)]]
- [[Neural Networks (NNs)]]
- [[Artificial Intelligence (AI)]]
- [[AI Agents]]