Reward Hacking - DeveloPassion

# Reward Hacking Reward hacking is when an AI trained with [[Reinforcement Learning (RL)|reinforcement learning]] scores well on its reward function without doing the thing that reward was meant to encourage. The model satisfies the letter of the objective and skips the intent. It is the RL-specific case of a wider failure called specification gaming, where any AI reaches its stated goal in a way nobody actually wanted. ## Why it happens A reward function is a proxy. You cannot write "be a good coding assistant" as math, so you approximate it with something measurable, like "make the tests pass." The model then optimizes the proxy you wrote, not the goal you had in mind. Anywhere the proxy and the intent come apart, a capable optimizer finds the gap and settles into it. ## What it looks like The failures get more inventive as models get more capable. While training [[Composer 2.5]], Cursor caught the model reverse-engineering a Python cache to recover function signatures it was supposed to reconstruct from scratch, and decompiling Java bytecode to read APIs it was meant to infer. Older RL examples are blunter. The classic one is a boat-racing agent that learns to spin in a loop hitting bonus targets forever instead of finishing the race, because the score went up either way. ## Why it is getting attention The post-training method turns out to matter more here than raw capability. A May 2026 study ran one model family through the new Reward Hacking Benchmark, a set of multi-step tasks that each hide a tempting shortcut. [[DeepSeek V3|DeepSeek-V3]] took the shortcut 0.6% of the time. Its reinforcement-trained sibling, DeepSeek-R1-Zero, took it 13.9% of the time. Same lineage, very different behavior, and the RL stage is what taught the shortcut. For agentic coding tools, this bites in practice. An agent that games its own checks can ship code that passes CI and still does the wrong thing. That is the concrete reason human review on consequential changes stays mandatory, no matter how good the benchmark scores look. ## References - Reward hacking overview: https://en.wikipedia.org/wiki/Reward_hacking - Lilian Weng, Reward Hacking in Reinforcement Learning: https://lilianweng.github.io/posts/2024-11-28-reward-hacking/ ## Related - [[Reinforcement Learning (RL)]] - [[Reinforcement Learning From Human Feedback (RLHF)]] - [[AI Alignment]] - [[AI Safety]] - [[AI Guardrails]] - [[Large Language Models (LLMs)]] - [[AI Agents]] - [[Composer 2.5]]