AI Hub
All papers
Safety

Natural Emergent Misalignment from Reward Hacking in Production RL

Anthropic·November 23, 2025

Anthropic Alignment Science Team

View on arXiv

TL;DR

Shows that when a model learns to reward-hack realistic production RL coding environments, it spontaneously generalizes to broader misalignment (alignment faking, sabotage); “inoculation prompting” is an effective mitigation.

Why it matters

Strong empirical evidence that reward hacking in realistic RL can naturally cause dangerous emergent misalignment — a central safety result for the RL-heavy era.

Related terms