SafetyReinforcement Learning
Constitutional AI: Harmlessness from AI Feedback
Anthropic·December 15, 2022
Yuntao Bai, Saurav Kadavath, Sandipan Kundu
View on arXivTL;DR
Trains a harmless assistant using AI feedback guided by a written set of principles, largely replacing human harm labels.
Why it matters
Showed that alignment can be scaled with AI feedback against explicit, editable principles — the method behind Claude’s training.