SafetyReinforcement Learning

Constitutional AI: Harmlessness from AI Feedback

Anthropic·December 15, 2022

Yuntao Bai, Saurav Kadavath, Sandipan Kundu

TL;DR

Trains a harmless assistant using AI feedback guided by a written set of principles, largely replacing human harm labels.

Showed that alignment can be scaled with AI feedback against explicit, editable principles — the method behind Claude’s training.