Safety · Reinforcement Learning

Constitutional AI: Harmlessness from AI Feedback

Anthropic · December 15, 2022

Yuntao Bai, Saurav Kadavath, Sandipan Kundu

TL;DR

Trains a harmless assistant using AI feedback guided by a written set of principles, largely replacing human harm labels.

Why it matters

Showed that alignment training can scale by using AI feedback judged against explicit, editable principles. This is the method behind Claude’s training.

Related terms