Constitutional AI

Aligning models using a written set of principles instead of human labels.

Constitutional AI replaces much of the human feedback in RLHF with AI feedback guided by an explicit "constitution" — a list of principles. The model critiques and revises its own responses against those principles, and a preference model trained on that feedback steers it toward being helpful and harmless.

Developed by Anthropic, it makes alignment more scalable and its values more transparent and editable.

Related papers