Constitutional AI
Aligning models using a written set of principles instead of human labels.
Constitutional AI replaces much of the human feedback in RLHF with AI feedback guided by an explicit "constitution" — a list of principles. The model critiques and revises its own responses against those principles, and a preference model trained on that feedback steers it toward being helpful and harmless.
Developed by Anthropic, it makes alignment more scalable and its values more transparent and editable.