Direct Preference Optimization

A simpler alternative to RLHF that trains directly on preference data, without a separate reward model.

Direct Preference Optimization (DPO) aligns a model to human preferences with a single supervised-style loss computed on pairs of preferred and rejected responses, skipping the separate reward model and reinforcement-learning loop that RLHF requires.
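
To make the loss described above concrete, here is a minimal sketch of the per-pair DPO objective in plain Python. It assumes you already have the summed log-probabilities of the preferred ("chosen") and rejected responses under both the model being trained and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (hypothetical helper).

    Inputs are scalars: log p(response | prompt) summed over tokens.
    beta scales how strongly the policy is pushed away from the reference.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a form that avoids overflow for large |margin|.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one, measured against the reference. In practice this would be computed on batched tensors in an autodiff framework; the scalar version here only illustrates the arithmetic.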

Because it is simpler and more stable to train, DPO and its variants have become a popular default for preference tuning in open models.

Related papers