Direct Preference Optimization
A simpler alternative to RLHF that trains directly on preference data, without a separate reward model.
Direct Preference Optimization (DPO) aligns a model with human preferences using a single, stable supervised-style loss, skipping the separate reward model and reinforcement-learning loop that RLHF requires.
Because it is simpler and more stable to train, DPO and its variants have become a popular default for preference tuning in open models.
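To make the "single supervised-style loss" concrete, here is a minimal sketch of the per-example DPO objective in plain Python. It assumes the standard formulation, in which a frozen reference model supplies baseline log-probabilities and a temperature beta scales the preference margin. The function name dpo_loss, its argument names, and the default beta=0.1 are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch, not a library API).

    Each *_logp argument is the summed log-probability of a full response
    (chosen or rejected) under the trainable policy or the frozen reference
    model. beta is an assumed hyperparameter controlling how far the policy
    may drift from the reference.
    """
    # How much more (in log space) the policy likes each response
    # than the reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin: positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a form that avoids overflow for large |margin|.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one preference pair.
loss_at_reference = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # policy == reference
loss_when_aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy favors chosen
```

When the policy matches the reference, the margin is zero and the loss equals log 2. Raising the chosen response's likelihood relative to the rejected one lowers the loss, so ordinary gradient descent on this quantity is all the training loop needs: no reward model is fitted and no RL rollouts are sampled.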