Reinforcement Learning · Training
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Stanford · May 29, 2023
Rafael Rafailov, Archit Sharma, Eric Mitchell
TL;DR
Derives a simple supervised loss that optimizes a model directly on preference data, matching RLHF without a separate reward model or RL loop.
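To make the loss concrete, here is a minimal sketch of the per-pair DPO objective, assuming the summed sequence log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses have already been computed under both the trained policy and a frozen reference model. The function name, argument names, and the default beta=0.1 are illustrative assumptions, not taken from the paper's code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sketch of the DPO loss for a single preference pair.

    Inputs are summed log-probabilities of each full response.
    Names and the default beta are assumptions for illustration.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # The loss is -log(sigmoid(margin)): binary classification on
    # which response is preferred, with no explicit reward model.
    margin = chosen_reward - rejected_reward

    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is zero.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen response's likelihood and lowering the
# rejected one's, relative to the reference, reduces the loss.
loss_after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

In practice the loss would be averaged over a batch and minimized with a standard optimizer. The reference model stays fixed, and beta controls how far the policy may drift from it.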
Why it matters
DPO made preference tuning simpler and more stable, and quickly became a default alignment method for open models.