RLHF

Reinforcement Learning from Human Feedback — aligning models to human preferences.

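RLHF fine-tunes a pretrained language model using human preference data. Human annotators rank candidate outputs for the same prompt; those rankings train a reward model that assigns a scalar score to an output; the language model is then optimized with reinforcement learning (typically PPO) to maximize that score, usually with a KL penalty that keeps it close to the original model.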
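
The reward-model step can be illustrated with a small sketch. It assumes a Bradley-Terry style pairwise loss, which is a common choice but is not specified here, and a KL-style penalty in the optimization step. The function names (`ranking_to_pairs`, `pairwise_preference_loss`, `kl_shaped_reward`) and the default `beta` value are hypothetical, chosen only for illustration, and plain floats stand in for real model outputs.

```python
import math


def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, dispreferred) pairs."""
    return [
        (ranked_outputs[i], ranked_outputs[j])
        for i in range(len(ranked_outputs))
        for j in range(i + 1, len(ranked_outputs))
    ]


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Assumed Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the preferred output
    higher than the dispreferred one, and large when it does the opposite.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward seen by the RL step (assumed form, illustrative beta).

    Subtracts a penalty proportional to how much more likely the tuned
    policy makes the output than the original reference model does.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    pairs = ranking_to_pairs(["best", "middle", "worst"])
    print(pairs)
    print(pairwise_preference_loss(2.0, 0.0))  # reward model agrees with humans
    print(pairwise_preference_loss(0.0, 2.0))  # reward model disagrees
    print(kl_shaped_reward(1.0, logp_policy=-1.0, logp_reference=-2.0, beta=0.5))
```

In a real system the rewards would come from a neural reward model and the log-probabilities from the policy and reference language models; the loss would be averaged over many pairs and minimized by gradient descent.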

It was a key technique in turning raw next-token predictors into helpful, instruction-following assistants such as ChatGPT.

Related papers