RLHF
Reinforcement Learning from Human Feedback — aligning models to human preferences.
RLHF fine-tunes a pretrained language model using human preference data. Human annotators rank model outputs for the same prompt; those rankings train a reward model that scores outputs; the language model is then optimized with reinforcement learning (typically PPO) to maximize that reward.
RLHF was a key technique in turning raw next-token predictors into helpful, instruction-following assistants such as ChatGPT.
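
As a rough illustration of the reward-model step described above, the sketch below shows one common way rankings can be turned into a training signal: each ranking is expanded into (preferred, dispreferred) pairs, and the reward model is penalized when it scores the dispreferred output higher. The Bradley-Terry style loss used here is an assumption (the entry does not specify a loss), and the function names pairwise_preference_loss and ranking_to_pairs are hypothetical, chosen only for this example.

```python
import math


def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, dispreferred) pairs.

    Hypothetical helper: assumes ranked_outputs is ordered from most to
    least preferred by the human annotator.
    """
    return [
        (ranked_outputs[i], ranked_outputs[j])
        for i in range(len(ranked_outputs))
        for j in range(i + 1, len(ranked_outputs))
    ]


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Assumed Bradley-Terry style objective: the loss is small when the
    reward model scores the human-preferred output higher.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pairs = ranking_to_pairs(["answer A", "answer B", "answer C"])
    print(pairs)
    # Scalar rewards here are made-up numbers standing in for model scores.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
```

When the two rewards are equal the loss is log 2 (about 0.693), and it shrinks toward zero as the preferred output's reward pulls ahead. In a real pipeline the scalar rewards would come from a neural reward model and the loss would be averaged over many pairs; the PPO stage that follows is not shown here.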