Reward Model

A model trained to score outputs by human preference, used to guide reinforcement learning during alignment.

In RLHF, humans rank model outputs; a reward model learns to predict those preferences, then provides the reward signal that fine-tunes the policy. The quality of the reward model strongly shapes the final model’s behavior — and reward hacking is a known failure mode.