Group Relative Policy Optimization (GRPO)

A critic-free RL algorithm that estimates the advantage baseline from a group of sampled responses to the same prompt.

GRPO drops PPO’s separate value (critic) network. Instead, it samples a group of completions for each prompt and normalizes each completion’s reward against the group’s mean and standard deviation to compute its advantage. Because no value network has to be trained or held in memory, this reduces the memory and compute needed for RL training relative to PPO.
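
The group normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the sample rewards are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation.

    `rewards` holds one scalar reward per completion sampled for the SAME prompt.
    `eps` (an assumption of this sketch) avoids division by zero when every
    completion in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    # Population standard deviation is an assumption here; a sample
    # standard deviation would be another reasonable choice.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion scores the same carries no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0])
```

Completions that score above the group mean get a positive advantage and those below get a negative one, so the group itself plays the role of the baseline that a critic would otherwise provide. When all rewards in a group are identical, every advantage is zero.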

Introduced in DeepSeekMath (2024), it became a widely used recipe for training reasoning models after its adoption in DeepSeek-R1, Qwen, and many others.

Related papers