Group Sequence Policy Optimization (GSPO)

A reasoning-RL algorithm that computes importance ratios and clips at the whole-sequence level rather than per token.

GSPO defines the policy importance ratio over an entire response’s likelihood and clips and optimizes at the sequence level, avoiding the unstable token-level gradients GRPO can produce.
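
The sequence-level ratio and clipping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the geometric-mean (length-normalized) form of the ratio and the group-standardized advantage are assumptions taken from the commonly cited GSPO formulation, and the `clip_eps` default is an arbitrary placeholder rather than a recommended setting.

```python
import math

def sequence_ratio(logp_new, logp_old):
    """Sequence-level importance ratio for one response.

    Inputs are per-token log-probabilities under the current and old
    policies. Assumption: the ratio is length-normalized, i.e.
    exp(mean_t(logp_new[t] - logp_old[t])).
    """
    assert logp_new and len(logp_new) == len(logp_old)
    mean_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log_ratio)

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def gspo_objective(logps_new, logps_old, rewards, clip_eps=0.2):
    """Clipped surrogate objective averaged over the group.

    One ratio and one clip per response, rather than one per token.
    clip_eps=0.2 is a placeholder value chosen for illustration only.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        s = sequence_ratio(lp_new, lp_old)
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(s * adv, s_clipped * adv)
    return total / len(rewards)
```

Because every token in a response shares the same ratio, a single outlier token cannot be clipped or up-weighted on its own; the whole response is either inside the clipping range or outside it. In a real training loop the same computation would be written with a tensor library so that gradients flow through the new-policy log-probabilities.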

It is notably more stable when training mixture-of-experts models and powered Qwen3’s RL training. It marks the evolution of GRPO from token-level toward sequence-level optimization.

Related papers