Reinforcement Learning · Training
Group Sequence Policy Optimization
Alibaba (Qwen Team) · July 24, 2025
Chujie Zheng, Shixuan Liu
TL;DR
An RL algorithm that defines the importance ratio, and applies clipping, at the level of the whole sequence rather than per token. This stabilizes RL training, especially for MoE models, and it powered Qwen3’s RL training.
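To make the sequence-level idea concrete, here is a minimal sketch in plain Python. It rests on assumptions not stated in this summary: the sequence ratio is length-normalized (the geometric mean of the per-token ratios), the clipped term takes the PPO-style `min` form, and advantages are supplied by the caller. The function names and the default `eps` are hypothetical and chosen only for illustration.

```python
import math


def token_ratios(logp_new, logp_old):
    """Per-token importance ratios, shown only for contrast."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    """One ratio for the whole response.

    Assumption: the ratio is length-normalized, i.e. the exponential of
    the mean per-token log-probability difference.
    """
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("need two non-empty lists of equal length")
    mean_diff = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_diff)


def clipped_sequence_objective(ratios, advantages, eps=0.2):
    """Average PPO-style clipped term, one term per sequence.

    `eps` is an illustrative placeholder, not a recommended setting.
    """
    terms = []
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        terms.append(min(s * a, clipped * a))
    return sum(terms) / len(terms)


# Toy example: two sampled responses with made-up log-probabilities.
old = [[-1.0, -2.0, -1.5], [-0.5, -0.7]]
new = [[-0.9, -1.8, -1.6], [-0.9, -1.1]]
ratios = [sequence_ratio(n, o) for n, o in zip(new, old)]
objective = clipped_sequence_objective(ratios, advantages=[1.0, -1.0])
```

Because each response contributes a single ratio, clipping keeps or drops a whole response at once. A per-token scheme would instead clip individual tokens independently, using ratios like those from `token_ratios`.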
Why it matters
A widely adopted successor to GRPO, specifically credited with fixing the MoE RL-training instability that was a recurring pain point of the period.