Reinforcement LearningTraining

Group Sequence Policy Optimization

Alibaba (Qwen Team)·July 24, 2025

Chujie Zheng, Shixuan Liu

TL;DR

An RL algorithm that defines the importance ratio and clips at the whole-sequence level (rather than per token), which stabilizes RL — especially for MoE models — and powered Qwen3’s RL training.

Why it matters

A widely adopted successor to GRPO, specifically credited with fixing the MoE RL-training instability that was a recurring pain point of the period.

Related models

Qwen3Alibaba

Related terms

Group Sequence Policy Optimization (GSPO)