Proximal Policy Optimization Algorithms

OpenAI · July 20, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

TL;DR

Introduces PPO, a family of policy-gradient methods that is simple to implement and tune, stays stable by limiting how far each update moves the policy, and became a default choice for deep reinforcement learning.

Why it matters

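The RL workhorse behind RLHF: PPO is the algorithm commonly used to fine-tune language models against a reward model trained on human preferences, the step that turns those preferences into aligned chat models.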

Related terms