Reinforcement Learning

Proximal Policy Optimization Algorithms

OpenAI · July 20, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

TL;DR

Introduces PPO, a policy-gradient algorithm that is stable and simple to tune, and that became a widely used default for reinforcement learning.

The RL workhorse behind RLHF: PPO is the optimizer widely used to fine-tune chat models against a reward model trained on human preferences.