Reinforcement Learning
Proximal Policy Optimization Algorithms
OpenAI·July 20, 2017
John Schulman, Filip Wolski, Prafulla Dhariwal
TL;DR
Introduces PPO, a family of policy-gradient methods that is stable and simple to tune, and that became a default choice for many reinforcement learning applications.
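To make the idea concrete, here is a minimal sketch of the clipped surrogate objective that PPO is best known for. It is an illustration rather than the authors' implementation: the function name `ppo_clip_objective`, its argument names, and the plain-Python lists are assumptions made for this example, and `clip_eps=0.2` is just a commonly used setting.

```python
import math

def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    All names here are hypothetical and chosen for illustration only.
    new_logps / old_logps: log-probabilities of the taken actions under the
    current policy and the policy that collected the data.
    advantages: advantage estimates for those actions.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Two samples: the ratio is 1.5 with a positive advantage,
# and 0.5 with a negative advantage.
value = ppo_clip_objective(
    new_logps=[math.log(1.5), math.log(0.5)],
    old_logps=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
print(value)
```

In the first sample the gain is capped at the clipped ratio of 1.2. In the second, the pessimistic minimum keeps the larger penalty of -0.8. Both effects remove the incentive to move the policy far from the one that gathered the data, which is the intuition behind PPO's stability.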
Why it matters
PPO is the RL workhorse behind RLHF: it is the optimizer widely used to fine-tune chat models against reward models learned from human preferences.