Agentic Reinforcement Learning

Agentic reinforcement learning trains LLMs with RL in settings where the model takes multi-step actions in an environment (tools, code execution, web search) and is rewarded on task outcomes.

Rather than rewarding single answers, agentic RL optimizes long-horizon trajectories of tool calls, code runs, and searches against outcome-based rewards, often using tasks produced by large-scale synthetic agentic-task pipelines.
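To make the trajectory-plus-outcome-reward idea concrete, here is a minimal Python sketch of a single rollout. Everything in it is hypothetical and chosen for illustration: the `calculator` tool, the `scripted_policy` (a stand-in for an LLM), the exact-match outcome check, and the names `Step`, `Trajectory`, `rollout`, and `per_step_returns`. It shows only how one reward at the end of a multi-step episode is attached to every action in the trajectory; it is not the training recipe of any particular model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    action: str       # tool name chosen by the policy, or "answer"
    argument: str     # tool input, or the final answer text
    observation: str  # tool output fed back to the policy ("" for the answer)


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0  # outcome reward, set only when the episode ends


def calculator(expr: str) -> str:
    """Hypothetical toy tool: evaluates 'a + b' or 'a * b' on integers."""
    left, op, right = expr.split()
    a, b = int(left), int(right)
    return str(a + b if op == "+" else a * b)


TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator}

Policy = Callable[[str, List[Step]], Tuple[str, str]]


def scripted_policy(task: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for an LLM policy: solves '(3 + 4) * 5' in two tool calls."""
    if len(history) == 0:
        return "calculator", "3 + 4"
    if len(history) == 1:
        return "calculator", f"{history[-1].observation} * 5"
    return "answer", history[-1].observation


def rollout(task: str, policy: Policy, tools: Dict[str, Callable[[str], str]],
            expected: str, max_steps: int = 8) -> Trajectory:
    """Run one episode; the reward depends only on the final answer."""
    traj = Trajectory()
    for _ in range(max_steps):
        action, argument = policy(task, traj.steps)
        if action == "answer":
            traj.steps.append(Step(action, argument, ""))
            # Outcome-based reward: 1 if the final answer matches, else 0.
            traj.reward = 1.0 if argument.strip() == expected else 0.0
            break
        tool = tools.get(action)
        observation = tool(argument) if tool else f"error: unknown tool {action}"
        traj.steps.append(Step(action, argument, observation))
    return traj


def per_step_returns(traj: Trajectory) -> List[float]:
    """With a terminal-only reward and no discounting, every step's
    return equals the final outcome reward."""
    return [traj.reward] * len(traj.steps)


traj = rollout("What is (3 + 4) * 5?", scripted_policy, TOOLS, expected="35")
returns = per_step_returns(traj)
```

In an actual training setup, the per-step returns (or advantages derived from them) would weight a policy-gradient update over the tokens the model produced at each step. The sketch stops before that point because the choice of optimizer is not specified here.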

It powered the agentic post-training of models like Kimi K2 and GLM-4.5 and underpins “deep research” agents; multiple 2025 surveys consolidated it as a distinct subfield.

Related papers