Understanding GRPO and GRPO++ in Training Advanced Code Agents
In this post, I shift my focus from Agentic AI security to recent developments in training sophisticated coding agents with reinforcement learning (RL). This is an area of particular interest to me.
Two key algorithms at the forefront of this methodology are Group Relative Policy Optimization (GRPO, used in DeepSeek R1 training) and its enhanced successor, GRPO++. These algorithms have been instrumental in training DeepSWE, a state-of-the-art, open-source coding agent capable of tackling complex software engineering tasks.
What is GRPO?
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed to refine the performance of large language models (LLMs). It is an evolution of the popular Proximal Policy Optimization (PPO) algorithm, offering increased memory efficiency by forgoing the need for a separate "critic" model to estimate future rewards.
The core concept of GRPO involves the model generating multiple potential solutions or "trajectories" for a given problem. These solutions are then evaluated and scored based on a predefined reward function—for instance, in a coding context, whether the generated code passes a set of unit tests. The algorithm then compares the performance of these solutions relative to each other within the group. Based on this comparison, the model's internal policy is adjusted to increase the likelihood of generating higher-scoring solutions in the future. This method of learning from a group of self-generated responses allows for more stable and efficient training.
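To make the group-relative idea concrete, here is a minimal sketch of how the advantage of each sampled solution could be computed. It assumes a reward of 1.0 when the generated code passes the unit tests and 0.0 otherwise; the function name and example rewards are illustrative, not taken from any particular implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO: each trajectory's advantage is its reward, normalized
    by the mean and standard deviation of the group it was sampled in."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + 1e-8)

# Example: 6 sampled solutions for one problem, scored by unit tests (1 = pass).
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # passing solutions receive positive advantage
```

Because the baseline comes from the group itself, no separate critic model is needed to estimate it.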
GRPO++: An Enhanced and Stabilized Algorithm
Building upon the foundation of GRPO, GRPO++ is an amalgamated algorithm that integrates several key innovations to deliver more stable training and superior performance. As detailed in the development of the DeepSWE agent, GRPO++ incorporates the following enhancements:
Clip High (from DAPO): This technique raises the upper clipping bound of the surrogate objective used in GRPO/PPO, permitting larger updates for tokens whose probability is increasing. By doing so, it encourages the model to explore a wider range of potential solutions and helps to stabilize the model's entropy, preventing it from becoming too predictable. (Several of these modifications are sketched in code after this list.)
No KL Loss (from DAPO): The Kullback-Leibler (KL) divergence loss is typically used to ensure that the updated model does not stray too far from the original fine-tuned model. By eliminating this constraint, the LLM is free to move beyond the "trust region" of the initial model, allowing for more significant learning and adaptation.
No Reward Standard Deviation (from Dr.GRPO): Removing the division by the group's reward standard deviation in the advantage calculation helps to better differentiate between easy and difficult problems. This prevents a "difficulty bias" in the loss calculation, ensuring that the model learns effectively from a wide spectrum of challenges.
Length Normalization (from Dr.GRPO): In the original GRPO, there can be a bias towards longer, incorrect responses. By dividing the surrogate loss by the maximum context length, this bias is removed, leading to more concise and accurate outputs.
Leave One Out (from LOOP/RLOO): To reduce the variance of the policy gradient without introducing bias, each trajectory's baseline is computed from the rewards of the other trajectories in its group, leaving its own reward out of the estimate.
Compact Filtering (A new innovation): Inspired by the "overlong filtering" in DAPO, this method masks the loss for trajectories that exceed the maximum context length, time out during generation, or reach a maximum number of steps. This is particularly crucial in multi-turn, agentic scenarios.
No Entropy Loss (A new innovation): The researchers found that entropy loss could lead to instability and eventual training collapse. They observed that as long as the base model's token-level entropy is within a stable range (0.3-1), the entropy loss component is not necessary for effective training.
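The sketch below pulls several of these modifications together: a leave-one-out baseline without standard-deviation division, asymmetric clipping with a higher upper bound, normalization by the maximum context length, and no KL or entropy terms. It is a simplified illustration under my own naming and shapes, not the DeepSWE training code; the clipping values eps_low and eps_high are placeholder assumptions.

```python
import torch

def grpo_pp_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out baseline (RLOO): each trajectory is compared against the
    mean reward of the *other* trajectories in its group, and the reward
    standard deviation is NOT divided out (Dr.GRPO)."""
    n = rewards.numel()
    loo_baseline = (rewards.sum() - rewards) / (n - 1)
    return rewards - loo_baseline

def grpo_pp_loss(logp_new, logp_old, advantages, loss_mask,
                 max_context_len, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate with:
       - asymmetric clipping, i.e. a higher upper bound ("clip high", DAPO),
       - normalization by the maximum context length (Dr.GRPO),
       - no KL penalty and no entropy bonus,
       - a loss_mask that zeroes out environment/user tokens and filtered trajectories.
    Shapes: logp_*, loss_mask are [num_trajectories, seq_len]; advantages is [num_trajectories]."""
    ratio = torch.exp(logp_new - logp_old)
    adv = advantages.unsqueeze(-1)                       # broadcast per token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
    per_token = torch.minimum(unclipped, clipped) * loss_mask
    # Divide by a constant (the maximum context length) rather than each
    # sample's own length, which removes the bias toward long, incorrect responses.
    return -per_token.sum(dim=-1).div(max_context_len).mean()
```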
Training the DeepSWE Agent with GRPO and GRPO++
The training of the DeepSWE agent showcases the practical application and power of these RL algorithms. The process begins by extending the single-step GRPO framework to a multi-turn, or agentic, setting. This is achieved by masking out the tokens of environment observations and user messages in each trajectory, so the model learns only from the actions it generated across a continuous flow of interactions.
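A small sketch of that masking step, assuming a trajectory is stored as a list of (role, token_ids) segments; the role names and token values here are hypothetical.

```python
from typing import List, Tuple
import torch

def build_loss_mask(segments: List[Tuple[str, List[int]]]) -> torch.Tensor:
    """Given a multi-turn trajectory as (role, token_ids) segments, return a
    0/1 mask that keeps only the tokens the policy itself generated.
    Environment observations and user messages are masked out, so the
    multi-turn rollout can be trained with the same single-step objective."""
    mask = []
    for role, token_ids in segments:
        keep = 1.0 if role == "assistant" else 0.0
        mask.extend([keep] * len(token_ids))
    return torch.tensor(mask)

# Example trajectory: issue description, agent action, tool output, agent action.
trajectory = [
    ("user", [101, 102, 103]),         # issue description
    ("assistant", [201, 202]),         # edit command emitted by the agent
    ("environment", [301, 302, 303]),  # test output returned by the sandbox
    ("assistant", [401, 402, 403]),    # patch submission
]
print(build_loss_mask(trajectory))  # tensor([0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 1.])
```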
A significant challenge in training agents for software engineering tasks is the potential for "reward collapse." An agent might accidentally produce a correct patch early on but then continue with random, unproductive actions. If the final correct patch is rewarded, these undesirable intermediate steps are inadvertently reinforced.
To combat this, compact filtering in GRPO++ plays a vital role. By only assigning a reward when the agent deliberately submits its solution, the model is encouraged to be more rigorous in its testing and more confident in its final output. This leads to a more robust and reliable agent. The data shows that with compact filtering, the average response length per step decreases while the average number of environment steps increases, indicating that the agent learns to "think" more efficiently at each step and engage in longer, more reasoned problem-solving sequences.
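One way to picture compact filtering is as a reward-and-mask rule applied per trajectory. The sketch below is my own illustration of that idea; the limits (context length, step budget) and field names are placeholder assumptions rather than DeepSWE's actual settings.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    num_tokens: int
    num_steps: int
    generation_timed_out: bool
    submitted: bool          # the agent explicitly issued a submit action
    patch_passes_tests: bool

def compact_filter_reward(t: Trajectory,
                          max_context_len: int = 65536,   # illustrative limit
                          max_steps: int = 50) -> tuple[float, bool]:
    """Return (reward, mask_out). Trajectories that blow past the context
    window, time out, or exhaust the step budget are masked out of the loss
    entirely; only a deliberate submission that passes the tests is rewarded."""
    if (t.num_tokens >= max_context_len
            or t.generation_timed_out
            or t.num_steps >= max_steps):
        return 0.0, True      # compact filtering: exclude from the loss
    if t.submitted and t.patch_passes_tests:
        return 1.0, False
    return 0.0, False
```

Because a lucky patch followed by aimless actions never reaches a deliberate submission, it earns no reward, which is exactly the behavior the filtering is meant to discourage.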
In essence, the DeepSWE agent is trained from a base model using purely reinforcement learning with the advanced GRPO++ algorithm. This approach, as demonstrated by the agent's impressive performance on the SWE-Bench-Verified benchmark, underscores the potential of scaled RL to produce highly capable and intelligent coding agents.
For my book on Agentic AI, please refer to
https://www.amazon.com/Agentic-AI-Theories-Practices-Progress/dp/3031900251
For more info on DeepSWE: https://huggingface.co/agentica-org/DeepSWE-Preview