Hi maintainers,
I’m reproducing the GRPO method described in the paper and would like a clarification about the training algorithm used in the repo.
From what I can tell, the training update follows a PPO-style clipped surrogate objective with a GAE/value baseline plus entropy and value losses. The paper, however, describes GRPO (Group Relative Policy Optimization), which, as I understand it, samples a group of K trajectories/actions per prompt/task/instruction and computes a group-relative signal for the policy update (e.g., subtracting the group mean reward, optionally normalizing by the group standard deviation) rather than relying on a learned value baseline.
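For reference, this is a minimal sketch of the group-relative advantage computation I have in mind; the tensor layout and names (`rewards`, `group_size`) are my own assumptions, not anything taken from this repo:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages from per-sample rewards.

    rewards: shape (num_prompts, group_size) -- K sampled completions per prompt.
    Returns a tensor of the same shape: each reward minus its group's mean,
    divided by the group's standard deviation (plus eps for stability).
    """
    mean = rewards.mean(dim=1, keepdim=True)   # per-prompt group mean
    std = rewards.std(dim=1, keepdim=True)     # per-prompt group std
    return (rewards - mean) / (std + eps)      # group-normalized advantage
```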
Could you please clarify the intended behavior in this repo:
- Is the current training algorithm meant to be GRPO as defined in the paper, or a PPO-style variant?
- If it is GRPO, what is the exact “group-relative” formulation used (grouping key, K samples per task, baseline/normalization, whether a learned value function is used)?
- What rollout structure should be used to match the paper (e.g., multiple samples per instruction per update)? A rough sketch of what I have in mind is below.
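For concreteness, the rollout structure I would expect for GRPO looks roughly like the following; `sample_fn` and `reward_fn` are placeholders for whatever this repo's sampler and reward model are, not its actual API:

```python
K = 8  # group size: samples per instruction (a guess, not a value from the repo)

def collect_grpo_batch(prompts, sample_fn, reward_fn):
    """Sample K completions per prompt and score them.

    sample_fn(prompt) -> completion   (placeholder for the policy's sampler)
    reward_fn(prompt, completion) -> float
    Returns a list of (prompt, completions, rewards) groups; advantages are
    then computed relative to each group's own rewards (see sketch above).
    """
    batch = []
    for prompt in prompts:
        completions = [sample_fn(prompt) for _ in range(K)]
        rewards = [reward_fn(prompt, c) for c in completions]
        batch.append((prompt, completions, rewards))
    return batch
```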
Thanks a lot