Clarification: Paper uses GRPO, but trainer looks like PPO-clip — where is the group-relative part? #15

Description

@lovelunaer

Hi maintainers,

I’m reproducing the GRPO method described in the paper and would like a clarification about the training algorithm used in the repo.

From what I observe, the training update appears to follow a PPO-style clipped surrogate objective with a GAE/value baseline plus entropy and value losses. However, the paper describes GRPO (Group Relative Policy Optimization), which, as I understand it, typically samples a group of K trajectories/actions per prompt/task/instruction and computes a group-relative signal for the policy update (e.g., subtracting the group mean reward/advantage and possibly normalizing by the group std), without a learned value function.
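For concreteness, this is a minimal sketch of the group-relative advantage computation I would expect — my own reading of GRPO, not code from this repo — where `rewards` holds the scalar rewards of the K completions sampled for each prompt:

```python
import torch

def grpo_group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages as I understand GRPO (assumption, not repo code).

    rewards: shape (num_prompts, K) -- scalar reward for each of the K
    completions sampled for the same prompt (the "group").
    Returns advantages of the same shape: each reward minus the group mean,
    normalized by the group std. No learned value function / GAE involved.
    """
    mean = rewards.mean(dim=1, keepdim=True)  # per-group baseline
    std = rewards.std(dim=1, keepdim=True)    # per-group scale
    return (rewards - mean) / (std + eps)
```

If the repo instead relies on GAE with a critic, I don't see where this group-wise baseline would enter the update.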

Could you please clarify the intended behavior in this repo:

  1. Is the current training algorithm meant to be GRPO as defined in the paper, or a PPO-style variant?
  2. If it is GRPO, what is the exact “group-relative” formulation used (grouping key, K samples per task, baseline/normalization, whether a learned value function is used)?
  3. What rollout structure should be used to match the paper (e.g., multiple samples per instruction per update)?
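Regarding question 3, this is the rollout structure I currently assume the paper implies; `policy.generate` and `reward_fn` below are hypothetical placeholders, not identifiers from this codebase:

```python
def collect_grouped_rollouts(policy, reward_fn, instructions, K=8):
    """Sample K completions per instruction so advantages can be computed
    within each group (my assumption about the intended rollout structure)."""
    batch = []
    for prompt in instructions:
        completions = [policy.generate(prompt) for _ in range(K)]
        rewards = [reward_fn(prompt, c) for c in completions]
        batch.append({"prompt": prompt,
                      "completions": completions,
                      "rewards": rewards})
    return batch
```

If the intended structure is instead one sample per instruction per update, I'd appreciate knowing how the group-relative baseline is formed in that case.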

Thanks a lot!
