Hi maintainers,
I’m reproducing the GRPO method described in the paper and would like a clarification about the training algorithm used in the repo.
From what I can tell, the training update follows a PPO-style clipped surrogate objective with a GAE/value baseline plus entropy and value losses. The paper, however, describes GRPO (Group Relative Policy Optimization), which, as I understand it, samples a group of K trajectories/actions per prompt/task/instruction and computes a group-relative signal for the policy update (e.g., subtracting the group mean reward, optionally normalizing by the group standard deviation) rather than relying on a learned value baseline.
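For reference, this is a minimal sketch of the group-relative advantage computation I have in mind; the tensor layout and names (`rewards`, `group_size`) are my own assumptions, not anything taken from this repo:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages from per-sample rewards.

    rewards: shape (num_prompts, group_size) -- K sampled completions per prompt.
    Returns a tensor of the same shape: each reward minus its group's mean,
    divided by the group's standard deviation (plus eps for stability).
    """
    mean = rewards.mean(dim=1, keepdim=True)   # per-prompt group mean
    std = rewards.std(dim=1, keepdim=True)     # per-prompt group std
    return (rewards - mean) / (std + eps)      # group-normalized advantage
```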
Could you please clarify the intended behavior in this repo:
- Is the current training algorithm meant to be GRPO as defined in the paper, or a PPO-style variant?
- If it is GRPO, what is the exact “group-relative” formulation used (grouping key, K samples per task, baseline/normalization, whether a learned value function is used)?
- What rollout structure should be used to match the paper (e.g., multiple samples per instruction per update)? A rough sketch of what I have in mind is below.
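For concreteness, the rollout structure I would expect for GRPO looks roughly like the following; `sample_fn` and `reward_fn` are placeholders for whatever this repo's sampler and reward model are, not its actual API:

```python
K = 8  # group size: samples per instruction (a guess, not a value from the repo)

def collect_grpo_batch(prompts, sample_fn, reward_fn):
    """Sample K completions per prompt and score them.

    sample_fn(prompt) -> completion   (placeholder for the policy's sampler)
    reward_fn(prompt, completion) -> float
    Returns a list of (prompt, completions, rewards) groups; advantages are
    then computed relative to each group's own rewards (see sketch above).
    """
    batch = []
    for prompt in prompts:
        completions = [sample_fn(prompt) for _ in range(K)]
        rewards = [reward_fn(prompt, c) for c in completions]
        batch.append((prompt, completions, rewards))
    return batch
```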
Thanks a lot