Hi! Thank you for this great project. I have a question about the RL implementation.
The trainer is named GRPOTrainer (in navr1/rl/grpo.py), but the implementation seems to use standard PPO components: it trains a value network (navr1/models/policy.py), computes advantages using GAE, and collects different episodes rather than multiple attempts at the same input.
From my understanding, GRPO's key feature is that it avoids training a value function by computing advantages from group-relative reward comparisons (reward - group_mean over multiple completions of the same input). Could you clarify whether this is a GRPO variant adapted for navigation tasks, or PPO with GRPO-inspired techniques?
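For reference, here is a minimal sketch of what I mean by group-relative advantages (this is illustrative, not code from the repo; the function name and the std-normalization variant are my own):

```python
import numpy as np

def grpo_advantages(group_rewards):
    """GRPO-style advantage: normalize each sample's reward within its
    group of completions for the same input, so no value network is
    needed. Commonly: (reward - group_mean) / (group_std + eps)."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 attempts at the same input, scored by a reward function.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
# Advantages are relative to the group: above-mean rewards become
# positive, below-mean negative, and they sum to ~0.
```

In contrast, GAE requires per-timestep value estimates from a trained critic, which is what I see in navr1/models/policy.py.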
Thanks!