Conversation
Do you think it would also make sense to use per_node_batch_size instead of args.batch_size here?
yes - it should be the case.
@younik -- regarding the batch size actually being trained per rank -- are we sure it's being divided correctly? If not, our results might be due to a bug...
@chirayuharyan that's very important, thanks for spotting this, I am gonna fix it now
josephdviviano left a comment:
please see the batch size comment
```diff
 n_iterations = ceil(args.n_trajectories / args.batch_size)
-per_node_batch_size = args.batch_size // distributed_context.world_size
+per_node_batch_size = args.batch_size // distributed_context.num_training_ranks
```
One of the workers is dedicated to the replay buffer manager, so we shouldn't include it when computing the local batch size.
Spotted by @chirayuharyan
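
For context, a minimal sketch of the intended behaviour, assuming `distributed_context` exposes a `world_size` that counts every rank and that exactly one rank runs the replay buffer manager; the `DistributedContext` class and `num_training_ranks` property below are illustrative stand-ins, not the repository's actual API:

```python
from dataclasses import dataclass


@dataclass
class DistributedContext:
    """Illustrative stand-in for the run's distributed context (names are assumptions)."""
    world_size: int  # total ranks, including the replay buffer manager

    @property
    def num_training_ranks(self) -> int:
        # One rank is dedicated to the replay buffer manager, so it is
        # excluded when splitting the global batch across training ranks.
        return self.world_size - 1


def per_node_batch_size(batch_size: int, ctx: DistributedContext) -> int:
    # Divide the global batch size only among the ranks that actually train.
    return batch_size // ctx.num_training_ranks


# Example: world_size = 5 (4 trainers + 1 replay buffer manager), batch_size = 64.
ctx = DistributedContext(world_size=5)
assert per_node_batch_size(64, ctx) == 16  # 64 // 4, not 64 // 5 == 12
```

Dividing by `world_size` instead would give each training rank a smaller share (12 instead of 16 in the example above), so the effective global batch would silently shrink below `args.batch_size`, which is the kind of discrepancy that could skew the results.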