
Fix local batch size #467

Merged
younik merged 2 commits into master from fix-minibatch-size
Feb 5, 2026

Conversation

@younik (Collaborator) commented Feb 4, 2026

One of the workers is dedicated to the replay buffer manager, so we shouldn't include it when computing the local batch size.

Spotted by @chirayuharyan
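
Below is a minimal sketch of the intended computation, assuming one of the world_size ranks is reserved for the replay buffer manager; the DistributedContext class and field names here are illustrative stand-ins, not the repository's actual API.

from dataclasses import dataclass


@dataclass
class DistributedContext:
    # total number of ranks, including the one running the replay buffer manager
    world_size: int

    @property
    def num_training_ranks(self) -> int:
        # one rank is dedicated to the replay buffer, so it never trains
        return self.world_size - 1


def per_node_batch_size(batch_size: int, ctx: DistributedContext) -> int:
    # dividing by world_size would silently shrink the effective batch, because
    # the replay-buffer rank never consumes its share of the trajectories
    return batch_size // ctx.num_training_ranks


ctx = DistributedContext(world_size=5)  # 4 training ranks + 1 replay buffer manager
print(per_node_batch_size(128, ctx))    # 32 per training rank; 128 // 5 would give 25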

@younik requested a review from @josephdviviano on February 4, 2026 at 11:58
Collaborator

Do you think it would also make sense to use per_node_batch_size instead of args.batch_size here?

Collaborator

Yes - it should be the case.

@younik -- the batch size that is actually being trained per rank -- are we sure it's being divided correctly? If not, our results might be due to a bug...
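
As a rough sanity check of the concern above (standalone numbers, not the repository's code): with the old divisor the trajectories actually trained across ranks fall short of args.batch_size, while the new divisor recovers it.

batch_size = 128
world_size = 5                       # includes the replay-buffer rank
num_training_ranks = world_size - 1  # only these ranks consume a batch share

trained_old = (batch_size // world_size) * num_training_ranks          # 25 * 4 = 100
trained_new = (batch_size // num_training_ranks) * num_training_ranks  # 32 * 4 = 128
assert trained_new == batch_size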

Collaborator Author

@chirayuharyan that's very important, thanks for spotting this, I am gonna fix it now

@josephdviviano (Collaborator) left a comment

please see the batch size comment


  n_iterations = ceil(args.n_trajectories / args.batch_size)
- per_node_batch_size = args.batch_size // distributed_context.world_size
+ per_node_batch_size = args.batch_size // distributed_context.num_training_ranks
Collaborator

good catch!


@younik merged commit 913e36a into master on Feb 5, 2026
2 of 4 checks passed
