Support gradient accumulation #40
Resolves #9
This pull request adds support for gradient accumulation to enable training with larger effective batch sizes, especially useful when GPU memory is limited. The implementation introduces a new configuration option, updates training logic to account for accumulation, and provides tests and documentation to verify and demonstrate the feature.
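To make the effective-batch arithmetic concrete (the numbers here are illustrative, not values from this PR):

```python
# Illustrative numbers only, not values from this PR.
micro_batch_size = 8
gradient_accumulation_steps = 4

# Gradients from 4 consecutive micro-batches are summed before a single
# optimizer step, so the optimizer effectively sees a batch of 32.
effective_batch_size = micro_batch_size * gradient_accumulation_steps
assert effective_batch_size == 32
```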
The most important changes are:
Gradient Accumulation Feature:
- Added `gradient_accumulation_steps` (default: 1) to the `Args` class in `arguments.py` and to the config files, allowing users to specify the number of accumulation steps for training (a configuration sketch follows this list).
- Updated `run.py` so that `train_steps` now represents the number of optimizer steps (after accumulation), not the number of micro-batches, ensuring correct training duration with accumulation.
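The PR description does not show the `Args` class itself; the following is a minimal sketch of what the new option might look like, assuming a dataclass-style `Args` (field names other than `gradient_accumulation_steps` and `train_steps` are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Args:
    # Hypothetical fields; the real arguments.py likely defines many more.
    train_steps: int = 1000               # optimizer steps, not micro-batches
    gradient_accumulation_steps: int = 1  # new option; default 1 = no accumulation

# With accumulation enabled, the training loop consumes
# train_steps * gradient_accumulation_steps micro-batches in total.
args = Args(gradient_accumulation_steps=4)
total_micro_batches = args.train_steps * args.gradient_accumulation_steps
```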
Training Loop Adjustments:
- Updated `utils.py` to scale the loss by `1/gradient_accumulation_steps`, accumulate gradients, and step the optimizer only after the specified number of micro-batches. Logging, validation, and checkpointing are now triggered only on optimizer steps (see the sketch below).
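A minimal sketch of the accumulation pattern described above, assuming a PyTorch-style API and a model whose forward pass returns the loss; the actual `utils.py` code may differ:

```python
def train_loop(model, optimizer, data_loader, args):
    """Sketch only: assumes a PyTorch-style model/optimizer and that
    model(batch) returns a scalar loss tensor."""
    model.train()
    optimizer.zero_grad()
    optimizer_steps = 0
    for micro_step, batch in enumerate(data_loader, start=1):
        loss = model(batch)
        # Scale so the summed gradient equals the mean over the full
        # effective batch, not gradient_accumulation_steps times it.
        (loss / args.gradient_accumulation_steps).backward()
        if micro_step % args.gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1
            # Logging, validation, and checkpointing key off optimizer_steps
            # here, so they fire once per optimizer step, not per micro-batch.
            if optimizer_steps >= args.train_steps:
                break
```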
Testing and Validation:
- Added `test_gradient_accumulation.py` to verify that the number of optimizer steps matches expectations for a given number of accumulation steps and micro-batches (a sketch of this check follows the list).
- Added `smoke_test_accumulation.py`, a synthetic end-to-end test that runs a minimal pipeline with the accumulation logic to ensure correct integration.
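A hypothetical test in the same spirit as `test_gradient_accumulation.py`, checking only the step-counting arithmetic (numbers and function name are illustrative):

```python
def test_optimizer_step_count():
    # Hypothetical numbers: 20 micro-batches with accumulation over 4
    # should yield exactly 5 optimizer steps.
    accumulation_steps = 4
    num_micro_batches = 20

    optimizer_steps = 0
    for micro_step in range(1, num_micro_batches + 1):
        if micro_step % accumulation_steps == 0:
            optimizer_steps += 1

    assert optimizer_steps == num_micro_batches // accumulation_steps
```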
Documentation:
- Updated `README.md` with a new section explaining gradient accumulation, configuration usage, and instructions for running the new tests.