
Conversation

@bbkx226 commented Dec 14, 2025

Resolves #9

This pull request adds support for gradient accumulation, enabling training with larger effective batch sizes when GPU memory is limited. The implementation introduces a new configuration option, updates the training logic to account for accumulation, and provides tests and documentation to verify and demonstrate the feature.

The most important changes are:

Gradient Accumulation Feature:

  • Added a new argument gradient_accumulation_steps (default: 1) to the Args class in arguments.py and to the config files, allowing users to specify the number of accumulation steps for training.
  • Updated the training step calculation in run.py so that train_steps now counts optimizer steps (after accumulation) rather than micro-batches, keeping the training duration correct when accumulation is enabled. A sketch of both changes follows this list.
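
A minimal sketch of how these two pieces might fit together, assuming a dataclass-style Args; every field and helper name other than gradient_accumulation_steps is an illustrative assumption, not the actual code in arguments.py or run.py:

```python
from dataclasses import dataclass

@dataclass
class Args:
    per_device_batch_size: int = 8          # assumed existing field
    num_train_samples: int = 10_000         # assumed existing field
    num_epochs: int = 3                     # assumed existing field
    gradient_accumulation_steps: int = 1    # new option; 1 means no accumulation

def compute_train_steps(args: Args) -> int:
    """Return the number of optimizer steps (after accumulation), not micro-batches."""
    micro_batches_per_epoch = args.num_train_samples // args.per_device_batch_size
    optimizer_steps_per_epoch = micro_batches_per_epoch // args.gradient_accumulation_steps
    return optimizer_steps_per_epoch * args.num_epochs
```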

Training Loop Adjustments:

  • Modified the main training loop in utils.py to scale the loss by 1/gradient_accumulation_steps, accumulate gradients, and step the optimizer only after the specified number of micro-batches. Logging, validation, and checkpointing are now triggered only on optimizer steps, as sketched below.
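
A hedged sketch of the accumulation logic described above, assuming a PyTorch-style model and loss; the model, data loader, loss function, and logging interval are placeholders, not the actual code in utils.py:

```python
def train(model, loader, optimizer, loss_fn, gradient_accumulation_steps, log_every=10):
    model.train()
    optimizer.zero_grad()
    optimizer_step = 0
    for micro_step, (inputs, targets) in enumerate(loader, start=1):
        loss = loss_fn(model(inputs), targets)
        # Scale the loss so the accumulated gradient matches one large-batch update.
        (loss / gradient_accumulation_steps).backward()
        if micro_step % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            optimizer_step += 1
            # Logging, validation, and checkpointing key off optimizer steps only.
            if optimizer_step % log_every == 0:
                print(f"optimizer step {optimizer_step}: loss {loss.item():.4f}")
```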

Testing and Validation:

  • Added test_gradient_accumulation.py to verify that the number of optimizer steps matches expectations for a given number of accumulation steps and micro-batches (see the sketch after this list).
  • Added smoke_test_accumulation.py, a synthetic end-to-end test that runs a minimal pipeline with the accumulation logic to confirm correct integration.
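
An illustrative pytest-style check in the spirit of test_gradient_accumulation.py; the counting helper below is an assumption for the example, not the repository's actual test code:

```python
def count_optimizer_steps(num_micro_batches: int, gradient_accumulation_steps: int) -> int:
    # Each optimizer step consumes gradient_accumulation_steps micro-batches;
    # a trailing partial group does not trigger a step in this sketch.
    return num_micro_batches // gradient_accumulation_steps

def test_optimizer_step_count():
    # 12 micro-batches with accumulation of 4 should yield 3 optimizer steps.
    assert count_optimizer_steps(12, 4) == 3
    # With the default of 1, every micro-batch is an optimizer step.
    assert count_optimizer_steps(12, 1) == 12
```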

Documentation:

  • Updated README.md with a new section explaining gradient accumulation, configuration usage, and instructions for running the new tests.
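
A hypothetical usage example in the spirit of the README section, reusing the Args sketch above; only gradient_accumulation_steps is confirmed by this PR, and the other values and the effective-batch-size arithmetic are for illustration:

```python
args = Args(per_device_batch_size=8, gradient_accumulation_steps=4)
# Four micro-batches are accumulated before each optimizer step,
# so the effective batch size is 8 * 4 = 32.
effective_batch_size = args.per_device_batch_size * args.gradient_accumulation_steps
print(effective_batch_size)  # 32
```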

@bbkx226 (Author) commented Dec 14, 2025

#9


Development

Successfully merging this pull request may close these issues.

[Codefuse Open Source Light Training Camp] Support for gradient accumulation
