Resolves #9
This pull request adds support for gradient accumulation, enabling training with larger effective batch sizes when GPU memory is limited. The implementation introduces a new configuration option, updates the training logic to account for accumulation, and provides tests and documentation to verify and demonstrate the feature.
The most important changes are:
Gradient Accumulation Feature:

- Added `gradient_accumulation_steps` (default: 1) to the `Args` class in `arguments.py` and to the config files, allowing users to specify the number of accumulation steps for training (see the sketch after this section). [1] [2]
- Updated `run.py` so that `train_steps` now represents the number of optimizer steps (after accumulation), not the number of micro-batches, ensuring correct training duration with accumulation. [1] [2]
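For reviewers, here is a minimal sketch of the new option. Only `gradient_accumulation_steps` is from this PR; the other fields and the dataclass layout are illustrative, not the actual contents of `arguments.py`:

```python
from dataclasses import dataclass

@dataclass
class Args:
    # Illustrative fields; the real Args class may define these differently.
    train_steps: int = 1000      # optimizer steps, counted after accumulation
    batch_size: int = 8          # per-device micro-batch size

    # New in this PR: number of micro-batches whose gradients are accumulated
    # before each optimizer step (1 = no accumulation, i.e. previous behaviour).
    gradient_accumulation_steps: int = 1
```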
Training Loop Adjustments:

- Updated `utils.py` to scale the loss by `1 / gradient_accumulation_steps`, accumulate gradients across micro-batches, and step the optimizer only after the specified number of micro-batches. Logging, validation, and checkpointing are now triggered only on optimizer steps (a sketch of the loop follows below). [1] [2] [3]
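The accumulation logic is roughly the following, shown as a hedged PyTorch-style sketch rather than the exact code in `utils.py` (function and variable names here are assumptions):

```python
import torch
import torch.nn.functional as F

def train(model, optimizer, data_loader, args):
    """Sketch of a training loop with gradient accumulation."""
    model.train()
    optimizer.zero_grad()
    optimizer_steps = 0

    for micro_batch_idx, (inputs, targets) in enumerate(data_loader, start=1):
        loss = F.cross_entropy(model(inputs), targets)
        # Scale the loss so the accumulated gradient equals the average over the
        # whole effective batch, not the sum of per-micro-batch averages.
        (loss / args.gradient_accumulation_steps).backward()

        # Step the optimizer only once enough micro-batches have been accumulated.
        if micro_batch_idx % args.gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1

            # Logging, validation, and checkpointing fire only here,
            # i.e. once per optimizer step rather than once per micro-batch.
            if optimizer_steps >= args.train_steps:
                break
```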
Testing and Validation:

- Added `test_gradient_accumulation.py` to verify that the number of optimizer steps matches expectations for a given number of accumulation steps and micro-batches (see the example below).
- Added `smoke_test_accumulation.py`, a synthetic end-to-end test that runs a minimal pipeline with the accumulation logic to ensure correct integration.
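As an illustration of the kind of step-count check involved (the helper and test names here are hypothetical, not the actual contents of `test_gradient_accumulation.py`):

```python
def count_optimizer_steps(num_micro_batches: int, accumulation_steps: int) -> int:
    """Count how many times the optimizer would step for a given schedule."""
    steps = 0
    for micro_batch_idx in range(1, num_micro_batches + 1):
        if micro_batch_idx % accumulation_steps == 0:
            steps += 1  # optimizer.step() would fire here
    return steps

def test_optimizer_step_count():
    # 8 micro-batches with 4 accumulation steps -> 2 optimizer steps.
    assert count_optimizer_steps(8, 4) == 2
    # With the default of 1, every micro-batch is an optimizer step.
    assert count_optimizer_steps(8, 1) == 8
```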
Documentation:

- Updated `README.md` with a new section explaining gradient accumulation, how to configure it, and how to run the new tests.