This PR enables the selective parameter training strategy (dense warm-up and sparse training stages) for the DeepSeek V3.2 Indexer. It refactors parameter freezing flags and adds tests to verify proper isolation of indexer gradients from the rest of the model.
🔍 General Feedback
- **Memory Optimization in Selective Training:** The current implementation of optimizer masking computes and stores Adam state for the entire model before zeroing out the updates. I've suggested an explicit mapping with `optax.multi_transform` to avoid allocating massive memory blocks for frozen parameter states, which is critical for 671B model scaling.
- **Gradient Isolation in KL Divergence:** I left an inline comment pointing out a gradient leak when calculating the KL divergence in `calculate_indexer_loss`. Ensure `jax.lax.stop_gradient` is applied to the target `attention_probs` distribution, so that the main model's queries and keys do not get updated by the indexer's loss.
Description
Enable the selective parameter training strategy for the DeepSeek V3.2 Indexer - paper

- Added a `trainable_parameters_mask` flag, allowing specific parameters to be targeted for training while freezing the rest of the model.
- Added `TrainableParametersMaskTest` unit tests for validation.
- Added a `sparse_indexer_training` flag to indicate the Dense Warm-up stage or Sparse Training stage for DS v3.2.
- Added a `test_indexer_gradients` unit test to verify proper gradient isolation.
- Renamed `use_sparse_indexer` --> `use_indexer`; `index_head_dim` --> `indexer_head_dim`; `index_n_heads` --> `indexer_n_heads`; and `index_topk` --> `indexer_topk`.

Tests
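A gradient-isolation check in the spirit of the `test_indexer_gradients` test described above can be sketched as follows. The toy model, parameter names, and the `apply_trainable_mask` helper are all illustrative assumptions, not MaxText's actual API.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x):
    # Toy two-branch model: a "backbone" projection feeding an "indexer"
    # scoring head (hypothetical structure for illustration only).
    h = x @ params["backbone"]["w"]
    # Detach the backbone activations before the indexer branch, mirroring
    # the stop_gradient the review asks for in the indexer loss.
    scores = jax.lax.stop_gradient(h) @ params["indexer"]["w"]
    return jnp.sum(scores ** 2)


def apply_trainable_mask(grads, mask):
    """Zero out gradients for frozen subtrees.

    mask is a pytree matching grads, with 1.0 for trainable leaves and
    0.0 for frozen ones (the shape a trainable_parameters_mask flag
    would plausibly produce).
    """
    return jax.tree_util.tree_map(lambda g, m: g * m, grads, mask)


params = {
    "backbone": {"w": jnp.ones((4, 4))},
    "indexer": {"w": jnp.ones((4, 2))},
}
mask = {"backbone": {"w": 0.0}, "indexer": {"w": 1.0}}
x = jnp.ones((3, 4))
grads = apply_trainable_mask(jax.grad(loss_fn)(params, x), mask)
```

A test would then assert that every backbone gradient is exactly zero while the indexer gradients are nonzero, proving that neither the mask nor the stop-gradient leaks updates into the frozen parameters.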
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.