GLM-4.7 MTP support #792
Codecov Report
All modified and coverable lines are covered by tests.

Coverage diff (main vs. #792):
            main      #792
Coverage    73.73%    73.73%
Files       196       196
Lines       20412     20412
Hits        15050     15050
Misses      5362      5362
examples/llm_ptq/hf_ptq.py (outdated)
    calibration_only = True

    # Load any missing weights from non-standard safetensors (handled in get_model for non-low-memory mode)
    from example_utils import load_mtp_weights_if_needed
Please move this import to the top of the file.
done
        return False

    # Load the index to find all referenced safetensors files
    with open(index_file) as f:
Nit: this could be a single line: index = json.loads(index_path.read_text())
The single-line form is not really “better” here. json.load(f) is the idiomatic call. Note that it does not stream: it calls f.read() internally, so both variants read the whole file into memory before parsing, and the one-liner gains nothing.
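For reference, here is a minimal sketch of the two forms under discussion. It assumes the index follows the usual Hugging Face layout, with a "weight_map" object mapping tensor names to shard filenames. The helper name `referenced_safetensors_files` and the tensor names are made up for illustration and are not taken from this PR.

```python
import json
import tempfile
from pathlib import Path


def referenced_safetensors_files(index_file: Path) -> set[str]:
    """Return the shard filenames named in a safetensors index (hypothetical helper)."""
    # Form used in the PR excerpt: open the file and hand the handle to json.load.
    with open(index_file) as f:
        index = json.load(f)
    # Assumption: Hugging Face style index with a "weight_map" of tensor name -> filename.
    return set(index.get("weight_map", {}).values())


# Demo with a tiny, made-up index written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "model.safetensors.index.json"
    index_path.write_text(json.dumps({
        "metadata": {},
        "weight_map": {
            "model.layers.0.mlp.weight": "model-00001-of-00002.safetensors",
            "model.layers.1.mlp.weight": "model-00002-of-00002.safetensors",
            "model.layers.92.eh_proj.weight": "mtp.safetensors",
        },
    }))
    files = referenced_safetensors_files(index_path)
    # Form suggested in the review: read the text, then parse it.
    one_liner = json.loads(index_path.read_text())
```

Both forms parse to the same object, so the choice between them is a matter of style.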
Signed-off-by: Zhiyu Cheng <[email protected]>
ac6b609 to 39c6195
Walkthrough
Adds PTQ support for GLM-4.7 models by implementing utilities to load MTP layer weights from separate files and automatically exclude these layers from quantization during the PTQ process. Includes documentation updates reflecting the new model support.
Estimated code review effort: 3 (Moderate), about 20 minutes
Pre-merge checks: 2 passed, 1 failed (1 warning)
- Add ``--opset`` option to ONNX quantization CLI to specify the target opset version for the quantized model.
- Add support for context parallelism in Eagle speculative decoding for huggingface and megatron core models.
- Add PTQ support for GLM-4.7, including loading MTP layer weights from a separate ``mtp.safetensors`` file and export as-is.
- Add support for image-text data calibration in PTQ for Nemotron VL models.
This is added for this PR: #755
            break

    # Load the weights
    weights = load_file(str(filepath))
Then maybe pass device="cpu" here?
What does this PR do?
Type of change: ?
Overview: Enable the GLM-4.7 PTQ workflow, including loading the standalone MTP modules and exporting them as-is.
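As an illustration only, the idea of locating standalone weight files and keeping their layers out of quantization could be sketched as below. This is not the PR's implementation: the function names, the shard-name regex, the tensor names, and the layer index 92 are all assumptions made for the example.

```python
import re
from fnmatch import fnmatchcase

# Assumption: standard shards are named model.safetensors or model-00001-of-00002.safetensors.
STANDARD_SHARD = re.compile(r"^model(-\d+-of-\d+)?\.safetensors$")


def find_standalone_weights(weight_map: dict[str, str]) -> dict[str, list[str]]:
    """Group tensor names by file, keeping only files with non-standard names."""
    extra: dict[str, list[str]] = {}
    for tensor_name, filename in weight_map.items():
        if not STANDARD_SHARD.match(filename):
            extra.setdefault(filename, []).append(tensor_name)
    return extra


def layer_exclude_patterns(tensor_names: list[str]) -> list[str]:
    """Turn names like 'model.layers.92.x.weight' into 'model.layers.92.*' wildcards."""
    prefixes = set()
    for name in tensor_names:
        match = re.match(r"^(.*\.layers\.\d+)\.", name)
        if match:
            prefixes.add(match.group(1))
    return sorted(prefix + ".*" for prefix in prefixes)


# Made-up weight map: two regular shards plus one standalone MTP file.
weight_map = {
    "model.layers.0.mlp.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.mlp.weight": "model-00002-of-00002.safetensors",
    "model.layers.92.eh_proj.weight": "mtp.safetensors",
    "model.layers.92.enorm.weight": "mtp.safetensors",
}
extra = find_standalone_weights(weight_map)
patterns = layer_exclude_patterns(extra["mtp.safetensors"])
```

The trailing ".*" in each pattern keeps a wildcard for layer 92 from also matching layer 9 or layer 920. Such wildcards could then be handed to whatever exclusion mechanism the quantization config provides.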
Usage
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Documentation