
Adapt Mindspeed/Megatron 0.15.3 #25

Merged
addsubmuldiv merged 4 commits into modelscope:main from addsubmuldiv:adapt_mindspeed_015
Apr 10, 2026

Conversation

@addsubmuldiv
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings April 10, 2026 01:12

Copilot AI left a comment


Pull request overview

This PR updates the mcore bridge to better interoperate with MindSpeed/Megatron 0.15.x Transformer Engine (TE) behavior, especially around LoRA adapter injection and TE debug/compat fields.

Changes:

  • Add MindSpeed version detection and a helper to build “local” TE linear layers with MindSpeed 0.15.x-compatible semantics on NPU.
  • Make LoRA tensor-parallel group resolution robust to MindSpeed TE variants that store the group under parallel_group instead of tp_group.
  • Patch TELinear.__repr__ to tolerate MindSpeed 0.15.x field name changes on NPU (input_size/output_size).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File changes:

  • src/mcore_bridge/tuners/lora.py: adds MindSpeed-aware helpers and updates LoRA adapter layer construction and tp-group resolution for NPU + MindSpeed 0.15.x.
  • src/mcore_bridge/patcher.py: updates the TE __repr__ patch for MindSpeed NPU field compatibility.


Comment on lines 8 to 10
from contextlib import contextmanager, nullcontext
from importlib import metadata
from megatron.core import parallel_state

Copilot AI Apr 10, 2026


from importlib import metadata introduces a module-level name that collides with the existing metadata parameter used later in sharded_state_dict(...). This shadowing is legal but makes the file harder to read and can lead to accidental misuse of the module vs. the parameter. Consider importing with an alias (e.g., importlib_metadata) and updating _get_mindspeed_version() accordingly.

# semantics.
in_features = getattr(self, 'in_features', getattr(self, 'input_size', None))
out_features = getattr(self, 'out_features', getattr(self, 'output_size', None))
use_bias = getattr(self, 'use_bias', getattr(self, 'bias', None) is not None)

Copilot AI Apr 10, 2026


On the NPU path, use_bias = getattr(self, 'use_bias', getattr(self, 'bias', None) is not None) can misreport bias when self.bias is a boolean (e.g., False still makes the expression true because it is not None). To make __repr__ robust across TE/MindSpeed variants, compute use_bias by first reading bias_attr = getattr(self, 'bias', None) and handling the boolean case explicitly (otherwise fall back to bias_attr is not None).

Suggested change
-   use_bias = getattr(self, 'use_bias', getattr(self, 'bias', None) is not None)
+   bias_attr = getattr(self, 'bias', None)
+   if hasattr(self, 'use_bias'):
+       use_bias = self.use_bias
+   elif isinstance(bias_attr, bool):
+       use_bias = bias_attr
+   else:
+       use_bias = bias_attr is not None
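The boolean-bias case Copilot describes can be reproduced with a stand-in object (`FakeLayer` is hypothetical, not a real TE class):

```python
# Repro of the misreport: `bias = False` still makes `is not None` true.
class FakeLayer:
    bias = False  # layer built without a bias on this MindSpeed variant


layer = FakeLayer()

# Original expression from the patch: evaluates to True (wrong).
buggy = getattr(layer, 'use_bias', getattr(layer, 'bias', None) is not None)

# Suggested fix: handles the boolean case explicitly, yields False.
bias_attr = getattr(layer, 'bias', None)
if hasattr(layer, 'use_bias'):
    fixed = layer.use_bias
elif isinstance(bias_attr, bool):
    fixed = bias_attr
else:
    fixed = bias_attr is not None
```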


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces compatibility for MindSpeed 0.15.x on NPU devices by updating the TELinear patched representation and refactoring LoRA layer updates. It adds logic to handle NPU-specific attribute names and introduces helper functions to manage version-dependent linear layer instantiation and tensor-parallel group resolution. Feedback focuses on improving the robustness of tp_size inference in the patched repr and optimizing performance by caching the MindSpeed version lookup.

Comment on lines +607 to +608
parallel_mode = getattr(self, 'parallel_mode', None)
tp_size = 1 if parallel_mode == 'duplicated' else 'unknown'


medium

When tp_size is missing, it is inferred from parallel_mode. However, parallel_mode can be None for local layers (as seen in the GPU path or default initialization), which should also imply a tp_size of 1. Currently, it defaults to 'unknown' in this case.

Suggested change
    parallel_mode = getattr(self, 'parallel_mode', None)
-   tp_size = 1 if parallel_mode == 'duplicated' else 'unknown'
+   tp_size = 1 if parallel_mode in ('duplicated', None) else 'unknown'
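The reviewer's point, sketched with a hypothetical `infer_tp_size` helper and stand-in layer classes:

```python
# Both 'duplicated' and None should imply a tensor-parallel size of 1;
# any other parallel_mode leaves the size unknown.
def infer_tp_size(layer):
    tp_size = getattr(layer, 'tp_size', None)
    if tp_size is not None:
        return tp_size
    parallel_mode = getattr(layer, 'parallel_mode', None)
    return 1 if parallel_mode in ('duplicated', None) else 'unknown'


class LocalLayer:            # e.g. default initialization on the GPU path
    parallel_mode = None


class ShardedLayer:          # hypothetical tensor-parallel layer
    parallel_mode = 'column'
```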

Comment on lines +37 to +43
def _get_mindspeed_version():
    try:
        return version.parse(metadata.version('mindspeed'))
    except metadata.PackageNotFoundError:
        return None
    except Exception:
        return None


medium

The _get_mindspeed_version function is called multiple times during the LoRA update process. Since metadata.version involves filesystem access and parsing, it is inefficient to call it repeatedly. Caching the result would improve performance.

Suggested change
-   def _get_mindspeed_version():
-       try:
-           return version.parse(metadata.version('mindspeed'))
-       except metadata.PackageNotFoundError:
-           return None
-       except Exception:
-           return None
+   _MINDSPEED_VERSION = None
+
+   def _get_mindspeed_version():
+       global _MINDSPEED_VERSION
+       if _MINDSPEED_VERSION is None:
+           try:
+               _MINDSPEED_VERSION = version.parse(metadata.version('mindspeed'))
+           except Exception:
+               _MINDSPEED_VERSION = False  # sentinel: lookup failed, don't retry
+       return _MINDSPEED_VERSION if _MINDSPEED_VERSION is not False else None

(Note: `PackageNotFoundError` is a subclass of `Exception`, so a single `except Exception` clause suffices, and the `False` sentinel must be mapped back to `None` on every call, not just the first.)

@addsubmuldiv addsubmuldiv merged commit 2a0c001 into modelscope:main Apr 10, 2026
4 of 5 checks passed


3 participants