
[New Feature] MOPD#9035

Open
doctorMcy wants to merge 1 commit into modelscope:main from doctorMcy:feature_MOPD

Conversation

@doctorMcy

PR type

  • [ ] Bug Fix
  • [x] New Feature
  • [ ] Document Updates
  • [ ] More Models or Datasets Support

PR information

https://github.com/XiaomiMiMo/MiMo-V2-Flash/blob/main/paper.pdf
The MiMo-V2 paper from the Xiaomi team proposes a new post-training paradigm, Multi-Teacher On-Policy Distillation (MOPD), to address capability imbalance (e.g., math ability improves while code ability degrades).



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the teacher_model_group argument and refactors the GKDTrainer to use a choose_teacher_model method for selecting teacher models during loss computation. Feedback highlights that the use_mopd flag is missing from the argument definitions, which will result in an AttributeError. Additionally, the choose_teacher_model implementation is currently incomplete and will cause crashes when MOPD is enabled, and the core logic for incorporating teacher weights into the JSD loss calculation is still pending.

remotely. When this is set, `teacher_model` is not required. Defaults to None.
"""
teacher_model: Optional[str] = None
teacher_model_group: List[str] = field(default_factory=list)


high

The use_mopd flag is referenced in GKDTrainer but is not declared in the argument definitions. It should be added here to avoid an AttributeError. Additionally, consider updating the TeacherModelArguments docstring to document teacher_model_group and use_mopd.

Suggested change
teacher_model_group: List[str] = field(default_factory=list)
teacher_model_group: List[str] = field(default_factory=list)
use_mopd: bool = False
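As a minimal sketch of the suggested change, the dataclass might look like the following once the flag is declared and the docstring updated. The docstring wording and the standalone class shown here are illustrative; only the three field names and defaults come from the diff.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TeacherModelArguments:
    """Arguments for configuring teacher models in distillation.

    teacher_model: path or id of the single teacher model. When
        teacher_model_group is set, this is not required. Defaults to None.
    teacher_model_group: list of teacher models used when MOPD is enabled.
        Defaults to an empty list.
    use_mopd: whether to enable Multi-Teacher On-Policy Distillation.
        Defaults to False.
    """
    teacher_model: Optional[str] = None
    teacher_model_group: List[str] = field(default_factory=list)
    use_mopd: bool = False
```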

Comment on lines +447 to +450
def choose_teacher_model(self):
if not self.args.use_mopd:
return self.teacher_model
# TODO: when MOPD is enabled, select the best model from the teacher model group


high

The implementation of choose_teacher_model is incomplete. When use_mopd is enabled, the function currently implicitly returns None, which will cause compute_loss to crash wherever a valid model is expected (e.g., at line 311 or line 419). Additionally, the models in teacher_model_group need to be loaded and prepared as module objects during trainer initialization so they can be used here.
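One way to fill in the missing branch is sketched below. Everything here is an assumption for illustration, not the actual GKDTrainer API: the MopdTeacherSelector class, the teacher_modules list (pre-loaded module objects from teacher_model_group), and the score_fn selection criterion are all hypothetical names.

```python
import random
from typing import Any, Callable, List, Optional


class MopdTeacherSelector:
    """Illustrative teacher selection for MOPD (all names are assumptions).

    teacher_modules: teacher models already loaded as module objects.
    score_fn: optional criterion ranking teachers for the current batch;
        when absent, a random teacher is chosen as a placeholder policy.
    """

    def __init__(self, teacher_modules: List[Any],
                 score_fn: Optional[Callable[[Any], float]] = None):
        self.teacher_modules = teacher_modules
        self.score_fn = score_fn

    def choose_teacher_model(self, default: Any = None,
                             use_mopd: bool = False) -> Any:
        # Without MOPD, fall back to the single configured teacher.
        if not use_mopd:
            return default
        # Fail loudly instead of implicitly returning None.
        if not self.teacher_modules:
            raise ValueError("teacher_model_group is empty but use_mopd=True")
        if self.score_fn is None:
            return random.choice(self.teacher_modules)
        # Pick the teacher scoring best under the given criterion.
        return max(self.teacher_modules, key=self.score_fn)
```

Raising on an empty group addresses the reviewer's point directly: the caller never receives None where compute_loss expects a model.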

t_log_probs = F.log_softmax(t_chunk, dim=-1)
del s_chunk, t_chunk

# TODO: MOPD loss computation incorporating teacher model weights


medium

This TODO indicates that the core MOPD logic (incorporating teacher weights into the JSD loss computation) has not yet been implemented. Without it, the MOPD feature will not work as intended.
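A teacher-weighted generalized JSD could look roughly like this sketch. The function name weighted_jsd_loss, the per-teacher weighting scheme (a normalized convex combination of per-teacher JSD terms), and the mixture coefficient beta are all illustrative assumptions, not the paper's exact formulation.

```python
import math

import torch
import torch.nn.functional as F


def weighted_jsd_loss(student_logits, teacher_logits_list, teacher_weights,
                      beta=0.5):
    """Sketch: generalized JSD per teacher, combined with normalized weights.

    JSD_beta(P_t, P_s) = beta * KL(P_t || M) + (1 - beta) * KL(P_s || M),
    with mixture M = beta * P_t + (1 - beta) * P_s, computed in log space.
    """
    s_log_probs = F.log_softmax(student_logits, dim=-1)
    weights = torch.tensor(teacher_weights, dtype=s_log_probs.dtype)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    loss = s_log_probs.new_zeros(())
    for w, t_logits in zip(weights, teacher_logits_list):
        t_log_probs = F.log_softmax(t_logits, dim=-1)
        # Mixture in log space: log(beta * p_t + (1 - beta) * p_s)
        m_log = torch.logsumexp(
            torch.stack([t_log_probs + math.log(beta),
                         s_log_probs + math.log(1.0 - beta)]), dim=0)
        # F.kl_div(input=log Q, target=log P, log_target=True) = KL(P || Q)
        kl_t = F.kl_div(m_log, t_log_probs, log_target=True,
                        reduction='batchmean')
        kl_s = F.kl_div(m_log, s_log_probs, log_target=True,
                        reduction='batchmean')
        loss = loss + w * (beta * kl_t + (1 - beta) * kl_s)
    return loss
```

When every teacher matches the student distribution, the loss is zero; the chunked log_softmax already present in the trainer (the t_chunk/s_chunk code above) would slot in where the full-tensor softmaxes appear here.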

