Conversation
Code Review
This pull request introduces the teacher_model_group argument and refactors the GKDTrainer to use a choose_teacher_model method for selecting teacher models during loss computation. Feedback highlights that the use_mopd flag is missing from the argument definitions, which will result in an AttributeError. Additionally, the choose_teacher_model implementation is currently incomplete and will cause crashes when MOPD is enabled, and the core logic for incorporating teacher weights into the JSD loss calculation is still pending.
```python
    remotely. When this is set, `teacher_model` is not required. Defaults to None.
    """
    teacher_model: Optional[str] = None
    teacher_model_group: List[str] = field(default_factory=list)
```
The `use_mopd` flag is referenced in `GKDTrainer` but is not declared in the argument definitions; it should be added here to avoid an `AttributeError`. It is also recommended to update the `TeacherModelArguments` docstring to describe `teacher_model_group` and `use_mopd`.
```diff
  teacher_model_group: List[str] = field(default_factory=list)
+ use_mopd: bool = False
```
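Putting the suggested field together with the updated docstring the comment asks for, the dataclass might look like the following sketch (the docstring wording and the class shape around the diffed lines are assumptions, not the PR's actual code):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TeacherModelArguments:
    """Arguments for the teacher model(s).

    teacher_model: model id or path of a single teacher. When a remote teacher
        is used, this is not required. Defaults to None.
    teacher_model_group: candidate teacher models to choose from when MOPD is
        enabled. Defaults to an empty list.
    use_mopd: enable Multi-Teacher On-Policy Distillation (MOPD).
        Defaults to False.
    """
    teacher_model: Optional[str] = None
    teacher_model_group: List[str] = field(default_factory=list)
    use_mopd: bool = False
```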
```python
def choose_teacher_model(self):
    if not self.args.use_mopd:
        return self.teacher_model
    # TODO: when MOPD is enabled, select the best model from the teacher model group
```
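As the review notes, the MOPD branch currently falls through and returns `None`, which will crash downstream. One plausible shape for the missing selection step is sketched below; `score_fn` is a hypothetical hook (e.g. a teacher's log-likelihood on the student's on-policy samples), since the PR leaves the actual selection criterion as a TODO:

```python
from typing import Callable, Dict, List

def choose_best_teacher(teacher_names: List[str],
                        score_fn: Callable[[str], float]) -> str:
    """Return the teacher that scores highest on the current batch.

    `score_fn` is an assumed interface, not part of the PR: it maps a
    teacher name to a scalar fitness score for the current on-policy data.
    """
    scores: Dict[str, float] = {name: score_fn(name) for name in teacher_names}
    # Pick the teacher with the maximum score for this batch.
    return max(scores, key=scores.get)
```

In the trainer, the `use_mopd` branch would call something like this per step instead of returning implicitly.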
```python
t_log_probs = F.log_softmax(t_chunk, dim=-1)
del s_chunk, t_chunk
```

```python
# TODO: MOPD loss computation that incorporates per-teacher weights
```
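The pending piece is folding teacher weights into the generalized JSD loss. A minimal, framework-free sketch of the math (not the trainer's tensor code) is shown below, assuming the teachers are combined into a weighted mixture before the usual student/teacher interpolation with `beta`; the weighting scheme itself is an assumption, since the PR leaves it as a TODO:

```python
import math
from typing import List, Sequence

def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence KL(p || q) over discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multi_teacher_jsd(student: Sequence[float],
                      teachers: List[Sequence[float]],
                      weights: List[float],
                      beta: float = 0.5) -> float:
    """Generalized JSD between the student and a weighted teacher mixture.

    `weights` (assumed to sum to 1) is the hypothetical per-teacher
    weighting the TODO refers to; `beta` interpolates between the
    teacher mixture and the student as in standard GKD.
    """
    # Weighted mixture of the teacher distributions.
    mix = [sum(w * t[i] for w, t in zip(weights, teachers))
           for i in range(len(student))]
    # Interpolated reference distribution m = beta * mix + (1 - beta) * student.
    m = [beta * mi + (1 - beta) * si for mi, si in zip(mix, student)]
    return beta * kl(mix, m) + (1 - beta) * kl(student, m)
```

With a single teacher and matching distributions the loss is zero, which is a quick sanity check on the mixture construction.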
PR type
PR information
https://github.com/XiaomiMiMo/MiMo-V2-Flash/blob/main/paper.pdf
小米团队推出的MiMo-V2中提出了一种新的后训练范式Multi-Teacher On-Policy Distillation (MOPD),用于解决能力不平衡的问题(例如:提升了数学,代码能力下降)。