Conversation
Code Review
This pull request introduces comprehensive NPU support for the Megatron backend, featuring documentation updates for environment requirements and a new MindSpeed runtime bootstrap for NPU-specific patching and argument synthesis. Key technical changes include refined process group initialization for single-rank environments, optimized attention mask handling for NPU FlashAttention, and the use of Gloo groups for object gathering on NPU to prevent hangs. Review feedback pointed out a potential initialization error regarding invalid arguments in init_process_group, a hard dependency on megatron-core in utility functions, hardcoded paths in the documentation, and suggested expanding the mask-dropping logic to all causal NPU configurations.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Pull request overview
This PR completes Twinkle’s NPU Megatron integration targeting the Megatron-LM 0.15.3 + MindSpeed 0.15.3 + mcore-bridge stack, focusing on stabilizing 8-card dense/LoRA training on NPU by fixing MindSpeed bootstrap timing, distributed/metric collectives, and NPU FlashAttention mask handling.
Changes:
- Add an NPU MindSpeed bootstrap layer to ensure adaptor patching happens before `mcore_bridge` imports Megatron/TE, and synthesize/refresh MindSpeed runtime args from `ModelConfig`.
- Adjust Megatron initialization for NPU (default PG fallback, Gloo process groups, metrics/object-gather behavior) and fix causal mask handling for NPU FlashAttention.
- Update NPU documentation and add Megatron NPU smoke cookbooks/scripts.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/twinkle/utils/framework.py | Prefer Megatron’s Gloo DP group for all_gather_object on NPU to avoid HCCL hangs during metric/object collection. |
| src/twinkle/model/megatron/strategy/megatron.py | NPU-specific Megatron init tweaks (Gloo PG creation, device binding cleanup), MoE sequence-parallel auto-enable, and MindSpeed runtime arg configuration. |
| src/twinkle/model/megatron/multi_lora_megatron.py | Reorder MindSpeed patching ahead of mcore_bridge import for NPU multi-LoRA Megatron path. |
| src/twinkle/model/megatron/megatron.py | Add default-PG fallback for single-rank smoke, ensure early MindSpeed patching, and drop dense 4D causal masks on NPU causal TE flash path. |
| src/twinkle/model/megatron/_mindspeed_runtime.py | New module implementing early MindSpeed adaptor patching + runtime args synthesis + conditional repatching. |
| docs/source_en/Usage Guide/NPU-Support.md | Update NPU dependency guidance, add Megatron backend install steps, and point to Megatron NPU smoke cookbooks. |
| cookbook/megatron/ascend/tp_npu.py (+ .sh) | Add 8-card TP/PP/DP NPU Megatron smoke script. |
| cookbook/megatron/ascend/tp_moe_npu.py (+ .sh) | Add 8-card MoE NPU smoke script. |
| cookbook/megatron/ascend/tp_moe_cp_npu.py (+ .sh) | Add 8-card MoE+CP NPU smoke script (megatron_cp_algo path). |
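The `framework.py` change in the table above routes object gathers over a Gloo group rather than the device (HCCL) backend. A minimal sketch of that pattern, assuming the gather is done via `torch.distributed.all_gather_object` (the helper names here are illustrative, not Twinkle's API):

```python
# Hedged sketch: gather arbitrary Python objects over a CPU/Gloo group
# so that no NPU/HCCL collective is issued, avoiding hangs during
# metric/object collection.
import torch.distributed as dist

_gloo_group = None

def get_gloo_gather_group():
    """Lazily create a Gloo group spanning all ranks for object gathers."""
    global _gloo_group
    if _gloo_group is None and dist.is_initialized():
        _gloo_group = dist.new_group(backend="gloo")
    return _gloo_group

def all_gather_objects(obj):
    group = get_gloo_gather_group()
    world = dist.get_world_size(group=group)
    out = [None] * world
    # all_gather_object pickles obj and exchanges bytes over the Gloo
    # (CPU) transport, so heterogeneous Python objects are safe to gather.
    dist.all_gather_object(out, obj, group=group)
    return out
```

In the PR itself the group is Megatron's existing Gloo DP group rather than a freshly created one; the key point is that the object collective never touches the device backend.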
PR Type
Summary
This PR completes Twinkle's NPU Megatron adaptation and targets the Twinkle + Megatron-LM 0.15.3 + MindSpeed 0.15.3 + mcore-bridge stack. The goal is to make the dense / LoRA 8-card training path stable on NPU.
Main changes:
- Ensure MindSpeed adaptor patching runs before `mcore_bridge` is imported, avoiding late patching and early binding of TE / Megatron symbols.
- Synthesize MindSpeed runtime args from `ModelConfig` and the runtime parallel topology, then call `repatch()` when the runtime signature changes.
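The "repatch when the runtime signature changes" idea can be sketched as caching a tuple of the MindSpeed-relevant parallel settings and re-running patching only on change. This is an illustrative sketch; `maybe_repatch`, the field names, and the `repatch` hook are assumptions, not the PR's exact implementation.

```python
# Illustrative sketch: derive a signature from the runtime parallel
# topology and repatch MindSpeed only when that signature changes.
_last_signature = None

def maybe_repatch(args):
    """Repatch when the MindSpeed-relevant runtime args change.

    Returns True if a repatch was triggered, False if the cached
    signature was still current.
    """
    global _last_signature
    signature = (
        args.get("tensor_model_parallel_size"),
        args.get("pipeline_model_parallel_size"),
        args.get("context_parallel_size"),
        args.get("sequence_parallel"),
    )
    if signature != _last_signature:
        repatch(args)  # hypothetical stand-in for MindSpeed's conditional repatching
        _last_signature = signature
        return True
    return False

def repatch(args):
    # Placeholder: re-apply topology-dependent MindSpeed patches.
    pass
```

This keeps repeated calls cheap: the common case (topology unchanged between steps) reduces to one tuple comparison.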
1. MindSpeed runtime bootstrap
- Apply MindSpeed adaptor patching before the `mcore_bridge` import.
2. Process group / metric gather
- Switch `gather_object()` to prefer Megatron's Gloo DP group, avoiding hangs in metrics / Python object gathering.
3. NPU FlashAttention
- Drop dense 4D causal masks on the NPU causal TE flash path.
4. LoRA / Multi-LoRA
- Ensure `ddp_config` is not incorrectly treated as a model that can run native finalize.
5. Documentation
- Update NPU dependency guidance and add Megatron NPU smoke cookbooks/scripts.
Notes
This PR targets the following version stack:
- Megatron-LM 0.15.3
- MindSpeed 0.15.3
- mcore-bridge