
Npu adapt megatron #153

Open
addsubmuldiv wants to merge 10 commits into modelscope:main from addsubmuldiv:npu_adapt_megatron

Conversation


@addsubmuldiv (Collaborator) commented Apr 13, 2026

PR Type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

Summary

This PR completes Twinkle's NPU Megatron adaptation and targets the Twinkle + Megatron-LM 0.15.3 + MindSpeed 0.15.3 + mcore-bridge stack. The goal is to make the dense / LoRA 8-card training path stable on NPU.

Main changes:

  • Move MindSpeed bootstrap before mcore_bridge is imported to avoid late patching and early binding of TE / Megatron symbols.
  • Build MindSpeed runtime args from the current ModelConfig and the runtime parallel topology, then call repatch() when the runtime signature changes.
  • Fix distributed initialization and metric gathering on NPU:
  • add a default process group (PG) fallback for single-rank local smoke tests
    • reuse Megatron's Gloo DP group for Python object gathering on NPU
  • Fix causal mask handling for NPU FlashAttention:
    • stop feeding Twinkle's 4D dense causal mask directly into the MindSpeed TE flash path
    • let MindSpeed generate the compressed causal mask on the causal NPU path
  • Complete multi-LoRA compatibility for the NPU Megatron path:
    • multi-tenant LoRA training
    • multi-tenant save/export flow
    • optimizer capability selection cleanup

What Changed

1. MindSpeed runtime bootstrap

  • Added an NPU-only runtime bootstrap to ensure MindSpeed patching happens before mcore_bridge import.
  • Unified MindSpeed runtime arg generation into one path so Twinkle and MindSpeed do not read inconsistent runtime state.
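The "repatch when the runtime signature changes" idea can be sketched as follows. This is a minimal illustration, not the actual `_mindspeed_runtime.py` code: the names `RuntimeSignature` and `maybe_repatch`, and the chosen signature fields, are hypothetical stand-ins for whatever state MindSpeed patching really depends on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RuntimeSignature:
    """Hypothetical subset of runtime state that MindSpeed patching depends on."""
    tp: int          # tensor-parallel size
    pp: int          # pipeline-parallel size
    cp: int          # context-parallel size
    use_lora: bool   # whether LoRA adapters are active

_last_signature = None

def maybe_repatch(sig: RuntimeSignature) -> bool:
    """Re-run MindSpeed patching only when the runtime signature changed.

    Returns True if a (re)patch was performed. The real implementation
    would call MindSpeed's repatch() in place of the comment below.
    """
    global _last_signature
    if sig == _last_signature:
        return False  # same topology/config: keep the existing patches
    _last_signature = sig
    # real code would invoke MindSpeed repatch() here
    return True
```

Comparing a frozen dataclass makes the "signature changed" check a plain equality test, so repeated calls with an unchanged topology are cheap no-ops.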

2. Process group / metric gather

  • Fixed default PG initialization for single-rank Megatron smoke tests.
  • Changed NPU gather_object() to prefer Megatron's Gloo DP group to avoid hangs in metrics / Python object gathering.
  • Kept the DP+CP group selection for CP-enabled runs.
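The group-selection rule described above can be condensed into a small decision helper. This is an illustrative sketch only: the string labels stand in for the actual `torch.distributed` process-group handles, and the function name is hypothetical.

```python
def select_gather_group(is_npu: bool, cp_size: int, world_size: int) -> str:
    """Pick the process group used for gather_object / all_gather_object.

    Labels are placeholders for real process-group handles:
      "default" - the default PG (single-rank smoke fallback)
      "dp_cp"   - Megatron's combined DP+CP group (CP-enabled runs)
      "gloo_dp" - Megatron's Gloo DP group (NPU, avoids HCCL hangs on
                  Python object collectives)
      "dp"      - the regular DP group (non-NPU path)
    """
    if world_size == 1:
        return "default"
    if cp_size > 1:
        return "dp_cp"      # CP-enabled runs keep the DP+CP group
    if is_npu:
        return "gloo_dp"    # NPU prefers Gloo for Python object gathering
    return "dp"
```

The key property is that the CP check comes before the NPU check, matching the PR's note that DP+CP selection is preserved for CP-enabled runs.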

3. NPU FlashAttention

  • Fixed causal attention mask handling on NPU.
  • For causal NPU paths, no longer pass Twinkle's 4D dense mask directly, avoiding the MindSpeed TE FlashAttention shape mismatch.
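The mask-handling fix boils down to one decision: on the causal NPU flash path, drop the dense 4D mask and let MindSpeed build its compressed causal mask itself. A minimal sketch, with a hypothetical function name and simplified flags:

```python
from typing import Any, Optional

def mask_for_flash_attention(mask_4d: Any, *, is_npu: bool,
                             is_causal: bool) -> Optional[Any]:
    """Decide which attention mask to hand to the flash-attention path.

    On the causal NPU path, feeding the dense 4D mask into MindSpeed's TE
    FlashAttention causes a shape mismatch, so we return None and let
    MindSpeed synthesize the compressed causal mask instead.
    """
    if is_npu and is_causal:
        return None
    return mask_4d
```

On all other paths (non-causal masks, or non-NPU devices) the original mask passes through unchanged.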

4. LoRA / Multi-LoRA

  • Fixed runtime checks for LoRA finalize so a bare model with ddp_config is not incorrectly treated as a model that can run native finalize.
  • Cleaned up optimizer capability selection for multi-LoRA so it uses the local bf16 optimizer path that fits the model structure.
  • Fixed the multi-LoRA save callback signature so the current tenant adapter is correctly passed through during save.
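The tightened finalize check can be sketched like this. All attribute names below are hypothetical stand-ins for the real Twinkle/Megatron attributes; the point is only that the presence of `ddp_config` alone is no longer sufficient, and an actual finalize hook is also required.

```python
def can_native_finalize(model) -> bool:
    """Return True only if the model both carries a ddp_config and
    actually exposes a callable finalize hook.

    Previously, a bare model that merely had a ddp_config attribute was
    mistakenly treated as finalize-capable.
    """
    has_ddp_config = getattr(model, "ddp_config", None) is not None
    has_finalize = callable(getattr(model, "finalize_model_grads", None))
    return has_ddp_config and has_finalize

class BareModel:
    """Carries a ddp_config but no finalize hook (the buggy case)."""
    ddp_config = object()

class WrappedModel(BareModel):
    """A properly wrapped model that also provides the finalize hook."""
    def finalize_model_grads(self):
        pass
```

With the old check, `BareModel` would have passed; the fix makes the capability test require both conditions.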

5. Documentation

  • Updated the NPU support docs with Megatron backend installation and usage guidance.
  • Added installation notes for Megatron / MindSpeed / mcore-bridge and the matching cookbook smoke entrypoints.

Notes

This PR targets the following version stack:

  • Megatron-LM 0.15.3
  • MindSpeed 0.15.3
  • mcore-bridge
  • Twinkle NPU environment

@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request introduces comprehensive NPU support for the Megatron backend, featuring documentation updates for environment requirements and a new MindSpeed runtime bootstrap for NPU-specific patching and argument synthesis. Key technical changes include refined process group initialization for single-rank environments, optimized attention mask handling for NPU FlashAttention, and the use of Gloo groups for object gathering on NPU to prevent hangs. Review feedback flagged a potential initialization error from invalid arguments to init_process_group, a hard dependency on megatron-core in utility functions, and hardcoded paths in the documentation, and suggested expanding the mask-dropping logic to all causal NPU configurations.

Outdated comment threads:

  • src/twinkle/model/megatron/megatron.py (2 threads)
  • src/twinkle/utils/framework.py
  • docs/source_en/Usage Guide/NPU-Support.md
addsubmuldiv and others added 4 commits April 13, 2026 16:50
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@addsubmuldiv addsubmuldiv marked this pull request as ready for review April 13, 2026 11:42
Copilot AI review requested due to automatic review settings April 13, 2026 11:42

Copilot AI left a comment


Pull request overview

This PR completes Twinkle’s NPU Megatron integration targeting the Megatron-LM 0.15.3 + MindSpeed 0.15.3 + mcore-bridge stack, focusing on stabilizing 8-card dense/LoRA training on NPU by fixing MindSpeed bootstrap timing, distributed/metric collectives, and NPU FlashAttention mask handling.

Changes:

  • Add an NPU MindSpeed bootstrap layer to ensure adaptor patching happens before mcore_bridge imports Megatron/TE, and synthesize/refresh MindSpeed runtime args from ModelConfig.
  • Adjust Megatron initialization for NPU (default PG fallback, Gloo process groups, metrics/object-gather behavior) and fix causal mask handling for NPU FlashAttention.
  • Update NPU documentation and add Megatron NPU smoke cookbooks/scripts.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Per-file summary:

  • src/twinkle/utils/framework.py: Prefer Megatron's Gloo DP group for all_gather_object on NPU to avoid HCCL hangs during metric/object collection.
  • src/twinkle/model/megatron/strategy/megatron.py: NPU-specific Megatron init tweaks (Gloo PG creation, device binding cleanup), MoE sequence-parallel auto-enable, and MindSpeed runtime arg configuration.
  • src/twinkle/model/megatron/multi_lora_megatron.py: Reorder MindSpeed patching ahead of mcore_bridge import for the NPU multi-LoRA Megatron path.
  • src/twinkle/model/megatron/megatron.py: Add a default-PG fallback for single-rank smoke tests, ensure early MindSpeed patching, and drop dense 4D causal masks on the NPU causal TE flash path.
  • src/twinkle/model/megatron/_mindspeed_runtime.py: New module implementing early MindSpeed adaptor patching, runtime args synthesis, and conditional repatching.
  • docs/source_en/Usage Guide/NPU-Support.md: Update NPU dependency guidance, add Megatron backend install steps, and point to the Megatron NPU smoke cookbooks.
  • cookbook/megatron/ascend/tp_npu.py (+ .sh): Add 8-card TP/PP/DP NPU Megatron smoke script.
  • cookbook/megatron/ascend/tp_moe_npu.py (+ .sh): Add 8-card MoE NPU smoke script.
  • cookbook/megatron/ascend/tp_moe_cp_npu.py (+ .sh): Add 8-card MoE+CP NPU smoke script (megatron_cp_algo path).

Comment threads:

  • src/twinkle/model/megatron/megatron.py (2 threads, plus 2 outdated)
  • src/twinkle/model/megatron/strategy/megatron.py (2 threads)
  • src/twinkle/utils/framework.py (1 thread, plus 1 outdated)
3 participants