Ascend NPU support DeepXTrace #12
Conversation
1. Ascend NPU support for DeepXTrace with MoE dispatch/combine metrics probing.
2. Link to the related Ascend MoE operations pull request and an external case study article.
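To give a concrete sense of what dispatch/combine metrics probing involves, here is a minimal Python sketch of the idea; the class and method names are illustrative assumptions, not the actual DeepXTrace, DeepEP, or MC2 API:

```python
import time
from collections import defaultdict


class MoECommProbe:
    """Illustrative low-overhead probe for MoE communication phases.

    Records wall-clock latency per phase ("dispatch" = tokens routed to
    experts, "combine" = expert outputs gathered back). Hypothetical
    names; not the real DeepXTrace/DeepEP/MC2 interface.
    """

    def __init__(self, rank: int):
        self.rank = rank
        self.samples = defaultdict(list)  # phase name -> list of latencies (us)

    def record(self, phase: str):
        probe = self

        class _Timer:
            def __enter__(self):
                self._t0 = time.perf_counter_ns()

            def __exit__(self, *exc):
                probe.samples[phase].append(
                    (time.perf_counter_ns() - self._t0) / 1e3
                )

        return _Timer()


probe = MoECommProbe(rank=0)
with probe.record("dispatch"):
    pass  # all-to-all: tokens -> experts (DeepEP on GPU, MC2 on NPU)
with probe.record("combine"):
    pass  # all-to-all: expert outputs -> tokens
```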
Reviewer's Guide

Adds documentation updates to describe Ascend NPU (MC2) support and generalizes DeepXTrace from DeepEP-only GPU environments to a cross-platform MoE communication diagnostics tool, while linking to related MoE ops PRs and an external case study article.

Flow diagram for DeepXTrace MoE communication diagnostics:

```mermaid
flowchart TD
    A["Start MoE training job"] --> B["MoE COMM operations on GPU or NPU"]
    B --> C{"Backend type"}
    C -->|GPU| D["DeepEP GPU execution with integrated probe (PR 311)"]
    C -->|NPU| E["MC2 NPU execution with native instrumentation (PR 288)"]
    D --> F["MoE COMM Metrics Probe collects dispatch and combine metrics"]
    E --> F
    F --> G["Persist metrics from all ranks"]
    G --> H["DeepXTrace metrics analysis processes metrics"]
    H --> I["Build latency matrices and diagnostic indicators"]
    I --> J["Identify Comp-Slow, Mixed-Slow, Comm-Slow bottlenecks"]
    J --> K["Report slow ranks and paths across GPU/NPU clusters"]
    K --> L["End"]
```
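To make the analysis step concrete, here is a rough Python sketch of how per-path latency measurements might be turned into Comp-Slow/Mixed-Slow/Comm-Slow classifications. The thresholds and heuristics are assumptions for illustration only; DeepXTrace's actual diagnostic indicators may be defined differently:

```python
import numpy as np


def classify_slow_ranks(latency_us: np.ndarray, factor: float = 2.0) -> dict:
    """Heuristic classifier over an n x n latency matrix, where
    latency_us[src, dst] is the mean dispatch latency observed from
    rank src to rank dst. Illustrative only; not DeepXTrace's real
    indicator definitions."""
    n = latency_us.shape[0]
    slow = latency_us > factor * np.median(latency_us)  # mask of slow paths
    report = {}
    for r in range(n):
        into_r = slow[:, r].mean()    # fraction of slow paths arriving at r
        out_of_r = slow[r, :].mean()  # fraction of slow paths leaving r
        if into_r > 0.8 and out_of_r > 0.8:
            report[r] = "Mixed-Slow"  # slow in both directions
        elif into_r > 0.8:
            report[r] = "Comp-Slow"   # peers wait on r: likely a compute straggler
        elif max(into_r, out_of_r) > 0.2:
            report[r] = "Comm-Slow"   # only some paths slow: likely a link issue
    return report


# Example: rank 2 is uniformly slow to reach, so it gets flagged.
m = np.full((8, 8), 100.0)
m[:, 2] = 500.0
print(classify_slow_ranks(m))  # {2: 'Comp-Slow'}
```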
Summary of Changes

Hello @chaosisnotopen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances DeepXTrace by integrating support for Ascend NPUs, allowing the diagnostic tool to identify performance bottlenecks in a wider array of distributed computing environments. The changes primarily involve updating the project's documentation to reflect this expanded capability, ensuring clarity for users working with both GPU and NPU clusters in Mixture of Experts (MoE) setups.
Code Review
This pull request updates the README.md to reflect the new support for Ascend NPUs in DeepXTrace. The documentation is generalized from being DeepEP/GPU-specific to cover MoE-based environments, including both GPUs and NPUs. Links to the NPU-specific communication library (MC2) and a relevant blog post have been added. The changes are good and improve the documentation's accuracy. I've suggested a couple of minor formatting improvements to enhance readability.
Hey there - I've reviewed your changes - here's some feedback:
- In the updated introduction sentence, add a space before the parenthesis in `libraries(e.g., [DeepEP for GPU]...)` and after the comma before `[MC2 for NPU]` to avoid Markdown rendering/reading issues.
- Consider briefly defining `xPU` the first time it is introduced (e.g., GPU/NPU/other accelerators) so readers unfamiliar with the term immediately understand the scope.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the updated introduction sentence, add a space before the parenthesis in `libraries(e.g., [DeepEP for GPU]...)` and after the comma before `[MC2 for NPU]` to avoid Markdown rendering/reading issues.
- Consider briefly defining `xPU` the first time it is introduced (e.g., GPU/NPU/other accelerators) so readers unfamiliar with the term immediately understand the scope.
## Individual Comments
### Comment 1
<location> `README.md:24` </location>
<code_context>

-## DeepEP-Metrics-Probe
+## MoE-COMM-Metrics-Probe
-A low-overhead module for measuring critical diagnostic indicators during DeepEP communication. See also: [DeepEP Diagnose PR](https://github.com/deepseek-ai/DeepEP/pull/311).
</code_context>
<issue_to_address>
**suggestion (typo):** Consider aligning the section title with the earlier component name for consistency.
Earlier in the README this is called `MoE COMM Metrics Probe`, but this header uses `MoE-COMM-Metrics-Probe`. Please pick one form (hyphenated or spaced) and use it consistently so it’s clear they refer to the same component.
Suggested implementation:
```
## MoE COMM Metrics Probe
```
If there are other mentions of this component elsewhere in the README (e.g., in introductions, diagrams, or bullet lists) using a different variant (`MoE-COMM-Metrics-Probe`, `MoE COMM Metrics-Probe`, etc.), they should also be updated to `MoE COMM Metrics Probe` for full consistency.
</issue_to_address>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
wangfakang left a comment:

LGTM, Thanks.
Summary by Sourcery
Document cross-platform DeepXTrace diagnostics for MoE-based distributed environments, including new NPU support and MoE communication probing.