
Conversation

@chaosisnotopen (Contributor) commented Dec 13, 2025

1. Ascend NPU support for DeepXTrace, with MoE dispatch/combine metrics probing.
2. Links to the related Ascend MoE operations pull request and an external case study article.

Summary by Sourcery

Document cross-platform DeepXTrace diagnostics for MoE-based distributed environments, including new NPU support and MoE communication probing.

Documentation:

  • Describe DeepXTrace as a MoE-focused diagnostic tool supporting both GPU and NPU communication libraries.
  • Document the generic MoE communication metrics probe and list DeepEP (GPU) and MC2 (NPU) as supported implementations with links to related PRs and an external blog. A sketch of the probing pattern follows this list.
  • Clarify that the metrics analysis module is cross-platform and applicable to GPU/NPU clusters.
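
As an illustration of the probe bullet above, here is a minimal sketch, assuming a timer-based wrapper around a backend's dispatch/combine entry points. The backend handle and its .dispatch/.combine method names are hypothetical stand-ins, not the real DeepEP or MC2 interfaces.

```python
import time
from collections import defaultdict

class MoECommMetricsProbe:
    """Hypothetical low-overhead probe: times a backend's MoE dispatch/combine
    calls and keeps per-operation latency records for later analysis."""

    def __init__(self, backend):
        # `backend` stands in for a DeepEP (GPU) or MC2 (NPU) handle;
        # the .dispatch/.combine method names are illustrative only.
        self.backend = backend
        self.records = defaultdict(list)

    def _timed(self, name, fn, *args, **kwargs):
        start = time.perf_counter()
        out = fn(*args, **kwargs)
        self.records[name].append(time.perf_counter() - start)
        return out

    def dispatch(self, *args, **kwargs):
        return self._timed("dispatch", self.backend.dispatch, *args, **kwargs)

    def combine(self, *args, **kwargs):
        return self._timed("combine", self.backend.combine, *args, **kwargs)
```

In practice, dispatch/combine kernels run asynchronously on the device, so a real probe would time device-side events (e.g., CUDA events on GPU, or the NPU equivalent) rather than host wall-clock as this sketch does.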

@sourcery-ai bot (Contributor) commented Dec 13, 2025

Reviewer's Guide

Adds documentation updates to describe Ascend NPU (MC2) support and generalizes DeepXTrace from DeepEP-only GPU environments to a cross-platform MoE communication diagnostics tool, while linking to related MoE ops PRs and an external case study article.

Flow diagram for DeepXTrace MoE communication diagnostics

```mermaid
flowchart TD
    A[Start_MoE_training_job] --> B[MoE_COMM_operations_on_GPU_or_NPU]
    B --> C{Backend_type}
    C -->|GPU| D[DeepEP_GPU_execution_with_integrated_probe_PR_311]
    C -->|NPU| E[MC2_NPU_execution_with_native_instrumentation_PR_288]

    D --> F[MoE_COMM_Metrics_Probe_collects_dispatch_and_combine_metrics]
    E --> F

    F --> G[Persist_metrics_from_all_ranks]
    G --> H[DeepXTrace_Metrics_Analysis_processes_metrics]
    H --> I[Build_latency_matrices_and_diagnostic_indicators]
    I --> J[Identify_Comp_Slow_Mixed_Slow_Comm_Slow_bottlenecks]
    J --> K[Report_slow_ranks_and_paths_across_GPU_NPU_clusters]
    K --> L[End]
```
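
To make the analysis stage of this flow concrete, here is a minimal, hypothetical Python sketch of how per-rank dispatch latencies and compute times could be turned into the Comp-Slow / Comm-Slow / Mixed-Slow classification shown above. All function and variable names are illustrative assumptions, not DeepXTrace's actual API, and the real diagnostic indicators are likely more elaborate than a simple median threshold.

```python
import numpy as np

def classify_ranks(dispatch_us: np.ndarray, compute_us: np.ndarray,
                   comm_thresh: float = 2.0, comp_thresh: float = 2.0):
    """Flag slow ranks by comparing each rank against the cluster median.

    dispatch_us[src][dst] is the dispatch latency (us) from rank src to
    rank dst; compute_us[r] is rank r's per-layer compute time (us).
    """
    n = dispatch_us.shape[0]
    # Column j aggregates latency seen when sending *to* rank j: a slow
    # receiver inflates its whole column, which is the signature of a
    # communication (or destination-side) bottleneck.
    col_lat = dispatch_us.mean(axis=0)
    comm_slow = col_lat > comm_thresh * np.median(col_lat)
    comp_slow = compute_us > comp_thresh * np.median(compute_us)
    report = {}
    for r in range(n):
        if comm_slow[r] and comp_slow[r]:
            report[r] = "Mixed-Slow"
        elif comm_slow[r]:
            report[r] = "Comm-Slow"
        elif comp_slow[r]:
            report[r] = "Comp-Slow"
    return report

# Example: rank 2 has slow compute, rank 5 is a slow communication target.
rng = np.random.default_rng(0)
lat = rng.uniform(80, 120, size=(8, 8))
lat[:, 5] *= 4.0
comp = rng.uniform(900, 1100, size=8)
comp[2] *= 3.0
print(classify_ranks(lat, comp))  # expected: {2: 'Comp-Slow', 5: 'Comm-Slow'}
```

The column-mean heuristic mirrors the intuition in the flowchart: a consistently slow communication target inflates every sender's latency toward it, so its column in the latency matrix stands out regardless of whether the ranks run on GPUs or NPUs.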

File-Level Changes

Change 1 (README.md): Generalize the DeepXTrace description from DeepEP-only GPU environments to MoE-based, cross-backend (GPU/NPU) communication diagnostics.
  • Update the project overview to describe DeepXTrace as a MoE-focused diagnostic tool that instruments multiple communication libraries instead of a DeepEP-only system tool.
  • Clarify that supported backends include DeepEP for GPU and MC2 for NPU, with appropriate links.
  • Broaden terminology from GPU/CPU to xPU and from DeepEP-specific wording to MoE communication in general.

Change 2 (README.md): Document the MoE communication metrics probe implementations and supporting resources for GPU and Ascend NPU backends.
  • Rename the probe section from DeepEP-Metrics-Probe to MoE-COMM-Metrics-Probe to reflect multi-backend MoE communication support.
  • Describe DeepEP (GPU) integration via the existing Diagnose PR #311 link.
  • Add MC2 (NPU) support details, including a link to MC2 Diagnose PR #288 and an external Ascend and DeepXTrace blog article.
  • Update the analysis module description to emphasize cross-platform analysis across GPU/NPU clusters.

Possibly linked issues

  • #N/A: The PR adds Ascend/MC2 NPU instrumentation and analysis support, directly implementing the requested NPU adaptation.

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.


@gemini-code-assist bot (Contributor) commented
Summary of Changes

Hello @chaosisnotopen, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances DeepXTrace by integrating support for Ascend NPUs, allowing the diagnostic tool to identify performance bottlenecks in a wider array of distributed computing environments. The changes primarily involve updating the project's documentation to reflect this expanded capability, ensuring clarity for users working with both GPU and NPU clusters in Mixture of Experts (MoE) setups.

Highlights

  • Ascend NPU Support: DeepXTrace now officially supports Ascend NPUs for diagnosing slow ranks in MoE-based distributed environments, expanding its hardware compatibility.
  • Generalized Terminology: The documentation has been updated to use more general terms like 'MoE-based distributed environments' and 'xPU compute latency' to reflect broader applicability beyond just DeepEP and GPUs/CPUs.
  • Updated Metrics Probe: The 'DeepEP-Metrics-Probe' has been renamed to 'MoE-COMM-Metrics-Probe' and its description now explicitly details support for both DeepEP (GPU) and MC2 (NPU) communication libraries, including links to relevant PRs and a blog post.
  • Cross-Platform Analysis: The DeepXTrace-Metrics-Analysis module is now explicitly described as a 'cross-platform analysis module' capable of identifying bottlenecks across both GPU and NPU clusters. A sketch of a backend-neutral record schema follows this list.
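
One way to read the cross-platform highlight above: the analysis module presumably consumes a backend-neutral metrics schema so that DeepEP and MC2 samples share one analysis path. The sketch below is a guess at what such a schema could look like; the field names and the normalize helper are invented for illustration and are not DeepXTrace's real data model.

```python
from dataclasses import dataclass

@dataclass
class MoECommRecord:
    """Hypothetical backend-neutral record a cross-platform analysis
    module could consume; the actual DeepXTrace schema may differ."""
    backend: str       # "deepep" (GPU) or "mc2" (NPU)
    rank: int          # global rank that emitted the sample
    peer: int          # destination/source rank of the transfer
    op: str            # "dispatch" or "combine"
    latency_us: float  # measured one-way latency in microseconds

def normalize(raw: dict) -> MoECommRecord:
    # Map a backend-specific raw sample (illustrative keys) onto the
    # common schema so GPU and NPU clusters share one analysis path.
    return MoECommRecord(
        backend=raw.get("lib", "deepep"),
        rank=raw["src"],
        peer=raw["dst"],
        op=raw["op"],
        latency_us=raw["t_us"],
    )
```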
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request updates the README.md to reflect the new support for Ascend NPUs in DeepXTrace. The documentation is generalized from being DeepEP/GPU-specific to cover MoE-based environments, including both GPUs and NPUs. Links to the NPU-specific communication library (MC2) and a relevant blog post have been added. The changes are good and improve the documentation's accuracy. I've suggested a couple of minor formatting improvements to enhance readability.

@sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • In the updated introduction sentence, add a space before the parenthesis in libraries(e.g., [DeepEP for GPU]...) and after the comma before [MC2 for NPU] to avoid Markdown rendering/reading issues.
  • Consider briefly defining xPU the first time it is introduced (e.g., GPU/NPU/other accelerators) so readers unfamiliar with the term immediately understand the scope.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the updated introduction sentence, add a space before the parenthesis in `libraries(e.g., [DeepEP for GPU]...)` and after the comma before `[MC2 for NPU]` to avoid Markdown rendering/reading issues.
- Consider briefly defining `xPU` the first time it is introduced (e.g., GPU/NPU/other accelerators) so readers unfamiliar with the term immediately understand the scope.

## Individual Comments

### Comment 1
<location> `README.md:24` </location>
<code_context>
 ![combine](figures/combine.png)

-##  DeepEP-Metrics-Probe
+##  MoE-COMM-Metrics-Probe

-A low-overhead module for measuring critical diagnostic indicators during DeepEP communication. See also: [DeepEP Diagnose PR](https://github.com/deepseek-ai/DeepEP/pull/311).
</code_context>

<issue_to_address>
**suggestion (typo):** Consider aligning the section title with the earlier component name for consistency.

Earlier in the README this is called `MoE COMM Metrics Probe`, but this header uses `MoE-COMM-Metrics-Probe`. Please pick one form (hyphenated or spaced) and use it consistently so it’s clear they refer to the same component.

Suggested implementation:

```
##  MoE COMM Metrics Probe

```

If there are other mentions of this component elsewhere in the README (e.g., in introductions, diagrams, or bullet lists) using a different variant (`MoE-COMM-Metrics-Probe`, `MoE COMM Metrics-Probe`, etc.), they should also be updated to `MoE COMM Metrics Probe` for full consistency.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.


wangfakang and others added 2 commits on December 13, 2025 at 22:15, both co-authored by gemini-code-assist[bot].
@wangfakang self-requested a review on December 13, 2025 at 14:20.
@wangfakang (Collaborator) left a comment
LGTM, Thanks.

@wangfakang merged commit 6f90c04 into antgroup:main on Dec 13, 2025 (2 checks passed).
