Ascend NPU support DeepXTrace #12
Conversation
1. Ascend NPU support for DeepXTrace with MoE dispatch/combine metrics probing.
2. Link to the related Ascend MoE operations pull request and an external case study article.
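To give a concrete sense of what dispatch/combine metrics probing involves, here is a minimal Python sketch of the idea; the class and method names are illustrative assumptions, not the actual DeepXTrace, DeepEP, or MC2 API:

```python
import time
from collections import defaultdict


class MoECommProbe:
    """Illustrative low-overhead probe for MoE communication phases.

    Records wall-clock latency per phase ("dispatch" = tokens routed to
    experts, "combine" = expert outputs gathered back). Hypothetical
    names; not the real DeepXTrace/DeepEP/MC2 interface.
    """

    def __init__(self, rank: int):
        self.rank = rank
        self.samples = defaultdict(list)  # phase name -> list of latencies (us)

    def record(self, phase: str):
        probe = self

        class _Timer:
            def __enter__(self):
                self._t0 = time.perf_counter_ns()

            def __exit__(self, *exc):
                probe.samples[phase].append(
                    (time.perf_counter_ns() - self._t0) / 1e3
                )

        return _Timer()


probe = MoECommProbe(rank=0)
with probe.record("dispatch"):
    pass  # all-to-all: tokens -> experts (DeepEP on GPU, MC2 on NPU)
with probe.record("combine"):
    pass  # all-to-all: expert outputs -> tokens
```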
Reviewer's Guide

Adds documentation updates to describe Ascend NPU (MC2) support and generalizes DeepXTrace from DeepEP-only GPU environments to a cross-platform MoE communication diagnostics tool, while linking to related MoE ops PRs and an external case study article.

Flow diagram for DeepXTrace MoE communication diagnostics:

```mermaid
flowchart TD
    A["Start MoE training job"] --> B["MoE COMM operations on GPU or NPU"]
    B --> C{"Backend type"}
    C -->|GPU| D["DeepEP GPU execution with integrated probe (PR 311)"]
    C -->|NPU| E["MC2 NPU execution with native instrumentation (PR 288)"]
    D --> F["MoE COMM Metrics Probe collects dispatch and combine metrics"]
    E --> F
    F --> G["Persist metrics from all ranks"]
    G --> H["DeepXTrace metrics analysis processes metrics"]
    H --> I["Build latency matrices and diagnostic indicators"]
    I --> J["Identify Comp-Slow, Mixed-Slow, Comm-Slow bottlenecks"]
    J --> K["Report slow ranks and paths across GPU/NPU clusters"]
    K --> L["End"]
```
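To make the analysis step concrete, here is a rough Python sketch of how per-path latency measurements might be turned into Comp-Slow/Mixed-Slow/Comm-Slow classifications. The thresholds and heuristics are assumptions for illustration only; DeepXTrace's actual diagnostic indicators may be defined differently:

```python
import numpy as np


def classify_slow_ranks(latency_us: np.ndarray, factor: float = 2.0) -> dict:
    """Heuristic classifier over an n x n latency matrix, where
    latency_us[src, dst] is the mean dispatch latency observed from
    rank src to rank dst. Illustrative only; not DeepXTrace's real
    indicator definitions."""
    n = latency_us.shape[0]
    slow = latency_us > factor * np.median(latency_us)  # mask of slow paths
    report = {}
    for r in range(n):
        into_r = slow[:, r].mean()    # fraction of slow paths arriving at r
        out_of_r = slow[r, :].mean()  # fraction of slow paths leaving r
        if into_r > 0.8 and out_of_r > 0.8:
            report[r] = "Mixed-Slow"  # slow in both directions
        elif into_r > 0.8:
            report[r] = "Comp-Slow"   # peers wait on r: likely a compute straggler
        elif max(into_r, out_of_r) > 0.2:
            report[r] = "Comm-Slow"   # only some paths slow: likely a link issue
    return report


# Example: rank 2 is uniformly slow to reach, so it gets flagged.
m = np.full((8, 8), 100.0)
m[:, 2] = 500.0
print(classify_slow_ranks(m))  # {2: 'Comp-Slow'}
```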
Summary of Changes

Hello @chaosisnotopen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances DeepXTrace by integrating support for Ascend NPUs, allowing the diagnostic tool to identify performance bottlenecks in a wider array of distributed computing environments. The changes primarily involve updating the project's documentation to reflect this expanded capability, ensuring clarity for users working with both GPU and NPU clusters in Mixture of Experts (MoE) setups.
Code Review
This pull request updates the README.md to reflect the new support for Ascend NPUs in DeepXTrace. The documentation is generalized from being DeepEP/GPU-specific to cover MoE-based environments, including both GPUs and NPUs. Links to the NPU-specific communication library (MC2) and a relevant blog post have been added. The changes are good and improve the documentation's accuracy. I've suggested a couple of minor formatting improvements to enhance readability.
Hey there - I've reviewed your changes - here's some feedback:
- In the updated introduction sentence, add a space before the parenthesis in `libraries(e.g., [DeepEP for GPU]...)` and after the comma before `[MC2 for NPU]` to avoid Markdown rendering/reading issues.
- Consider briefly defining `xPU` the first time it is introduced (e.g., GPU/NPU/other accelerators) so readers unfamiliar with the term immediately understand the scope.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the updated introduction sentence, add a space before the parenthesis in `libraries(e.g., [DeepEP for GPU]...)` and after the comma before `[MC2 for NPU]` to avoid Markdown rendering/reading issues.
- Consider briefly defining `xPU` the first time it is introduced (e.g., GPU/NPU/other accelerators) so readers unfamiliar with the term immediately understand the scope.
## Individual Comments
### Comment 1
<location> `README.md:24` </location>
<code_context>

-## DeepEP-Metrics-Probe
+## MoE-COMM-Metrics-Probe
-A low-overhead module for measuring critical diagnostic indicators during DeepEP communication. See also: [DeepEP Diagnose PR](https://github.com/deepseek-ai/DeepEP/pull/311).
</code_context>
<issue_to_address>
**suggestion (typo):** Consider aligning the section title with the earlier component name for consistency.
Earlier in the README this is called `MoE COMM Metrics Probe`, but this header uses `MoE-COMM-Metrics-Probe`. Please pick one form (hyphenated or spaced) and use it consistently so it’s clear they refer to the same component.
Suggested implementation:
```
## MoE COMM Metrics Probe
```
If there are other mentions of this component elsewhere in the README (e.g., in introductions, diagrams, or bullet lists) using a different variant (`MoE-COMM-Metrics-Probe`, `MoE COMM Metrics-Probe`, etc.), they should also be updated to `MoE COMM Metrics Probe` for full consistency.
</issue_to_address>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
wangfakang left a comment:

LGTM, Thanks.
Summary by Sourcery
Document cross-platform DeepXTrace diagnostics for MoE-based distributed environments, including new NPU support and MoE communication probing.