
Conversation

@wangfakang wangfakang commented Dec 11, 2025

Summary by Sourcery

Add a command-line Python tool to generate DeepXTrace performance heatmaps from log file matrix data and document its usage in the tools README.

New Features:

  • Introduce a DeepXTrace heatmap generator that parses matrix data from logs and produces color-coded visualizations with a custom colormap.
  • Provide a CLI interface to configure heatmap generation parameters such as title, figure size, DPI, output format, and cell size scaling.

Documentation:

  • Add tools documentation describing installation, input format expectations, CLI options, and usage examples for the DeepXTrace heatmap generator.

sourcery-ai bot commented Dec 11, 2025

Reviewer's Guide

Adds a standalone Python CLI tool to generate DeepXTrace heatmap visualizations from log files plus accompanying usage documentation.

Sequence diagram for DeepXTrace heatmap CLI execution flow

```mermaid
sequenceDiagram
    actor User
    participant CLI as DeepXTraceHeatmapCLI
    participant Argparse
    participant FileSystem
    participant Parser as MatrixParser
    participant Plotter as HeatmapPlotter
    participant Matplotlib

    User->>CLI: python deepxtrace_heatmap.py input_file --options

    CLI->>Argparse: parse_args()
    Argparse-->>CLI: args

    CLI->>FileSystem: open(input_file)
    FileSystem-->>CLI: log_content

    CLI->>Parser: parse_matrix_data(log_content)
    Parser-->>CLI: data_matrix

    CLI->>Plotter: plot_deepxtrace_heatmap(data_matrix, title, figsize, dpi, format, cell_ratio)

    Plotter->>Matplotlib: configure_figure_and_colormap()
    Matplotlib-->>Plotter: figure_axes

    Plotter->>Matplotlib: sns.heatmap(...)
    Matplotlib-->>Plotter: rendered_heatmap

    Plotter->>Matplotlib: savefig("deepxtrace.{format}")
    Matplotlib-->>Plotter: file_written

    Plotter-->>CLI: completion
    CLI-->>User: print saving_started_and_completed_timestamps
```

Class diagram for deepxtrace_heatmap Python tool structure

```mermaid
classDiagram
    class DeepxtraceHeatmapModule {
        <<module>>
        +create_optimized_ryg_cmap() LinearSegmentedColormap
        +parse_matrix_data(log_data string) ndarray
        +read_log_file(file_path string) string
        +plot_deepxtrace_heatmap(matrix ndarray, title string, figsize float[], dpi int, output_format string, cell_ratio float) void
        +main() void
    }

    class ArgparseModule {
        <<library>>
    }

    class NumpyModule {
        <<library>>
    }

    class MatplotlibPyplotModule {
        <<library>>
    }

    class SeabornModule {
        <<library>>
    }

    DeepxtraceHeatmapModule ..> ArgparseModule : uses
    DeepxtraceHeatmapModule ..> NumpyModule : uses
    DeepxtraceHeatmapModule ..> MatplotlibPyplotModule : uses
    DeepxtraceHeatmapModule ..> SeabornModule : uses
```

File-Level Changes

Change: Introduce a DeepXTrace heatmap generator script with CLI support for reading log files, parsing matrix data, and rendering configurable heatmaps to disk (files: tools/deepxtrace_heatmap.py).
  • Create a custom optimized red-yellow-green LinearSegmentedColormap for visualizing token wait times
  • Implement matrix parsing from log content using regex to extract bracketed numeric sequences into a 2D NumPy array
  • Add a robust file-reading helper with basic error handling and user-friendly messages
  • Generate a log-scaled seaborn heatmap with dynamic figure sizing, annotation font sizing, and customized colorbar/labels
  • Provide a command-line interface that wires together input parsing, log reading, matrix conversion, and heatmap generation with configurable output options

Change: Document how to install dependencies and use the DeepXTrace heatmap visualization tool from the command line, including expected input format and sample output (files: tools/README.md).
  • Describe the tool's purpose and main capabilities in a new README for tools
  • Document installation requirements and basic/advanced CLI usage examples
  • Specify CLI parameters, defaults, and the expected log input matrix format with a concrete example
  • Include a sample output heatmap reference image path for quick visual confirmation
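The colormap bullet above corresponds to matplotlib's `LinearSegmentedColormap.from_list`, which is visible in the review excerpts further down. A minimal sketch with assumed color stops (the PR's exact colors are not shown in this thread):

```python
from matplotlib.colors import LinearSegmentedColormap


def create_optimized_ryg_cmap():
    # Hypothetical stops: green (short waits) -> yellow -> red (long waits).
    # The actual stops in tools/deepxtrace_heatmap.py may differ.
    colors = ["#2ecc71", "#f1c40f", "#e74c3c"]
    return LinearSegmentedColormap.from_list("optimized_ryg", colors)


cmap = create_optimized_ryg_cmap()
```

Passing the resulting `cmap` to `sns.heatmap(..., cmap=cmap)` maps low values to green and high values to red, which matches the "wait time" semantics described here.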


@gemini-code-assist

Summary of Changes

Hello @wangfakang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new utility to enhance the analysis of DeepXTrace performance data. It provides a dedicated Python tool that transforms raw numerical matrix data from log files into visually intuitive heatmaps. This aims to streamline the process of identifying performance bottlenecks and patterns by offering a clear, customizable graphical representation of the data.

Highlights

  • New Visualization Tool: Introduced a new Python-based tool for generating heatmap visualizations specifically designed for DeepXTrace performance data.
  • Data Parsing and Processing: The tool automatically parses matrix data from log files, supporting Dispatch or Combine matrix formats, and applies a log-scale transformation for better visualization of varied data.
  • Customizable Heatmaps: Heatmaps are generated with an optimized Red-Yellow-Green colormap, dynamic cell sizing, and configurable output options including title, figure dimensions, DPI, and format (PNG, SVG, PDF).
  • Command-Line Interface: A user-friendly command-line interface is provided, allowing easy configuration of heatmap generation parameters.
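The log-scale transformation highlighted above appears as `np.log1p` in the review excerpts later in this thread. A quick illustration of why it suits wait-time data spanning several orders of magnitude:

```python
import numpy as np

# log1p(x) = log(x + 1): safe for zero wait times, and it compresses the
# dynamic range so small and very large values share one readable color scale.
matrix = np.array([[0, 10], [100, 100000]])
log_matrix = np.log1p(matrix)
```

The heatmap is then drawn from `log_matrix` while the raw `matrix` values can still be used for cell annotations, which is the split the review comments below discuss.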


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • Consider validating the parsed matrix (e.g., non-empty, consistent row lengths) and failing with a clear error message before calling np.array and plotting, to avoid confusing runtime errors on malformed logs.
  • The output filename is currently hard-coded to deepxtrace.<format>; you may want to allow the user to override this (e.g., via a --output argument) so multiple runs on different inputs don’t overwrite each other.
## Individual Comments

### Comment 1
<location> `tools/deepxtrace_heatmap.py:24-27` </location>
<code_context>
+    return LinearSegmentedColormap.from_list("optimized_ryg", colors)
+
+
+def parse_matrix_data(log_data):
+    """Parse matrix data from log string containing bracketed number sequences"""
+    number_sequences = re.findall(r'\[([\d\s]+)\]', log_data)
+    return np.array([list(map(int, seq.split())) for seq in number_sequences])
+
+
</code_context>

<issue_to_address>
**issue:** Handle empty or irregular matrix sequences more defensively before constructing the numpy array.

`re.findall` can return an empty list, and inconsistent row lengths will produce a ragged array. That can lead to unexpected shapes or downstream errors when accessing `matrix.shape`. Consider validating `number_sequences` (non-empty, consistent lengths) and raising a clear error on malformed input instead of relying on later failures.
</issue_to_address>
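A defensive variant along the lines this comment suggests might look like the following (a sketch, not the PR's code; the error messages are illustrative):

```python
import re

import numpy as np


def parse_matrix_data(log_data):
    """Parse bracketed integer rows from a log string into a 2D array.

    Raises ValueError on empty or ragged input instead of deferring the
    failure to later numpy operations.
    """
    number_sequences = re.findall(r'\[([\d\s]+)\]', log_data)
    # Skip brackets containing only whitespace, e.g. "[ ]"
    rows = [list(map(int, seq.split())) for seq in number_sequences if seq.strip()]
    if not rows:
        raise ValueError("No matrix data found in log input")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("Inconsistent row lengths in parsed matrix data")
    return np.array(rows)
```

With this shape check in place, `matrix.shape` downstream is guaranteed to be 2D, and malformed logs fail fast with a readable message.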

### Comment 2
<location> `tools/deepxtrace_heatmap.py:62-63` </location>
<code_context>
+        cell_ratio: Cell size scaling factor (default: 1.5)
+    """
+    # Calculate adjusted figure size based on matrix dimensions
+    rows, cols = matrix.shape
+    adjusted_figsize = (figsize[0] * cell_ratio * (cols / 10),
+                        figsize[1] * cell_ratio * (rows / 10))
+
</code_context>

<issue_to_address>
**suggestion (performance):** Consider clamping the auto-scaled figure size to avoid excessively large outputs for big matrices.

Because `adjusted_figsize` scales linearly with `rows` and `cols`, very large matrices can produce excessively large figures, risking high memory use and unusable plots. Consider capping the width/height (or switching to a nonlinear scaling) so sizes stay within reasonable bounds while still reflecting matrix dimensions.
</issue_to_address>
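One way to apply the suggested cap (a sketch; the per-10-cells baseline mirrors the PR's formula as quoted above, while the 40-inch ceiling is an assumed value):

```python
def adjusted_figsize(figsize, rows, cols, cell_ratio=1.5, max_inches=40.0):
    """Scale figure size with matrix dimensions, clamped to a sane maximum."""
    width = figsize[0] * cell_ratio * (cols / 10)
    height = figsize[1] * cell_ratio * (rows / 10)
    # Clamp both dimensions so very large matrices cannot produce
    # multi-gigabyte figures or exhaust matplotlib's renderer.
    return (min(width, max_inches), min(height, max_inches))
```

For a 1000x1000 matrix this yields a 40x40-inch figure instead of 1500x1200 inches, while small matrices keep the original linear scaling.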

### Comment 3
<location> `tools/deepxtrace_heatmap.py:82-90` </location>
<code_context>
+    annot_size = max(8, min(20, 10 * cell_ratio))
+
+    # Generate heatmap
+    ax = sns.heatmap(
+        log_matrix,
+        cmap=cmap,
+        norm=norm,
+        annot=matrix,
+        fmt='.2e',
+        linewidths=0.5,
+        linecolor="white",
+        annot_kws={
+            "size": annot_size,
+            "color": "black"
</code_context>

<issue_to_address>
**suggestion (performance):** Automatically disabling annotations for large matrices would improve performance and readability.

`annot=matrix` doesn’t scale well for large matrices in terms of runtime and visual noise. Consider disabling annotations when `rows * cols` exceeds a threshold, or gate them behind a parameter/CLI flag so large traces stay responsive while smaller ones can still show detailed values.

Suggested implementation:

```python
    # Create colormap and normalize data
    cmap = create_optimized_ryg_cmap()
    log_matrix = np.log1p(matrix)
    norm = plt.Normalize(vmin=log_matrix.min(), vmax=log_matrix.max())

    # Dynamic annotation size based on cell size
    annot_size = max(8, min(20, 10 * cell_ratio))

    # Automatically disable annotations for large matrices (performance & readability)
    # Threshold can be tuned; 400 cells (~20x20) keeps small traces detailed and larger ones clean.
    show_annotations = matrix.size <= 400

    if show_annotations:
        annot_data = matrix
        annot_kws = {
            "size": annot_size,
            "color": "black",
        }
    else:
        annot_data = None
        annot_kws = None

    # Generate heatmap
    ax = sns.heatmap(
        log_matrix,
        cmap=cmap,
        norm=norm,
        annot=annot_data,
        fmt='.2e',
        linewidths=0.5,
        linecolor="white",
        annot_kws=annot_kws,
        cbar_kws={
            "label": "Log(Value + 1) Scale",
            "format": ScalarFormatter(),
            "shrink": 0.8
        }
    )

    # Configure labels and title
    ax.set_title(title, fontsize=16 * cell_ratio, pad=20, fontweight='bold')

```

If you want this behavior to be configurable (e.g., via a parameter or CLI flag), you can:
1. Add a function argument such as `enable_annotations: bool | None = None` and/or `annotation_threshold: int = 400`.
2. Replace the hardcoded `show_annotations = matrix.size <= 400` with logic that respects those parameters:
   - If `enable_annotations` is `True` / `False`, override the automatic behavior.
   - If `enable_annotations` is `None`, fall back to `matrix.size <= annotation_threshold`.
3. Thread the new argument(s) through any CLI parsing or caller functions that invoke this heatmap generator.
</issue_to_address>

### Comment 4
<location> `tools/deepxtrace_heatmap.py:116-117` </location>
<code_context>
+    )
+
+    # Save output
+    output_file = f"deepxtrace.{output_format}"
+    print(
+        f"Saving started at {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
+    plt.savefig(
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Derive the output filename from the input file (or allow configuration) instead of always using a fixed name.

Using a fixed `deepxtrace.{format}` filename risks overwriting previous runs and makes it hard to match outputs to their inputs. Consider deriving the name from `input_file` (e.g., its stem plus suffix/timestamp) to make repeated or batch runs safer and more traceable.

Suggested implementation:

```python
from datetime import datetime
from pathlib import Path

```

```python
    # Save output
    # Derive output filename from input file to avoid overwriting and improve traceability
    input_path = Path(input_file)  # assumes input_file is available in this scope
    timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    output_file = f"{input_path.stem}.deepxtrace.{timestamp}.{output_format}"
    print(
        f"Saving to {output_file} started at {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")

```

1. Ensure that `input_file` is available in this scope (e.g., passed into the function generating the plot). If it has a different name, update `Path(input_file)` accordingly.
2. If the file header does not already contain `from datetime import datetime`, adjust the import replacement block to match the actual imports at the top of the file.
3. If you want to make the output filename fully configurable, add an optional `output_file`/`output_dir` argument to the relevant function and CLI, and only fall back to the derived name when no explicit value is provided.
</issue_to_address>




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a very useful command-line tool for generating heatmaps from DeepXTrace log data, along with corresponding documentation. The implementation is solid and covers a good range of features like custom colormaps and configurable output. My review focuses on improving the tool's robustness and usability. The key suggestions are to add a command-line option for the output file to avoid overwriting results, handle cases where no data is found in the log file to prevent crashes, and make the data parsing more resilient. I've also included a minor formatting suggestion to improve code readability.

```python
def parse_matrix_data(log_data):
    """Parse matrix data from log string containing bracketed number sequences"""
    number_sequences = re.findall(r'\[([\d\s]+)\]', log_data)
    return np.array([list(map(int, seq.split())) for seq in number_sequences])
```

Severity: medium

The list comprehension for parsing numbers can be made more robust. If a log line contains brackets with only whitespace (e.g., [ ]), seq.split() will produce an empty list. This will result in np.array creating a jagged array, which will likely cause errors later on. You can prevent this by filtering out sequences that are empty or contain only whitespace.

Suggested change:

```diff
-    return np.array([list(map(int, seq.split())) for seq in number_sequences])
+    return np.array([list(map(int, seq.split())) for seq in number_sequences if seq.strip()])
```

Signed-off-by: wangfakang <fakangwang@gmail.com>
Co-authored-by: 毅松 <fakang.wfk@antgroup.com>
Signed-off-by: wangfakang <fakangwang@gmail.com>
@wangfakang wangfakang merged commit 30b57ca into antgroup:main Dec 11, 2025
3 checks passed