-
Notifications
You must be signed in to change notification settings - Fork 7
Add a visualization tool for DeepXTrace heatmap generation #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's GuideAdds a standalone Python CLI tool to generate DeepXTrace heatmap visualizations from log files plus accompanying usage documentation. Sequence diagram for DeepXTrace heatmap CLI execution flowsequenceDiagram
actor User
participant CLI as DeepXTraceHeatmapCLI
participant Argparse
participant FileSystem
participant Parser as MatrixParser
participant Plotter as HeatmapPlotter
participant Matplotlib
User->>CLI: python deepxtrace_heatmap.py input_file --options
CLI->>Argparse: parse_args()
Argparse-->>CLI: args
CLI->>FileSystem: open(input_file)
FileSystem-->>CLI: log_content
CLI->>Parser: parse_matrix_data(log_content)
Parser-->>CLI: data_matrix
CLI->>Plotter: plot_deepxtrace_heatmap(data_matrix, title, figsize, dpi, format, cell_ratio)
Plotter->>Matplotlib: configure_figure_and_colormap()
Matplotlib-->>Plotter: figure_axes
Plotter->>Matplotlib: sns.heatmap(...)
Matplotlib-->>Plotter: rendered_heatmap
Plotter->>Matplotlib: savefig("deepxtrace.{format}")
Matplotlib-->>Plotter: file_written
Plotter-->>CLI: completion
CLI-->>User: print saving_started_and_completed_timestamps
Class diagram for deepxtrace_heatmap Python tool structureclassDiagram
class DeepxtraceHeatmapModule {
<<module>>
+create_optimized_ryg_cmap() LinearSegmentedColormap
+parse_matrix_data(log_data string) ndarray
+read_log_file(file_path string) string
+plot_deepxtrace_heatmap(matrix ndarray, title string, figsize float[], dpi int, output_format string, cell_ratio float) void
+main() void
}
class ArgparseModule {
<<library>>
}
class NumpyModule {
<<library>>
}
class MatplotlibPyplotModule {
<<library>>
}
class SeabornModule {
<<library>>
}
DeepxtraceHeatmapModule ..> ArgparseModule : uses
DeepxtraceHeatmapModule ..> NumpyModule : uses
DeepxtraceHeatmapModule ..> MatplotlibPyplotModule : uses
DeepxtraceHeatmapModule ..> SeabornModule : uses
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Summary of ChangesHello @wangfakang, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new utility to enhance the analysis of DeepXTrace performance data. It provides a dedicated Python tool that transforms raw numerical matrix data from log files into visually intuitive heatmaps. This aims to streamline the process of identifying performance bottlenecks and patterns by offering a clear, customizable graphical representation of the data. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- Consider validating the parsed matrix (e.g., non-empty, consistent row lengths) and failing with a clear error message before calling
np.arrayand plotting, to avoid confusing runtime errors on malformed logs. - The output filename is currently hard-coded to
deepxtrace.<format>; you may want to allow the user to override this (e.g., via a--outputargument) so multiple runs on different inputs don’t overwrite each other.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider validating the parsed matrix (e.g., non-empty, consistent row lengths) and failing with a clear error message before calling `np.array` and plotting, to avoid confusing runtime errors on malformed logs.
- The output filename is currently hard-coded to `deepxtrace.<format>`; you may want to allow the user to override this (e.g., via a `--output` argument) so multiple runs on different inputs don’t overwrite each other.
## Individual Comments
### Comment 1
<location> `tools/deepxtrace_heatmap.py:24-27` </location>
<code_context>
+ return LinearSegmentedColormap.from_list("optimized_ryg", colors)
+
+
+def parse_matrix_data(log_data):
+ """Parse matrix data from log string containing bracketed number sequences"""
+ number_sequences = re.findall(r'\[([\d\s]+)\]', log_data)
+ return np.array([list(map(int, seq.split())) for seq in number_sequences])
+
+
</code_context>
<issue_to_address>
**issue:** Handle empty or irregular matrix sequences more defensively before constructing the numpy array.
`re.findall` can return an empty list, and inconsistent row lengths will produce a ragged array. That can lead to unexpected shapes or downstream errors when accessing `matrix.shape`. Consider validating `number_sequences` (non-empty, consistent lengths) and raising a clear error on malformed input instead of relying on later failures.
</issue_to_address>
### Comment 2
<location> `tools/deepxtrace_heatmap.py:62-63` </location>
<code_context>
+ cell_ratio: Cell size scaling factor (default: 1.5)
+ """
+ # Calculate adjusted figure size based on matrix dimensions
+ rows, cols = matrix.shape
+ adjusted_figsize = (figsize[0] * cell_ratio * (cols / 10),
+ figsize[1] * cell_ratio * (rows / 10))
+
</code_context>
<issue_to_address>
**suggestion (performance):** Consider clamping the auto-scaled figure size to avoid excessively large outputs for big matrices.
Because `adjusted_figsize` scales linearly with `rows` and `cols`, very large matrices can produce excessively large figures, risking high memory use and unusable plots. Consider capping the width/height (or switching to a nonlinear scaling) so sizes stay within reasonable bounds while still reflecting matrix dimensions.
</issue_to_address>
### Comment 3
<location> `tools/deepxtrace_heatmap.py:82-90` </location>
<code_context>
+ annot_size = max(8, min(20, 10 * cell_ratio))
+
+ # Generate heatmap
+ ax = sns.heatmap(
+ log_matrix,
+ cmap=cmap,
+ norm=norm,
+ annot=matrix,
+ fmt='.2e',
+ linewidths=0.5,
+ linecolor="white",
+ annot_kws={
+ "size": annot_size,
+ "color": "black"
</code_context>
<issue_to_address>
**suggestion (performance):** Automatically disabling annotations for large matrices would improve performance and readability.
`annot=matrix` doesn’t scale well for large matrices in terms of runtime and visual noise. Consider disabling annotations when `rows * cols` exceeds a threshold, or gate them behind a parameter/CLI flag so large traces stay responsive while smaller ones can still show detailed values.
Suggested implementation:
```python
# Create colormap and normalize data
cmap = create_optimized_ryg_cmap()
log_matrix = np.log1p(matrix)
norm = plt.Normalize(vmin=log_matrix.min(), vmax=log_matrix.max())
# Dynamic annotation size based on cell size
annot_size = max(8, min(20, 10 * cell_ratio))
# Automatically disable annotations for large matrices (performance & readability)
# Threshold can be tuned; 400 cells (~20x20) keeps small traces detailed and larger ones clean.
show_annotations = matrix.size <= 400
if show_annotations:
annot_data = matrix
annot_kws = {
"size": annot_size,
"color": "black",
}
else:
annot_data = None
annot_kws = None
# Generate heatmap
ax = sns.heatmap(
log_matrix,
cmap=cmap,
norm=norm,
annot=annot_data,
fmt='.2e',
linewidths=0.5,
linecolor="white",
annot_kws=annot_kws,
cbar_kws={
"label": "Log(Value + 1) Scale",
"format": ScalarFormatter(),
"shrink": 0.8
}
)
# Configure labels and title
ax.set_title(title, fontsize=16 * cell_ratio, pad=20, fontweight='bold')
```
If you want this behavior to be configurable (e.g., via a parameter or CLI flag), you can:
1. Add a function argument such as `enable_annotations: bool | None = None` and/or `annotation_threshold: int = 400`.
2. Replace the hardcoded `show_annotations = matrix.size <= 400` with logic that respects those parameters:
- If `enable_annotations` is `True` / `False`, override the automatic behavior.
- If `enable_annotations` is `None`, fall back to `matrix.size <= annotation_threshold`.
3. Thread the new argument(s) through any CLI parsing or caller functions that invoke this heatmap generator.
</issue_to_address>
### Comment 4
<location> `tools/deepxtrace_heatmap.py:116-117` </location>
<code_context>
+ )
+
+ # Save output
+ output_file = f"deepxtrace.{output_format}"
+ print(
+ f"Saving started at {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
+ plt.savefig(
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Derive the output filename from the input file (or allow configuration) instead of always using a fixed name.
Using a fixed `deepxtrace.{format}` filename risks overwriting previous runs and makes it hard to match outputs to their inputs. Consider deriving the name from `input_file` (e.g., its stem plus suffix/timestamp) to make repeated or batch runs safer and more traceable.
Suggested implementation:
```python
from datetime import datetime
from pathlib import Path
```
```python
# Save output
# Derive output filename from input file to avoid overwriting and improve traceability
input_path = Path(input_file) # assumes input_file is available in this scope
timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
output_file = f"{input_path.stem}.deepxtrace.{timestamp}.{output_format}"
print(
f"Saving to {output_file} started at {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
```
1. Ensure that `input_file` is available in this scope (e.g., passed into the function generating the plot). If it has a different name, update `Path(input_file)` accordingly.
2. If the file header does not already contain `from datetime import datetime`, adjust the import replacement block to match the actual imports at the top of the file.
3. If you want to make the output filename fully configurable, add an optional `output_file`/`output_dir` argument to the relevant function and CLI, and only fall back to the derived name when no explicit value is provided.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def parse_matrix_data(log_data): | ||
| """Parse matrix data from log string containing bracketed number sequences""" | ||
| number_sequences = re.findall(r'\[([\d\s]+)\]', log_data) | ||
| return np.array([list(map(int, seq.split())) for seq in number_sequences]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Handle empty or irregular matrix sequences more defensively before constructing the numpy array.
re.findall can return an empty list, and inconsistent row lengths will produce a ragged array. That can lead to unexpected shapes or downstream errors when accessing matrix.shape. Consider validating number_sequences (non-empty, consistent lengths) and raising a clear error on malformed input instead of relying on later failures.
| rows, cols = matrix.shape | ||
| adjusted_figsize = (figsize[0] * cell_ratio * (cols / 10), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (performance): Consider clamping the auto-scaled figure size to avoid excessively large outputs for big matrices.
Because adjusted_figsize scales linearly with rows and cols, very large matrices can produce excessively large figures, risking high memory use and unusable plots. Consider capping the width/height (or switching to a nonlinear scaling) so sizes stay within reasonable bounds while still reflecting matrix dimensions.
| ax = sns.heatmap( | ||
| log_matrix, | ||
| cmap=cmap, | ||
| norm=norm, | ||
| annot=matrix, | ||
| fmt='.2e', | ||
| linewidths=0.5, | ||
| linecolor="white", | ||
| annot_kws={ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (performance): Automatically disabling annotations for large matrices would improve performance and readability.
annot=matrix doesn’t scale well for large matrices in terms of runtime and visual noise. Consider disabling annotations when rows * cols exceeds a threshold, or gate them behind a parameter/CLI flag so large traces stay responsive while smaller ones can still show detailed values.
Suggested implementation:
# Create colormap and normalize data
cmap = create_optimized_ryg_cmap()
log_matrix = np.log1p(matrix)
norm = plt.Normalize(vmin=log_matrix.min(), vmax=log_matrix.max())
# Dynamic annotation size based on cell size
annot_size = max(8, min(20, 10 * cell_ratio))
# Automatically disable annotations for large matrices (performance & readability)
# Threshold can be tuned; 400 cells (~20x20) keeps small traces detailed and larger ones clean.
show_annotations = matrix.size <= 400
if show_annotations:
annot_data = matrix
annot_kws = {
"size": annot_size,
"color": "black",
}
else:
annot_data = None
annot_kws = None
# Generate heatmap
ax = sns.heatmap(
log_matrix,
cmap=cmap,
norm=norm,
annot=annot_data,
fmt='.2e',
linewidths=0.5,
linecolor="white",
annot_kws=annot_kws,
cbar_kws={
"label": "Log(Value + 1) Scale",
"format": ScalarFormatter(),
"shrink": 0.8
}
)
# Configure labels and title
ax.set_title(title, fontsize=16 * cell_ratio, pad=20, fontweight='bold')If you want this behavior to be configurable (e.g., via a parameter or CLI flag), you can:
- Add a function argument such as
enable_annotations: bool | None = Noneand/orannotation_threshold: int = 400. - Replace the hardcoded
show_annotations = matrix.size <= 400with logic that respects those parameters:- If
enable_annotationsisTrue/False, override the automatic behavior. - If
enable_annotationsisNone, fall back tomatrix.size <= annotation_threshold.
- If
- Thread the new argument(s) through any CLI parsing or caller functions that invoke this heatmap generator.
| output_file = f"deepxtrace.{output_format}" | ||
| print( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): Derive the output filename from the input file (or allow configuration) instead of always using a fixed name.
Using a fixed deepxtrace.{format} filename risks overwriting previous runs and makes it hard to match outputs to their inputs. Consider deriving the name from input_file (e.g., its stem plus suffix/timestamp) to make repeated or batch runs safer and more traceable.
Suggested implementation:
from datetime import datetime
from pathlib import Path # Save output
# Derive output filename from input file to avoid overwriting and improve traceability
input_path = Path(input_file) # assumes input_file is available in this scope
timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
output_file = f"{input_path.stem}.deepxtrace.{timestamp}.{output_format}"
print(
f"Saving to {output_file} started at {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")- Ensure that
input_fileis available in this scope (e.g., passed into the function generating the plot). If it has a different name, updatePath(input_file)accordingly. - If the file header does not already contain
from datetime import datetime, adjust the import replacement block to match the actual imports at the top of the file. - If you want to make the output filename fully configurable, add an optional
output_file/output_dirargument to the relevant function and CLI, and only fall back to the derived name when no explicit value is provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a very useful command-line tool for generating heatmaps from DeepXTrace log data, along with corresponding documentation. The implementation is solid and covers a good range of features like custom colormaps and configurable output. My review focuses on improving the tool's robustness and usability. The key suggestions are to add a command-line option for the output file to avoid overwriting results, handle cases where no data is found in the log file to prevent crashes, and make the data parsing more resilient. I've also included a minor formatting suggestion to improve code readability.
| def parse_matrix_data(log_data): | ||
| """Parse matrix data from log string containing bracketed number sequences""" | ||
| number_sequences = re.findall(r'\[([\d\s]+)\]', log_data) | ||
| return np.array([list(map(int, seq.split())) for seq in number_sequences]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list comprehension for parsing numbers can be made more robust. If a log line contains brackets with only whitespace (e.g., [ ]), seq.split() will produce an empty list. This will result in np.array creating a jagged array, which will likely cause errors later on. You can prevent this by filtering out sequences that are empty or contain only whitespace.
| return np.array([list(map(int, seq.split())) for seq in number_sequences]) | |
| return np.array([list(map(int, seq.split())) for seq in number_sequences if seq.strip()]) |
3bdddfa to
35bf73c
Compare
Signed-off-by: wangfakang <fakangwang@gmail.com> Co-authored-by: 毅松 <fakang.wfk@antgroup.com> Signed-off-by: wangfakang <fakangwang@gmail.com>
35bf73c to
f5fccad
Compare
Summary by Sourcery
Add a command-line Python tool to generate DeepXTrace performance heatmaps from log file matrix data and document its usage in the tools README.
New Features:
Documentation: