feat(dashboard): add Hostname template variable and update panel legend labels (fixes #630) #632

Open
cluster2600 wants to merge 1 commit into NVIDIA:main from cluster2600:feat/630-hostname-label-grafana-dashboard

What

Add a Hostname template variable to the Grafana dashboard and update all panel legend formats to include the hostname alongside the GPU index.

Changes in grafana/dcgm-exporter-dashboard.json:

  • New template variable hostname sourced from label_values(DCGM_FI_DEV_GPU_TEMP, Hostname), supporting multi-select and All
  • All panel PromQL queries now include Hostname=~"$hostname" as an additional filter
  • Legend format updated from GPU {{gpu}} to {{Hostname}}-GPU {{gpu}} for all time-series panels

Why

Closes #630

In multi-node GPU clusters the existing dashboard labels series as GPU 0, GPU 1 … without any host information, making it impossible to tell which physical node a GPU belongs to. When multiple nodes each report a GPU 0, graphs become ambiguous and misleading.

The Hostname label is already exported by dcgm-exporter by default (it is omitted only when the exporter runs with the -n / --no-hostname flag). Surfacing it in the dashboard legend removes the ambiguity without any configuration changes on the exporter side.

How

  • Added a query-type Grafana template variable that dynamically discovers all hostnames via label_values(DCGM_FI_DEV_GPU_TEMP, Hostname). Supports includeAll and multi so operators can scope the view to a single node or compare across nodes.
  • Injected Hostname=~"$hostname" into every panel's PromQL selector so the hostname variable drives real-time filtering.
  • Changed legendFormat for the six time-series panels that showed only GPU {{gpu}} to {{Hostname}}-GPU {{gpu}}. Aggregate/stat panels (Avg Temp, Power Total) that had no legend format are left unchanged.
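
The three steps above can be sketched in Python as a rough illustration. This is not the script used for this PR (the dashboard JSON was edited directly); the helper names `hostname_variable`, `add_hostname_filter`, and `update_legend` are hypothetical, and the variable's field names are assumed from Grafana's dashboard JSON model.

```python
import re

OLD_LEGEND = "GPU {{gpu}}"
NEW_LEGEND = "{{Hostname}}-GPU {{gpu}}"
HOSTNAME_MATCHER = 'Hostname=~"$hostname"'


def hostname_variable(metric="DCGM_FI_DEV_GPU_TEMP"):
    """Build a query-type template variable entry.

    Field names (name, type, query, includeAll, multi) are assumptions
    based on Grafana's dashboard JSON model.
    """
    return {
        "name": "hostname",
        "type": "query",
        "query": f"label_values({metric}, Hostname)",
        "includeAll": True,
        "multi": True,
    }


def add_hostname_filter(expr):
    """Append the Hostname matcher to each {...} label selector in expr.

    Simplification: metrics written without braces, and selectors that
    contain Grafana's ${var} syntax, are left untouched.
    """
    def patch(match):
        body = match.group(1).strip().rstrip(",").rstrip()
        if "Hostname=~" in body:
            return match.group(0)  # already filtered, keep as-is
        prefix = body + ", " if body else ""
        return "{" + prefix + HOSTNAME_MATCHER + "}"

    # (?<!\$) skips the braces of ${var}-style variable references.
    return re.sub(r"(?<!\$)\{([^{}]*)\}", patch, expr)


def update_legend(legend_format):
    """Swap the old legend for the new one; leave other legends alone."""
    return NEW_LEGEND if legend_format == OLD_LEGEND else legend_format
```

For example, `add_hostname_filter('DCGM_FI_DEV_GPU_TEMP{instance=~"$instance", gpu=~"$gpu"}')` returns the same selector with `Hostname=~"$hostname"` appended, and calling it a second time changes nothing.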

Testing

  • JSON validated with python3 -c "import json; json.load(open('grafana/dcgm-exporter-dashboard.json'))"
  • Confirmed template variable list order: datasource → hostname → instance → gpu
  • Verified all eight panels have the updated PromQL selectors and the six time-series panels have the updated legend format
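
The manual checks above could also be automated. The following is a minimal sketch, assuming the dashboard keeps its variables under `templating.list` and its queries under each panel's `targets` (panels nested inside collapsed rows are not handled); `check_dashboard` is a hypothetical helper, not part of this PR.

```python
import json

REQUIRED_MATCHER = 'Hostname=~"$hostname"'
EXPECTED_ORDER = ["datasource", "hostname", "instance", "gpu"]
OLD_LEGEND = "GPU {{gpu}}"


def check_dashboard(dashboard):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []

    # 1. Template variable order.
    variables = dashboard.get("templating", {}).get("list", [])
    names = [v.get("name") for v in variables]
    if names != EXPECTED_ORDER:
        problems.append(f"unexpected variable order: {names}")

    # 2. Every query filters on hostname; 3. no legend uses the old format.
    for panel in dashboard.get("panels", []):
        title = panel.get("title", "<untitled>")
        for target in panel.get("targets", []):
            if REQUIRED_MATCHER not in target.get("expr", ""):
                problems.append(f"{title}: query lacks the Hostname matcher")
            if target.get("legendFormat") == OLD_LEGEND:
                problems.append(f"{title}: legend still uses '{OLD_LEGEND}'")
    return problems


def check_file(path):
    """Load a dashboard JSON file and run the checks on it."""
    with open(path, encoding="utf-8") as handle:
        return check_dashboard(json.load(handle))
```

Running `check_file("grafana/dcgm-exporter-dashboard.json")` from the repository root would return an empty list if the dashboard matches the description in this PR.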

Checklist

  • JSON is valid and imports cleanly into Grafana
  • Backward compatible: the Hostname=~".*" default (All) matches all existing metrics unchanged
  • No exporter code changes required — Hostname label is already present in default metrics
  • Docs not applicable (dashboard is self-documenting in Grafana)

Closes NVIDIA#630

Add a Hostname template variable to the Grafana dashboard so operators
can filter by node name in multi-node clusters.  Update every panel's
legendFormat from 'GPU {{gpu}}' to '{{Hostname}}-GPU {{gpu}}' so graph
series are unambiguously identified by host and GPU index.

Development

Successfully merging this pull request may close these issues.

Show Hostname in the Grafana Dashboard graph labels to prevent confusion on the GPU number