Fix grafana dashboard cannot display properly in vGPU cluster by Levi080513 · Pull Request #240 · NVIDIA/dcgm-exporter

Levi080513 · 2024-01-29T02:25:51Z

Test

Create a k8s cluster with vGPU configured on one node.

kc get node hw-sks-test-vgpu-vgpunode-8jwnn -oyaml | yq '.status.allocatable'
cpu: "8"
ephemeral-storage: "57976119610"
hugepages-2Mi: "0"
memory: 15968092Ki
nvidia.com/gpu: "1"
pods: "110"

kc exec -ti nvidia-driver-daemonset-4.18.0-477.27.1.el8.8-rocky8.8-x7qj8 -n sks-system-nvidia-gpu -- nvidia-smi
Mon Jan 29 10:20:21 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:0A.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     12MiB /  4096MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    243152      C   /app/gpu_burn                      12MiB |
+-----------------------------------------------------------------------------+

Before fixing

After fixing

…name)
* Change PromQL queries to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)

Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>

…name)
* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>

…name)
* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
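The commit messages above replace DCGM_FI_DEV_GPU_TEMP because it apparently does not cover vGPU (the nvidia-smi output in the test also reports the temperature as N/A). A minimal sketch of how one might check which metrics a given exporter actually emits before relying on them in a dashboard is shown below. It parses Prometheus text exposition output; the sample text, its label values, and the helper names `exported_metric_names` and `missing_metrics` are hypothetical and not part of dcgm-exporter.

```python
import re

# Hypothetical sample of exporter output for a vGPU node.
# The label values and numbers are made up for illustration only.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="vgpu-node"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="vgpu-node"} 4084
"""


def exported_metric_names(exposition_text):
    """Return the set of metric names that have at least one sample line."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name is the leading identifier before '{' or whitespace.
        match = re.match(r"[A-Za-z_:][A-Za-z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text, wanted):
    """Return the metrics from `wanted` that have no samples, in input order."""
    present = exported_metric_names(exposition_text)
    return [name for name in wanted if name not in present]


if __name__ == "__main__":
    wanted = ["DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_FREE"]
    print(missing_metrics(SAMPLE_METRICS, wanted))
```

In practice the text would come from the exporter's metrics endpoint rather than a hard-coded sample. A dashboard variable query built on a metric reported as missing here would return no series on such a node.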

Levi080513 · 2025-02-21T02:29:12Z

This PR is a duplicate of #355

Fix grafana dashboard cannot display properly in GPU cluster

e458834

frittentheke mentioned this pull request Jul 8, 2024

[dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname) #355

Open

Levi080513 closed this Feb 21, 2025

Levi080513 wants to merge 1 commit into NVIDIA:main from Levi080513:hw/fix-grafana-dashboard-display-failed
