Fix grafana dashboard cannot display properly in vGPU cluster #240

Closed
Levi080513 wants to merge 1 commit into NVIDIA:main from Levi080513:hw/fix-grafana-dashboard-display-failed

@Levi080513

Fixes #236

Test

Create a k8s cluster with vGPU configured on one node.

kc get node hw-sks-test-vgpu-vgpunode-8jwnn -oyaml | yq '.status.allocatable'
cpu: "8"
ephemeral-storage: "57976119610"
hugepages-2Mi: "0"
memory: 15968092Ki
nvidia.com/gpu: "1"
pods: "110"

kc exec -ti nvidia-driver-daemonset-4.18.0-477.27.1.el8.8-rocky8.8-x7qj8 -n sks-system-nvidia-gpu -- nvidia-smi
Mon Jan 29 10:20:21 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:0A.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     12MiB /  4096MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    243152      C   /app/gpu_burn                      12MiB |
+-----------------------------------------------------------------------------+
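In the table above, the GRID V100-4C vGPU reports N/A for fan, temperature, and power, while memory usage and utilization are populated. As a rough illustration only, the Python sketch below splits that second per-GPU row into named fields and lists the ones that are N/A. The helper name `parse_gpu_status_row` and the field names are hypothetical and not part of dcgm-exporter or this PR; the row string is copied from the output above, and the parsing assumes the classic three-column nvidia-smi table layout.

```python
# Hypothetical helper for illustration; not part of dcgm-exporter or this PR.
def parse_gpu_status_row(row: str) -> dict:
    """Split the second per-GPU row of the classic nvidia-smi table into fields."""
    # Cells are delimited by '|'; strip the outer borders before splitting.
    cells = [cell.split() for cell in row.strip().strip("|").split("|")]
    sensors, memory, util = cells
    return {
        "fan": sensors[0],
        "temp": sensors[1],
        "perf": sensors[2],
        "power_usage": sensors[3],
        "power_cap": sensors[5],  # sensors[4] is the '/' separator
        "memory_used": memory[0],
        "memory_total": memory[2],  # memory[1] is the '/' separator
        "gpu_util": util[0],
        "compute_mode": util[1],
    }


# Row copied verbatim from the nvidia-smi output in this test.
ROW = "| N/A   N/A    P0    N/A /  N/A |     12MiB /  4096MiB |      0%      Default |"

fields = parse_gpu_status_row(ROW)
# Fields the driver does not report for this vGPU.
unavailable = sorted(name for name, value in fields.items() if value == "N/A")
print(unavailable)
print(fields["gpu_util"], fields["memory_used"], fields["memory_total"])
```

This is consistent with the dashboard needing a metric other than temperature to discover vGPU devices.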

Before fixing: (screenshot)

After fixing: (two screenshots)

frittentheke added a commit to frittentheke/dcgm-exporter that referenced this pull request on Jul 8, 2024
…name)

* Change PromQL queries to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>

frittentheke added a commit to frittentheke/dcgm-exporter that referenced this pull request on Jul 8, 2024
…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>

frittentheke added a commit to frittentheke/dcgm-exporter that referenced this pull request on Jul 8, 2024
…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
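
The changes listed in the commit messages above can be sketched as query strings. This is only an illustration under stated assumptions, not the queries from the referenced commits: `variable_query` and `aggregated_query` are hypothetical helpers, the `gpu` grouping label and the choice of `avg` are assumptions, and `$Hostname` is assumed to be the dashboard variable name. `label_values(metric, label)` is the variable query function of Grafana's Prometheus data source, and the metric names and the `Hostname` label come from the commit messages.

```python
# Hypothetical helpers for illustration; the real dashboard JSON may differ.
def variable_query(metric: str, label: str) -> str:
    """Grafana variable query listing the values of `label` seen on `metric`."""
    return f"label_values({metric}, {label})"


def aggregated_query(
    metric: str,
    group_by: tuple = ("Hostname", "gpu"),  # assumed grouping labels
    hostname_var: str = "$Hostname",        # assumed dashboard variable
) -> str:
    """PromQL that collapses multiple series per GPU (e.g. MIG subdevices)."""
    matcher = f'{{Hostname=~"{hostname_var}"}}'
    return f"avg by ({', '.join(group_by)}) ({metric}{matcher})"


# Discover hosts from a metric that is also present on vGPU nodes,
# instead of DCGM_FI_DEV_GPU_TEMP.
print(variable_query("DCGM_FI_DEV_FB_FREE", "Hostname"))

# Utilization panel based on the profiling metric named in the commit message.
print(aggregated_query("DCGM_FI_PROF_GR_ENGINE_ACTIVE"))
```

Selecting on `Hostname` rather than `instance` follows the commit message's stated aim of avoiding duplicated time series across daemonset Pods.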
@Levi080513

This PR is a duplicate of #355

Levi080513 closed this on Feb 21, 2025
Development

Successfully merging this pull request may close these issues.

NVIDIA DCGM Exporter Dashboard does not work in vGPU cluster