Enhanced Kubernetes Pod Label Support with Kubelet API Integration by Raid57 · Pull Request #592 · NVIDIA/dcgm-exporter

Raid57 · 2025-12-03T07:54:26Z

Summary

Builds on the existing Kubernetes pod label support (PR #515) and the label filtering feature (PR #564) by adding kubelet API integration. Pod labels enable more granular GPU application classification and better GPU utilization tracking, and internal scheduling labels (e.g., accelerator, quota) show how each GPU card type is being used. Fetching pod metadata from the kubelet instead of the API server significantly reduces API server load in large clusters.

We have been successfully running these changes in production environments, and they significantly improve observability and performance for GPU resource management in large-scale Kubernetes clusters.

Implementation

Extends the existing PodMapper with kubelet API integration.

Key Features

Kubelet API Support: Optional use of kubelet /pods API instead of API server for fetching pod metadata (labels, UID), reducing API server load in large clusters
Backward Compatible: All new features are opt-in and disabled by default
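
To make the kubelet path concrete: the kubelet's /pods endpoint returns a Kubernetes PodList JSON document describing the pods on that node. The following is a minimal sketch in Python, not the exporter's Go implementation, of pulling the UID and labels out of such a response. The helper name extract_pod_metadata and the sample payload are hypothetical and exist only for illustration; a real request would also need authenticated HTTPS access to the kubelet, which the sketch leaves out.

```python
import json


def extract_pod_metadata(pod_list_json: str) -> dict:
    """Map (namespace, name) -> {"uid": ..., "labels": {...}} from a PodList JSON string."""
    pod_list = json.loads(pod_list_json)
    result = {}
    for pod in pod_list.get("items") or []:
        meta = pod.get("metadata", {})
        key = (meta.get("namespace", ""), meta.get("name", ""))
        result[key] = {
            "uid": meta.get("uid", ""),
            # Pods without labels omit the field entirely.
            "labels": dict(meta.get("labels") or {}),
        }
    return result


# Hypothetical, trimmed-down response body for illustration only.
SAMPLE = json.dumps({
    "kind": "PodList",
    "items": [
        {"metadata": {"namespace": "ml", "name": "trainer-0", "uid": "1234-abcd",
                      "labels": {"accelerator": "nvidia-a100", "quota": "team-a"}}},
        {"metadata": {"namespace": "ml", "name": "idle-0", "uid": "5678-efgh"}},
    ],
})

pods = extract_pod_metadata(SAMPLE)
print(pods[("ml", "trainer-0")]["labels"])
```

Because the kubelet only knows about pods on its own node, each exporter instance reads local data rather than querying the cluster-wide API server.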

Configuration

Command Line Flags

# Enable kubelet API for pod metadata (instead of API server)
--kubernetes-use-kubelet-api

# Configure kubelet URL (default: https://127.0.0.1:10250)
--kubernetes-kubelet-url=https://127.0.0.1:10250

# Filter pod labels using regex patterns (from PR #564)
--kubernetes-pod-label-allowlist-regex="^accelerator$,^quota$,^app\\.kubernetes\\.io/.*"

# Configure label filter cache size (from PR #564, default: 150000)
--kubernetes-pod-label-cache-size=150000
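
As a rough illustration of how an allowlist like the one above could behave, here is a Python sketch that keeps only the label keys matching at least one pattern and caches each decision in a bounded cache. Several details are assumptions rather than facts about the exporter: that the flag value is split on commas, that patterns are matched against label keys, that the cache is evicted in least-recently-used order, and that an empty allowlist admits nothing. The class name LabelAllowlist is hypothetical. The exporter itself is written in Go, whose regexp syntax differs from Python's re module in places, although these simple patterns behave the same in both.

```python
import re
from collections import OrderedDict


class LabelAllowlist:
    """Keep only label keys matching at least one pattern; cache decisions per key."""

    def __init__(self, patterns_csv: str, cache_size: int = 150000):
        # Assumption: the flag value is a comma-separated list of patterns.
        self._patterns = [re.compile(p) for p in patterns_csv.split(",") if p]
        self._cache = OrderedDict()  # label key -> bool, least recently used first
        self._cache_size = cache_size

    def allowed(self, key: str) -> bool:
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key]
        decision = any(p.search(key) for p in self._patterns)
        self._cache[key] = decision
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return decision

    def filter(self, labels: dict) -> dict:
        return {k: v for k, v in labels.items() if self.allowed(k)}


allowlist = LabelAllowlist(r"^accelerator$,^quota$,^app\.kubernetes\.io/.*")
print(allowlist.filter({
    "accelerator": "nvidia-a100",
    "quota": "team-a",
    "app.kubernetes.io/name": "trainer",
    "pod-template-hash": "5d4f8c7b9",
}))
```

Caching the per-key decision means each distinct label key is run through the regex list only once while it stays in the cache, which matters when the same keys appear on thousands of pods.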

Environment Variables

DCGM_EXPORTER_KUBERNETES_USE_KUBELET_API=true
DCGM_EXPORTER_KUBERNETES_KUBELET_URL=https://127.0.0.1:10250
DCGM_EXPORTER_KUBERNETES_POD_LABEL_ALLOWLIST_REGEX="^accelerator$,^quota$"
DCGM_EXPORTER_KUBERNETES_POD_LABEL_CACHE_SIZE=150000
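
The sketch below shows, in Python, one way these four variables could be resolved into settings with the defaults stated above (kubelet API disabled, URL https://127.0.0.1:10250, cache size 150000). It is an illustration only: KubeletSettings and settings_from_env are hypothetical names, and the boolean parsing and comma splitting are assumptions, not the exporter's actual parsing rules.

```python
from dataclasses import dataclass, field

PREFIX = "DCGM_EXPORTER_KUBERNETES_"


@dataclass
class KubeletSettings:
    use_kubelet_api: bool = False                 # opt-in, disabled by default
    kubelet_url: str = "https://127.0.0.1:10250"  # default from the flag description
    allowlist_regex: list = field(default_factory=list)
    cache_size: int = 150000                      # default from the flag description


def settings_from_env(env) -> KubeletSettings:
    """Build settings from a mapping such as os.environ; unset variables keep defaults."""
    s = KubeletSettings()
    if PREFIX + "USE_KUBELET_API" in env:
        # Assumption: only "1" and "true" (any case) count as enabled.
        s.use_kubelet_api = env[PREFIX + "USE_KUBELET_API"].strip().lower() in ("1", "true")
    if PREFIX + "KUBELET_URL" in env:
        s.kubelet_url = env[PREFIX + "KUBELET_URL"]
    if PREFIX + "POD_LABEL_ALLOWLIST_REGEX" in env:
        # Assumption: patterns are separated by commas.
        raw = env[PREFIX + "POD_LABEL_ALLOWLIST_REGEX"]
        s.allowlist_regex = [p for p in raw.split(",") if p]
    if PREFIX + "POD_LABEL_CACHE_SIZE" in env:
        s.cache_size = int(env[PREFIX + "POD_LABEL_CACHE_SIZE"])
    return s


settings = settings_from_env({
    "DCGM_EXPORTER_KUBERNETES_USE_KUBELET_API": "true",
    "DCGM_EXPORTER_KUBERNETES_POD_LABEL_ALLOWLIST_REGEX": "^accelerator$,^quota$",
})
print(settings)
```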

Helm Chart Configuration

kubernetes:
  enablePodLabels: true
  useKubeletAPI: true  # New: Enable kubelet API
  kubeletURL: "https://127.0.0.1:10250"  # New: Kubelet URL
  podLabelAllowlistRegex:  # From PR #564
    - "^accelerator$"      # Match accelerator label
    - "^quota$"            # Match quota label
    - "^app\\.kubernetes\\.io/.*"  # Match all app.kubernetes.io/* labels
  podLabelCacheSize: 150000  # From PR #564

Use Cases

GPU Card Type Utilization Tracking
- Use accelerator label to track utilization by GPU model/type
- Example: accelerator: "nvidia-tesla-v100" or accelerator: "nvidia-a100"
Quota and Resource Management
- Use quota label to track GPU usage by quota/resource pool
- Enables chargeback and resource allocation analysis
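
Both use cases come down to grouping GPU metrics by a pod label value. In practice this would typically be a query in the monitoring system; the Python sketch below only illustrates the grouping itself. The function name, the sample structure, and the numbers are all hypothetical and are not taken from the exporter's output.

```python
from collections import defaultdict


def mean_utilization_by_label(samples, label):
    """Average GPU utilization (percent) grouped by the value of one pod label."""
    totals = defaultdict(lambda: [0.0, 0])  # label value -> [sum, count]
    for s in samples:
        value = s["labels"].get(label)
        if value is None:
            continue  # sample carries no such label; skip it
        totals[value][0] += s["gpu_util"]
        totals[value][1] += 1
    return {v: total / count for v, (total, count) in totals.items()}


# Made-up samples for illustration only.
samples = [
    {"labels": {"accelerator": "nvidia-a100", "quota": "team-a"}, "gpu_util": 80.0},
    {"labels": {"accelerator": "nvidia-a100", "quota": "team-b"}, "gpu_util": 60.0},
    {"labels": {"accelerator": "nvidia-tesla-v100", "quota": "team-a"}, "gpu_util": 30.0},
]

print(mean_utilization_by_label(samples, "accelerator"))
print(mean_utilization_by_label(samples, "quota"))
```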

Related PRs and Issues

This enhancement builds upon:

#515: Kubernetes Pod Label Support
#564: Add allow list for pod label filtering (provides labelFilterCache and label filtering functionality)

glowkey · 2025-12-03T15:51:59Z

Thanks for this PR! Would it be possible to add a few unit tests or integration tests in tests/e2e?

Raid57 · 2025-12-04T03:02:23Z

Sure! I’ll add some tests soon.

Raid57 · 2025-12-10T05:41:05Z

> Thanks for this PR! Would it be possible to add a few unit tests or integration tests in tests/e2e?

I've submitted the tests and am waiting for your review.

glowkey · 2025-12-10T16:38:53Z

Thanks for updating the PR with some tests! We are planning to test and validate this MR for our next major release in January 2026.

yufeng.huang added 2 commits December 2, 2025 22:37

Get Pod Label With Kubelet API

f815a62

Update Template of clusterrole.yaml

4dc579c

ADD tests/e2e

f8bf978

To Verify Get Pod Labels With Kubelet API

Raid57 closed this Dec 10, 2025

Raid57 reopened this Dec 10, 2025

Merge branch 'main' into 4.4.2-4.7.0-raid57-dev

5e94961

Raid57 wants to merge 4 commits into NVIDIA:main from Raid57:4.4.2-4.7.0-raid57-dev
