Enhanced Kubernetes Pod Label Support with Kubelet API Integration #592

Open
Raid57 wants to merge 4 commits into NVIDIA:main from Raid57:4.4.2-4.7.0-raid57-dev
Conversation

@Raid57 Raid57 commented Dec 3, 2025

Summary

Builds on the existing Kubernetes pod label support (PR #515) and the label filtering feature (PR #564) by adding kubelet API integration. This enables more granular GPU application classification, better GPU utilization tracking, and support for internal scheduling labels (e.g., accelerator, quota) that show how each GPU card type is being used. It also significantly reduces API server load in large clusters.

We have been successfully running these changes in production environments, and they significantly improve observability and performance for GPU resource management in large-scale Kubernetes clusters.

Implementation

Extends the existing PodMapper with kubelet API integration.

Key Features

  • Kubelet API Support: Optional use of kubelet /pods API instead of API server for fetching pod metadata (labels, UID), reducing API server load in large clusters
  • Backward Compatible: All new features are opt-in and disabled by default
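To illustrate the kubelet path, here is a minimal Python sketch (not the PR's Go code) of reading pod UIDs and labels from a kubelet /pods response. The kubelet returns a PodList JSON document, so the labels sit under items[].metadata.labels. The names PodInfo, parse_pod_list, fetch_pod_list, and the sample payload are hypothetical, and the bearer-token and CA-file handling is an assumption about how the kubelet is secured in a given cluster.

```python
import json
import ssl
import urllib.request
from typing import Dict, NamedTuple, Tuple


class PodInfo(NamedTuple):
    """Hypothetical container for the metadata the exporter needs."""
    uid: str
    labels: Dict[str, str]


def parse_pod_list(payload: str) -> Dict[Tuple[str, str], PodInfo]:
    """Map (namespace, name) to the UID and labels found in a PodList JSON document."""
    doc = json.loads(payload)
    pods: Dict[Tuple[str, str], PodInfo] = {}
    for item in doc.get("items") or []:
        meta = item.get("metadata") or {}
        key = (meta.get("namespace", ""), meta.get("name", ""))
        # Pods without labels omit the field entirely, so default to an empty dict.
        pods[key] = PodInfo(uid=meta.get("uid", ""), labels=dict(meta.get("labels") or {}))
    return pods


def fetch_pod_list(kubelet_url: str, token: str, ca_file: str) -> str:
    """Hypothetical helper: GET <kubelet_url>/pods using a bearer token.

    Assumes the kubelet accepts token authentication and that ca_file
    can verify its serving certificate; both depend on cluster setup.
    """
    request = urllib.request.Request(
        kubelet_url.rstrip("/") + "/pods",
        headers={"Authorization": "Bearer " + token},
    )
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.read().decode("utf-8")


# Made-up response used only to exercise the parser without a live kubelet.
SAMPLE_RESPONSE = json.dumps({
    "kind": "PodList",
    "apiVersion": "v1",
    "items": [
        {"metadata": {
            "name": "trainer-0",
            "namespace": "default",
            "uid": "11111111-2222-3333-4444-555555555555",
            "labels": {"accelerator": "nvidia-a100", "quota": "team-ml"},
        }},
        {"metadata": {"name": "no-labels", "namespace": "default", "uid": "abc"}},
    ],
})

pods = parse_pod_list(SAMPLE_RESPONSE)
```

Because every exporter instance only asks its local kubelet for the pods on its own node, the request volume no longer scales with cluster size at the API server, which is the load reduction described above.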

Configuration

Command Line Flags

# Enable kubelet API for pod metadata (instead of API server)
--kubernetes-use-kubelet-api

# Configure kubelet URL (default: https://127.0.0.1:10250)
--kubernetes-kubelet-url=https://127.0.0.1:10250

# Filter pod labels using regex patterns (from PR #564)
--kubernetes-pod-label-allowlist-regex="^accelerator$,^quota$,^app\\.kubernetes\\.io/.*"

# Configure label filter cache size (from PR #564, default: 150000)
--kubernetes-pod-label-cache-size=150000

Environment Variables

DCGM_EXPORTER_KUBERNETES_USE_KUBELET_API=true
DCGM_EXPORTER_KUBERNETES_KUBELET_URL=https://127.0.0.1:10250
DCGM_EXPORTER_KUBERNETES_POD_LABEL_ALLOWLIST_REGEX="^accelerator$,^quota$"
DCGM_EXPORTER_KUBERNETES_POD_LABEL_CACHE_SIZE=150000

Helm Chart Configuration

kubernetes:
  enablePodLabels: true
  useKubeletAPI: true  # New: Enable kubelet API
  kubeletURL: "https://127.0.0.1:10250"  # New: Kubelet URL
  podLabelAllowlistRegex:  # From PR #564
    - "^accelerator$"      # Match accelerator label
    - "^quota$"            # Match quota label
    - "^app\\.kubernetes\\.io/.*"  # Match all app.kubernetes.io/* labels
  podLabelCacheSize: 150000  # From PR #564
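The allowlist and cache settings above can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the implementation from PR #564: the LabelFilter class is hypothetical, the patterns are split on commas as in the flag value, matching is unanchored unless the pattern carries ^ and $, an empty allowlist is assumed to let every label through, and the cache is a simple least-recently-used map keyed by label name.

```python
import re
from collections import OrderedDict
from typing import Dict


class LabelFilter:
    """Hypothetical allowlist filter with a bounded cache of per-key decisions."""

    def __init__(self, allowlist: str, cache_size: int = 150000) -> None:
        # Comma-separated patterns, so a pattern cannot itself contain a comma here.
        self._patterns = [re.compile(p) for p in allowlist.split(",") if p]
        self._cache_size = cache_size
        self._cache: "OrderedDict[str, bool]" = OrderedDict()

    def allowed(self, key: str) -> bool:
        """Return True if the label key matches any allowlist pattern."""
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        # Assumption: with no patterns configured, every label is kept.
        result = not self._patterns or any(p.search(key) for p in self._patterns)
        self._cache[key] = result
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used key
        return result

    def filter(self, labels: Dict[str, str]) -> Dict[str, str]:
        """Keep only the labels whose keys pass the allowlist."""
        return {k: v for k, v in labels.items() if self.allowed(k)}


label_filter = LabelFilter(r"^accelerator$,^quota$,^app\.kubernetes\.io/.*")
kept = label_filter.filter({
    "accelerator": "nvidia-a100",
    "team": "research",
    "app.kubernetes.io/name": "trainer",
})
# kept == {"accelerator": "nvidia-a100", "app.kubernetes.io/name": "trainer"}
```

Caching the decision per label key avoids re-running every regex on each scrape, and the size cap keeps memory bounded when pods carry many distinct label names.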

Use Cases

  1. GPU Card Type Utilization Tracking
    • Use accelerator label to track utilization by GPU model/type
    • Example: accelerator: "nvidia-tesla-v100" or accelerator: "nvidia-a100"
  2. Quota and Resource Management
    • Use quota label to track GPU usage by quota/resource pool
    • Enables chargeback and resource allocation analysis

Related PRs and Issues

This enhancement builds upon:

Collaborator

glowkey commented Dec 3, 2025

Thanks for this PR! Would it be possible to add a few unit tests or integration tests in tests/e2e?

Author

Raid57 commented Dec 4, 2025

Sure! I’ll add some tests soon.

To Verify Get Pod Labels With Kubelet API
Author

Raid57 commented Dec 10, 2025

Thanks for this PR! Would it be possible to add a few unit tests or integration tests in tests/e2e?

I've submitted the tests and am waiting for your review.

@Raid57 Raid57 closed this Dec 10, 2025
@Raid57 Raid57 reopened this Dec 10, 2025
Collaborator

glowkey commented Dec 10, 2025

Thanks for updating the PR with some tests! We are planning to test and validate this MR for our next major release in January 2026.
