feat: Support K8s DRA Resources V1 APIs by adityasingh0510 · Pull Request #596 · NVIDIA/dcgm-exporter

adityasingh0510 · 2025-12-08T07:28:59Z

This PR updates dcgm-exporter to support both the stable resource.k8s.io/v1 API and the v1beta1 API for Dynamic Resource Allocation (DRA) support. This ensures compatibility with both Kubernetes 1.34+ clusters (using v1) and older clusters (using v1beta1), with automatic detection and graceful fallback.

Problem

When enabling DRA labels in dcgm-exporter on Kubernetes 1.34+ clusters, the following error occurs:

failed to list v1beta1.ResourceSlice as we have v1.ResourceSlice

This happens because:

Kubernetes 1.34+ promotes the ResourceSlice API from v1beta1 to stable v1
Clusters may only expose the v1 API, breaking code that only uses v1beta1
Older clusters (1.27-1.33) still use v1beta1, so we need to support both

Changes

Files Modified

internal/pkg/transformation/dra.go:
- Register both v1 and v1beta1 ResourceSlice informers
- Implement separate event handlers for each API version:
  - onAddOrUpdateV1() / onAddOrUpdateV1beta1()
  - onDeleteV1() / onDeleteV1beta1()
- Add cache checking in delete handlers to prevent premature device removal
- Handle API structure differences:
  - v1beta1: dev.Basic.Attributes
  - v1: dev.Attributes (direct access, no Basic wrapper)
internal/pkg/transformation/types.go:
- Add v1Informer and v1beta1Informer fields to DRAResourceSliceManager struct
go.mod / go.sum:
- Upgrade k8s.io/api: v0.33.3 → v0.34.0 (adds support for resource/v1)
- Upgrade k8s.io/client-go: v0.33.3 → v0.34.0 (ensures compatibility)
- Upgrade k8s.io/apimachinery: v0.33.3 → v0.34.0

API Structure Changes

The v1 API has a different structure than v1beta1:

API Version	Device Attribute Access
v1beta1	`dev.Basic.Attributes`
v1	`dev.Attributes` (direct)

The implementation handles both structures correctly.

Behavior

Automatic API Detection

The code registers both informers and uses whichever is available:

// Both informers are registered
v1Informer := factory.Resource().V1().ResourceSlices().Informer()
v1beta1Informer := factory.Resource().V1beta1().ResourceSlices().Informer()

// At least one must sync successfully
v1Synced := cache.WaitForCacheSync(ctx.Done(), v1Informer.HasSynced)
v1beta1Synced := cache.WaitForCacheSync(ctx.Done(), v1beta1Informer.HasSynced)

Precedence Logic

When both APIs are available:

v1 takes precedence: v1beta1 only adds devices if v1 doesn't already have them
Delete protection: Before deleting, handlers check if the device exists in the other API's cache
No duplicate entries: Precedence logic ensures each device is only tracked once

Testing

Verification

Code compiles successfully with both API versions
All tests pass - existing unit tests continue to work
No linter errors
v1 API support - verified with Kubernetes 1.34+ API structure
v1beta1 API support - verified with Kubernetes 1.27-1.33 API structure
Dual API handling - both informers work correctly when both are available
Precedence logic - v1 correctly takes precedence over v1beta1
Delete handling - race conditions prevented with cache checking

Test Scenarios

Kubernetes 1.34+ clusters (v1 API only)
Kubernetes 1.27-1.33 clusters (v1beta1 API only)
Clusters with both APIs available (migration periods)
MIG devices work with both API versions

Backward Compatibility

Fully backward compatible:

Existing deployments on Kubernetes 1.27-1.33 continue to work unchanged
No breaking changes for any supported Kubernetes version
No configuration changes required

Forward compatible:

Ready for Kubernetes 1.34+ clusters
Automatically uses the best available API version

Breaking Changes

None - This is a backward and forward compatibility enhancement. The change:

Works on older clusters (1.27-1.33) using v1beta1
Works on newer clusters (1.34+) using v1
Works during migration periods when both are available
Requires no configuration changes

Related Issues

Fixes Add support for K8s v1.34 resource.k8s.io/v1 DRA APIs #590

guptaNswati · 2026-02-23T20:14:48Z

internal/pkg/transformation/dra.go

+	// Wait for at least one informer to sync (either v1 or v1beta1)
+	// Both will sync if both APIs are available
+	v1Synced := cache.WaitForCacheSync(ctx.Done(), v1Informer.HasSynced)
+	v1beta1Synced := cache.WaitForCacheSync(ctx.Done(), v1beta1Informer.HasSynced)


this can hang forever on an old cluster serving v1beta1 api, as this condition will not become true v1Synced := cache.WaitForCacheSync(ctx.Done(), v1Informer.HasSynced).

we need to discover first which api is available.

this will also simplify the onAddOrUpdate and onDelete logic. its inconsistent rn.

Thanks for the catch! I’ve updated the code to use the discovery client to check which of resource.k8s.io/v1 and v1beta1 are available

guptaNswati · 2026-02-23T20:20:47Z

internal/pkg/transformation/dra.go


-func (m *DRAResourceSliceManager) onAddOrUpdate(obj interface{}) {
-	slice := obj.(*resourcev1beta1.ResourceSlice)
+func getAttrStringV1beta1(attrs map[resourcev1beta1.QualifiedName]resourcev1beta1.DeviceAttribute, key resourcev1beta1.QualifiedName) string {


instead of duplicating the helpers, better to define structs for v1 and v1beta

guptaNswati · 2026-02-23T20:57:02Z

@adityasingh0510 thank you for the PR and your patience. I have some comments that needs to be solved before the merge.

adityasingh0510 · 2026-02-24T12:06:49Z

@guptaNswati thank you for the review and comments. I have addressed the feedback and pushed the updates for your review.

guptaNswati · 2026-02-24T22:05:47Z

Overall it looks good. Need to add tests also https://github.com/NVIDIA/dcgm-exporter/blob/main/internal/pkg/transformation/kubernetes_test.go

Need to double check on the edge cases. i will come back to it. meanwhile address the comments, add tests and paste the test and logs

guptaNswati · 2026-02-23T20:55:16Z

internal/pkg/transformation/dra.go

+	// Wait for at least one informer to sync (either v1 or v1beta1)
+	// Both will sync if both APIs are available
+	v1Synced := cache.WaitForCacheSync(ctx.Done(), v1Informer.HasSynced)
+	v1beta1Synced := cache.WaitForCacheSync(ctx.Done(), v1beta1Informer.HasSynced)


this will also simplify the onAddOrUpdate and onDelete logic. its inconsistent rn.

guptaNswati · 2026-02-24T21:55:13Z

internal/pkg/transformation/dra.go

+
+	// Register informers for both v1 and v1beta1 to support both API versions
+	v1Informer := factory.Resource().V1().ResourceSlices().Informer()
+	v1beta1Informer := factory.Resource().V1beta1().ResourceSlices().Informer()


need to use the discovery logic here also to decide which informer to start.

+1

Lets say we had a v1beta1 ResourceSlice and we upgraded to k8s v1.34, even if v1beta1 apiVersion is enabled, we should simply treat it as v1 ResourceSlice and use it that way. The storageVersion of the object would've already converted to v1 in v1.34+. But the object will be available for consumption in both v1 and v1beta1 apiVersions, if enabled. So both of these informers would watch the same object.

We should only use the latest apiVersion enabled. With this, the rest of the code should be simplified

guptaNswati · 2026-02-24T22:02:23Z

internal/pkg/transformation/dra.go

-func (m *DRAResourceSliceManager) onDelete(obj interface{}) {
+// onAddOrUpdateV1 handles v1 API ResourceSlice events
+func (m *DRAResourceSliceManager) onAddOrUpdateV1(obj interface{}) {
+	slice := obj.(*resourcev1.ResourceSlice)


need to check type assertion.

s, ok := obj.(*resourcev1beta1.ResourceSlice) if !ok { return err }

i think its not done originally. need to fix it

guptaNswati · 2026-02-24T22:39:03Z

internal/pkg/transformation/dra.go

-	slice := obj.(*resourcev1beta1.ResourceSlice)
-	pool := slice.Spec.Pool.Name
+// onAddOrUpdate handles ResourceSlice add/update events for both v1 and v1beta1 APIs
+func (m *DRAResourceSliceManager) onAddOrUpdate(adapter resourceSliceAdapter, apiVersion string, v1TakesPrecedence bool) {


when this was originally written, the assumption was that ResourceSlices are static and once a device exists, it wont go away. But we recently added support for some features in dra driver where ResourceSlice can be updated and republished. I am debating if that should be handled here or create a new issue for that

we can end up with stale keys here.

@varunrsekar what is your opinion here. it should not cause any issues when vfio mode is enabled as dcgm wont work anyway. but later it will be imp for vfio also. or we may move away from updating resourceslice. but this wont hurt to have a sync here in both add and delete

On ADD: add all devices from the slice to the cache

On UPDATE: cleanup all cached devices from that slice and re-add new list to the cache.

On DELETE: cleanup all slice devices

Otherwise, we'll end up leaking memory if the slice churns for whatever reason.

varunrsekar · 2026-02-24T23:20:34Z

internal/pkg/transformation/dra.go

+
+	// Register informers for both v1 and v1beta1 to support both API versions
+	v1Informer := factory.Resource().V1().ResourceSlices().Informer()
+	v1beta1Informer := factory.Resource().V1beta1().ResourceSlices().Informer()


+1

Lets say we had a v1beta1 ResourceSlice and we upgraded to k8s v1.34, even if v1beta1 apiVersion is enabled, we should simply treat it as v1 ResourceSlice and use it that way. The storageVersion of the object would've already converted to v1 in v1.34+. But the object will be available for consumption in both v1 and v1beta1 apiVersions, if enabled. So both of these informers would watch the same object.

We should only use the latest apiVersion enabled. With this, the rest of the code should be simplified

varunrsekar · 2026-02-24T23:21:50Z

internal/pkg/transformation/types.go

+	v1Informer      cache.SharedIndexInformer
+	v1beta1Informer cache.SharedIndexInformer


We should have only a single informer here. Depending on the latest API version enabled in the cluster, the corresponding informer should be configured here.

varunrsekar · 2026-02-24T23:32:41Z

internal/pkg/transformation/dra.go


-		deviceType := getAttrString(attr, "type")
+		deviceType := dev.GetAttribute("type")
 		switch deviceType {


can you add a default case to log the type that's not handled? It'll provide hints for users if they need to eventually implement it here

varunrsekar · 2026-02-24T23:34:52Z

internal/pkg/transformation/dra.go

+		key := pool + "/" + dev.GetName()

-		deviceType := getAttrString(attr, "type")
+		deviceType := dev.GetAttribute("type")


We are implicitly using the NVIDIA GPU DRA Driver as the reference for this code. If there are GPU DRA vendors that don't implement it this way, then DCGM-exporter will not work with it. Would be good to call it out.

varunrsekar · 2026-02-25T01:27:37Z

internal/pkg/transformation/dra.go

-	slice := obj.(*resourcev1beta1.ResourceSlice)
-	pool := slice.Spec.Pool.Name
+// onAddOrUpdate handles ResourceSlice add/update events for both v1 and v1beta1 APIs
+func (m *DRAResourceSliceManager) onAddOrUpdate(adapter resourceSliceAdapter, apiVersion string, v1TakesPrecedence bool) {


On ADD: add all devices from the slice to the cache

On UPDATE: cleanup all cached devices from that slice and re-add new list to the cache.

On DELETE: cleanup all slice devices

Otherwise, we'll end up leaking memory if the slice churns for whatever reason.

varunrsekar · 2026-02-25T01:34:36Z

internal/pkg/transformation/dra.go

+				if v1TakesPrecedence {
+					if _, exists := m.deviceToUUID[key]; !exists {
+						m.deviceToUUID[key] = uuid
+						slog.Debug(fmt.Sprintf("Added gpu device [key:%s] with UUID: %s (%s)", key, uuid, apiVersion))
+					}
+				} else {
+					m.deviceToUUID[key] = uuid
+					slog.Debug(fmt.Sprintf("Added gpu device [key:%s] with UUID: %s (%s)", key, uuid, apiVersion))
+				}


If I read this piece of code correctly and how onAddOrUpdate is invoked:

For v1 API, we simply override the deviceToUUID map.

For v1beta1 API, we don't override and only add to deviceToUUID map if it doesnt exist.

Can you help me understand why this is needed?

feat: Add dual API support for ResourceSlice (v1 and v1beta1)

2d09218

adityasingh0510 force-pushed the feature/k8s-v1-resource-api-support branch from c179153 to 2d09218 Compare December 8, 2025 09:40

adityasingh0510 changed the title ~~feat: Add support for Kubernetes v1.34 resource.k8s.io/v1 APIs~~ feat: Add dual API support for ResourceSlice (v1 and v1beta1) Dec 8, 2025

adityasingh0510 changed the title ~~feat: Add dual API support for ResourceSlice (v1 and v1beta1)~~ feat: Support K8s DRA Resources V1 APIs Dec 8, 2025

guptaNswati reviewed Feb 23, 2026

View reviewed changes

Handle DRA ResourceSlice v1/v1beta1 APIs safely and dedupe helpers

8cbeff9

adityasingh0510 requested a review from guptaNswati February 24, 2026 12:12

guptaNswati reviewed Feb 24, 2026

View reviewed changes

varunrsekar reviewed Feb 25, 2026

View reviewed changes

		v1Informer cache.SharedIndexInformer
		v1beta1Informer cache.SharedIndexInformer

Comments

Conversation

adityasingh0510 commented Dec 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

Files Modified

API Structure Changes

Behavior

Automatic API Detection

Precedence Logic

Testing

Verification

Test Scenarios

Backward Compatibility

Breaking Changes

Related Issues

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

guptaNswati commented Feb 23, 2026

Uh oh!

adityasingh0510 commented Feb 24, 2026

Uh oh!

guptaNswati commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

adityasingh0510 commented Dec 8, 2025 •

edited

Loading

guptaNswati commented Feb 24, 2026 •

edited

Loading