feat(plugins): Add KAI Scheduler plugin for gang-scheduling support #3185

Closed
Raakshass wants to merge 1 commit into kubeflow:master from Raakshass:feature/kai-scheduler-2628

@Raakshass

Description

This PR implements support for NVIDIA's KAI Scheduler as a new gang-scheduling backend in Kubeflow Trainer, addressing the need for advanced GPU scheduling in AI/ML workloads.

What is KAI Scheduler?

KAI Scheduler is NVIDIA's Kubernetes AI Scheduler that provides:

  • Gang scheduling for distributed training jobs
  • GPU-aware bin-packing for optimal resource utilization
  • Topology-aware placement (NVLink, NVSwitch)
  • Queue-based multi-tenant scheduling
  • Native integration with NVIDIA GPU Operator

Changes Made

API Types (pkg/apis/trainer/v1alpha1/trainingruntime_types.go)

  • Added KAI field to PodGroupPolicySource struct
  • Defined KAIPodGroupPolicySource struct with:
    • QueueName: Optional KAI queue to submit the job to, for multi-tenant scheduling
    • ScheduleTimeoutSeconds: Seconds to wait before an unschedulable PodGroup is failed (default: 60)
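
For reviewers who want a picture of the API shape, here is a rough sketch of what the two types could look like. It is not copied from the diff: the pointer field types, the JSON tag names (inferred from the usage example's `kai`, `queueName`, and `scheduleTimeoutSeconds` keys), and the `timeoutOrDefault` helper are assumptions made for illustration, and the real `PodGroupPolicySource` has other members that are omitted here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// KAIPodGroupPolicySource is a hypothetical reconstruction of the struct
// described above; field types and tags are assumptions, not the PR's code.
type KAIPodGroupPolicySource struct {
	// QueueName optionally names the KAI queue used for multi-tenant scheduling.
	QueueName *string `json:"queueName,omitempty"`
	// ScheduleTimeoutSeconds is how long an unschedulable PodGroup may wait.
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}

// PodGroupPolicySource shows only the new KAI member; other backends are omitted.
type PodGroupPolicySource struct {
	KAI *KAIPodGroupPolicySource `json:"kai,omitempty"`
}

// timeoutOrDefault is an illustrative helper (not from the PR) that applies
// the 60-second default mentioned in the description.
func timeoutOrDefault(p *KAIPodGroupPolicySource) int32 {
	if p == nil || p.ScheduleTimeoutSeconds == nil {
		return 60
	}
	return *p.ScheduleTimeoutSeconds
}

func main() {
	queue := "high-priority"
	timeout := int32(120)
	src := PodGroupPolicySource{
		KAI: &KAIPodGroupPolicySource{QueueName: &queue, ScheduleTimeoutSeconds: &timeout},
	}

	out, err := json.Marshal(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println(timeoutOrDefault(src.KAI), timeoutOrDefault(nil))
}
```

Optional pointer fields with `omitempty` keep an unset queue or timeout out of the serialized object, which is the usual convention for optional fields in Kubernetes-style API types.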

Plugin Implementation (pkg/runtime/framework/plugins/kai/kai.go)

  • EnforcePodGroupPolicy: Sets KAI-specific pod labels
    • scheduling.kai.io/pod-group: Associates pods with their PodGroup
    • scheduling.kai.io/queue: Assigns pods to KAI queue
  • Build: Creates scheduler-plugins PodGroup resources (compatible with KAI)
    • Aggregates MinMember from all PodSets
    • Calculates total MinResources from pod requests
    • Sets ScheduleTimeoutSeconds (default: 60s)
    • Configures proper owner references
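
The label and aggregation steps above can be sketched roughly as follows. This is a simplified illustration, not the plugin code: `PodSet`, `kaiLabels`, and `aggregate` are hypothetical names, resource quantities are plain integers rather than Kubernetes `resource.Quantity` values, and it assumes MinResources is the per-pod request multiplied by each PodSet's replica count. The label keys are the ones listed in this description.

```go
package main

import "fmt"

// PodSet is a simplified, hypothetical stand-in for the runtime's
// per-role pod description (name, replica count, per-pod requests).
type PodSet struct {
	Name     string
	Count    int32
	Requests map[string]int64 // per-pod requests as plain integers
}

const (
	podGroupLabel = "scheduling.kai.io/pod-group"
	queueLabel    = "scheduling.kai.io/queue"
)

// kaiLabels returns the pod labels described above. The queue label is
// only set when a queue name is configured.
func kaiLabels(podGroupName, queueName string) map[string]string {
	labels := map[string]string{podGroupLabel: podGroupName}
	if queueName != "" {
		labels[queueLabel] = queueName
	}
	return labels
}

// aggregate sums replica counts into MinMember and multiplies each PodSet's
// per-pod requests by its replica count to build MinResources.
func aggregate(podSets []PodSet) (int32, map[string]int64) {
	var minMember int32
	minResources := make(map[string]int64)
	for _, ps := range podSets {
		minMember += ps.Count
		for name, perPod := range ps.Requests {
			minResources[name] += perPod * int64(ps.Count)
		}
	}
	return minMember, minResources
}

func main() {
	podSets := []PodSet{
		{Name: "launcher", Count: 1, Requests: map[string]int64{"cpu": 1}},
		{Name: "node", Count: 4, Requests: map[string]int64{"cpu": 8, "nvidia.com/gpu": 4}},
	}

	minMember, minResources := aggregate(podSets)
	// prints: 5 33 16
	fmt.Println(minMember, minResources["cpu"], minResources["nvidia.com/gpu"])
	fmt.Println(kaiLabels("my-trainjob", "high-priority"))
}
```

Setting MinMember to the total replica count is what gives gang semantics: the scheduler either places all pods of the job or none of them.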

Registry (pkg/runtime/framework/plugins/registry.go)

  • Registered kai.New in the plugin registry

Tests (pkg/runtime/framework/plugins/kai/kai_test.go)

  • 8 comprehensive test cases covering:
    • Nil info/trainJob handling
    • PodGroup creation with proper MinMember/MinResources
    • Queue name label assignment
    • API error handling
    • Existing PodGroup skip logic
    • Default timeout values

Usage Example

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntime
metadata:
  name: kai-distributed-training
spec:
  mlPolicy:
    torch:
      numProcPerNode: "4"
  podGroupPolicy:
    kai:
      queueName: "high-priority"
      scheduleTimeoutSeconds: 120
  template:
    spec:
      containers:
      - name: trainer
        resources:
          limits:
            nvidia.com/gpu: 4

Testing

  • All 8 unit tests pass
  • go build ./... succeeds
  • controller-gen manifests generated

Related Issues

Fixes #2628

Checklist

  • Code follows existing patterns (similar to coscheduling plugin)
  • Unit tests added with comprehensive coverage
  • API types include proper JSON tags and documentation
  • Generated deepcopy functions updated
  • CRD manifests regenerated

Base Branch

master

Files Changed

  • pkg/apis/trainer/v1alpha1/trainingruntime_types.go - Added KAI API types
  • pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go - Generated deepcopy
  • pkg/runtime/framework/plugins/kai/kai.go - KAI plugin implementation
  • pkg/runtime/framework/plugins/kai/kai_test.go - Comprehensive tests
  • pkg/runtime/framework/plugins/registry.go - Plugin registration
  • manifests/base/crds/*.yaml - Updated CRDs

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot requested a review from jinchihe February 8, 2026 19:35
@github-actions

github-actions bot commented Feb 8, 2026

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
@Raakshass Raakshass force-pushed the feature/kai-scheduler-2628 branch from 528ef0e to ab488e8 Compare February 8, 2026 19:36
@Raakshass
Author

Raakshass commented Feb 8, 2026

Closing - recreated as #3186 with correct title format

@Raakshass Raakshass closed this Feb 8, 2026
Development

Successfully merging this pull request may close these issues.

Support KAI Scheduler in Kubeflow Trainer
