feat(plugins): Add KAI Scheduler plugin for gang-scheduling support #3185
Raakshass wants to merge 1 commit into kubeflow:master
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
528ef0e to ab488e8
Closing - recreated as #3186 with correct title format
Description
This PR implements support for NVIDIA's KAI Scheduler as a new gang-scheduling backend in Kubeflow Trainer, addressing the need for advanced GPU scheduling in AI/ML workloads.
What is KAI Scheduler?
KAI Scheduler is NVIDIA's Kubernetes AI Scheduler that provides:
Changes Made
API Types (pkg/apis/trainer/v1alpha1/trainingruntime_types.go)
- KAI field to PodGroupPolicySource struct
- KAIPodGroupPolicySource struct with:
  - QueueName: Optional queue for multi-tenant scheduling
  - ScheduleTimeoutSeconds: Timeout before failing unschedulable PodGroups
Plugin Implementation (pkg/runtime/framework/plugins/kai/kai.go)
- EnforcePodGroupPolicy: Sets KAI-specific pod labels
  - scheduling.kai.io/pod-group: Associates pods with their PodGroup
  - scheduling.kai.io/queue: Assigns pods to KAI queue
- Build: Creates scheduler-plugins PodGroup resources (compatible with KAI)
Registry (pkg/runtime/framework/plugins/registry.go)
- kai.New in the plugin registry
Tests (pkg/runtime/framework/plugins/kai/kai_test.go)
Usage Example
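The following is a minimal, self-contained Go sketch of how the API fields and pod labels described above could fit together. It is not the code from this PR and does not import Kubeflow Trainer packages. The struct names, field names, and label keys come from the description; the field types (pointers to string and int32) and the helper kaiPodLabels are assumptions made for illustration only.

```go
package main

import "fmt"

// KAIPodGroupPolicySource is a local stand-in for the struct named in the
// PR description. The field types are assumptions, not the PR's definitions.
type KAIPodGroupPolicySource struct {
	// QueueName is the optional queue for multi-tenant scheduling.
	QueueName *string
	// ScheduleTimeoutSeconds is the timeout before an unschedulable
	// PodGroup is failed.
	ScheduleTimeoutSeconds *int32
}

// PodGroupPolicySource is a local stand-in that holds only the KAI field.
type PodGroupPolicySource struct {
	KAI *KAIPodGroupPolicySource
}

// Label keys as listed in the PR description.
const (
	podGroupLabel = "scheduling.kai.io/pod-group"
	queueLabel    = "scheduling.kai.io/queue"
)

// kaiPodLabels is a hypothetical helper. It returns the labels a plugin
// might set on pods: the pod-group label always, and the queue label only
// when a non-empty queue name is configured. It returns nil when the KAI
// policy is not set.
func kaiPodLabels(podGroupName string, src PodGroupPolicySource) map[string]string {
	if src.KAI == nil {
		return nil
	}
	labels := map[string]string{podGroupLabel: podGroupName}
	if src.KAI.QueueName != nil && *src.KAI.QueueName != "" {
		labels[queueLabel] = *src.KAI.QueueName
	}
	return labels
}

func main() {
	queue := "team-a"
	timeout := int32(120)
	src := PodGroupPolicySource{
		KAI: &KAIPodGroupPolicySource{
			QueueName:              &queue,
			ScheduleTimeoutSeconds: &timeout,
		},
	}
	labels := kaiPodLabels("my-trainjob", src)
	fmt.Println(labels[podGroupLabel], labels[queueLabel]) // prints "my-trainjob team-a"
}
```

Treating the queue as optional mirrors the "Optional queue" wording in the description: pods without a configured queue would carry only the pod-group label in this sketch.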
Testing
- go build ./... succeeds
- controller-gen manifests generated
Related Issues
Fixes #2628
Checklist
Base Branch
master
Files Changed
- manifests/base/crds/*.yaml - Updated CRDs