feat(runtimes): Add KAI Scheduler plugin for gang-scheduling support#3186
Raakshass wants to merge 2 commits into kubeflow:master
This PR implements support for NVIDIA's KAI Scheduler as a new gang-scheduling backend in Kubeflow Trainer.
Changes:
- Added KAI field to PodGroupPolicySource struct
- Defined KAIPodGroupPolicySource with QueueName and ScheduleTimeoutSeconds
- Implemented kai plugin with EnforcePodGroupPolicy and Build methods
- Added 8 comprehensive unit tests
- Registered kai.New in the plugin registry
Fixes kubeflow#2628
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
|
/ok-to-test |
|
Hi @andreyvelich @jaiakash — gentle ping on this PR. It's been ready for review since Feb 8. The KAI Scheduler plugin adds gang-scheduling support for NVIDIA's GPU scheduler with 8 test cases covering all edge cases. Happy to address any feedback. Thanks! |
|
@EkinKarabulut @Ronkahn21 Please can you help with reviewing KAI Scheduling support? |
|
Thanks for pulling in the KAI team, @andreyvelich.
1. Different pod labeling scheme
2. Queue-based scheduling
3. Consistent plugin architecture
4. No code duplication
Happy to refactor if the team prefers a different architecture, but this mirrors how volcano was integrated. cc @EkinKarabulut @Ronkahn21 |
|
Since I’m not on the KAI team, I’ll defer to @omer-dayan and @romanbaron to review this. |
|
Thanks a lot for working on KAI Scheduler support @Raakshass!
First, apologies that this has taken a while. I've been working on this integration from both ends: the KAI Scheduler side and the Trainer side. The initial integration got delayed because of some RBAC concerns raised in the team, which led us to wait for JobSet support to land in KAI's podgrouper first. Once that was merged, we were able to build TrainJob support on top of it cleanly. So it took longer than expected, but the foundation is solid now.
Here is some context from the KAI side that's relevant here:
KAI-side changes (merged yesterday):
What this means for the Trainer-side plugin
I have a working implementation with end-to-end validation on a real cluster (4x T4 GPUs): https://github.com/EkinKarabulut/trainer/tree/pr-kaischeduler
I see two options and I'm happy with either:
Follow-up work (where collaboration would be really valuable):
Let me know what you'd prefer and happy to jump on a call if that's easier. Looking forward to getting this landed together either way! :) |
|
Thanks @EkinKarabulut for the detailed context — and congrats on getting the KAI-side TrainJob support merged (NVIDIA/KAI-Scheduler#1020)! The podgrouper approach is architecturally cleaner — having KAI own the PodGroup lifecycle via the ownership chain (Pod → Job → JobSet → TrainJob) eliminates the redundant CRUD + RBAC that my current implementation carries. I'll go with Option 1 — updating this PR to align with the actual KAI contract:
On the follow-up work — I'd like to actively collaborate on both fronts:
I'll push the updated implementation shortly. Would really appreciate your review once it's up to make sure it matches the KAI-side contract exactly. |
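The ownership chain mentioned in this comment (Pod → Job → JobSet → TrainJob) can be illustrated with a minimal sketch. This is not KAI's podgrouper code: the `object` type, the `topOwner` function, and the `maxDepth` parameter are hypothetical names introduced only to show the idea of walking owner links up to the top-level workload, assuming each object has at most one controlling owner.

```go
package main

// object is a hypothetical, simplified stand-in for a Kubernetes object.
// It keeps only a kind, a name, and at most one controlling owner.
type object struct {
	Kind  string
	Name  string
	Owner *object // nil for the top-level object
}

// topOwner follows Owner links from o until it reaches an object with no
// owner, or until maxDepth hops have been taken. The depth limit is an
// assumption of this sketch that guards against cycles in hand-built data.
func topOwner(o *object, maxDepth int) *object {
	cur := o
	for i := 0; i < maxDepth && cur != nil && cur.Owner != nil; i++ {
		cur = cur.Owner
	}
	return cur
}
```

For a Pod owned by a Job, owned by a JobSet, owned by a TrainJob, `topOwner(pod, 8)` returns the TrainJob object, which is the level at which a single group for the whole workload would be derived.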
- Replace old kai package with new kaischeduler package
- Remove redundant PodGroup CRUD (KAI's podgrouper handles lifecycle)
- Remove Build()/ReconcilerBuilders() methods
- Rename KAIPodGroupPolicySource to KAISchedulerPodGroupPolicySource
- Rename field KAI to KAIScheduler, QueueName to Queue
- Remove ScheduleTimeoutSeconds (managed by KAI natively)
- Use correct label key: kai.scheduler/queue
- Update deepcopy, registry, and framework tests
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
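As a rough illustration of what remains after this refactor, the sketch below shows a plugin-side step that only attaches the queue label and leaves the PodGroup lifecycle to KAI. The struct name and label key come from the commit message; the field type, the JSON tag, and the `applyQueueLabel` helper are assumptions made for this example and are not taken from the PR's actual code.

```go
package main

// queueLabelKey is the label key named in the commit message.
const queueLabelKey = "kai.scheduler/queue"

// KAISchedulerPodGroupPolicySource mirrors the renamed struct from the commit
// message. The pointer type and JSON tag are assumptions of this sketch.
type KAISchedulerPodGroupPolicySource struct {
	Queue *string `json:"queue,omitempty"`
}

// applyQueueLabel is a hypothetical helper. It returns a copy of labels with
// the queue label added when a non-empty queue is configured, and it never
// mutates the input map.
func applyQueueLabel(labels map[string]string, src *KAISchedulerPodGroupPolicySource) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	if src != nil && src.Queue != nil && *src.Queue != "" {
		out[queueLabelKey] = *src.Queue
	}
	return out
}
```

Returning a copy rather than mutating the input is a design choice of this sketch; it keeps the helper safe to call on label maps shared between pod templates.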
6d30e23 to 83c8891
Description
This PR implements support for NVIDIA's KAI Scheduler as a new gang-scheduling backend in Kubeflow Trainer, addressing the need for advanced GPU scheduling in AI/ML workloads.
What is KAI Scheduler?
KAI Scheduler is NVIDIA's Kubernetes AI Scheduler that provides:
Changes Made
API Types (pkg/apis/trainer/v1alpha1/trainingruntime_types.go)
- Added KAI field to PodGroupPolicySource struct
- Defined KAIPodGroupPolicySource struct with:
  - QueueName: Optional queue for multi-tenant scheduling
  - ScheduleTimeoutSeconds: Timeout before failing unschedulable PodGroups
Plugin Implementation (pkg/runtime/framework/plugins/kai/kai.go)
- EnforcePodGroupPolicy: Sets KAI-specific pod labels
  - scheduling.kai.io/pod-group: Associates pods with their PodGroup
  - scheduling.kai.io/queue: Assigns pods to KAI queue
- Build: Creates scheduler-plugins PodGroup resources (compatible with KAI)
Registry (pkg/runtime/framework/plugins/registry.go)
- Registered kai.New in the plugin registry
Tests (pkg/runtime/framework/plugins/kai/kai_test.go)
Usage Example
Testing
- go build ./... succeeds
- controller-gen manifests generated
Related Issues
Fixes #2628
Checklist
Base Branch
master
Files Changed
- manifests/base/crds/*.yaml - Updated CRDs