
feat(runtimes): Add KAI Scheduler plugin for gang-scheduling support #3186

Open

Raakshass wants to merge 2 commits into kubeflow:master from Raakshass:kai-scheduler-2628

Conversation

@Raakshass

Description

This PR implements support for NVIDIA's KAI Scheduler as a new gang-scheduling backend in Kubeflow Trainer, addressing the need for advanced GPU scheduling in AI/ML workloads.

What is KAI Scheduler?

KAI Scheduler is NVIDIA's Kubernetes AI Scheduler that provides:

  • Gang scheduling for distributed training jobs
  • GPU-aware bin-packing for optimal resource utilization
  • Topology-aware placement (NVLink, NVSwitch)
  • Queue-based multi-tenant scheduling
  • Native integration with NVIDIA GPU Operator

Changes Made

API Types (pkg/apis/trainer/v1alpha1/trainingruntime_types.go)

  • Added KAI field to PodGroupPolicySource struct
  • Defined KAIPodGroupPolicySource struct (sketched after this list) with:
    • QueueName: Optional queue for multi-tenant scheduling
    • ScheduleTimeoutSeconds: Timeout before failing unschedulable PodGroups
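
For concreteness, here is a minimal sketch of the API addition these bullets describe. The field names come from this PR's description; the JSON tags and +optional markers are assumptions following common Kubernetes API conventions, and note that this design is revised later in the thread:

```go
// Sketch only: this PR's original API design (later revised in-thread).
// The real PodGroupPolicySource has other sources (e.g. coscheduling);
// only the new field is shown here.
type PodGroupPolicySource struct {
	// KAI selects NVIDIA's KAI Scheduler as the gang-scheduling backend.
	// +optional
	KAI *KAIPodGroupPolicySource `json:"kai,omitempty"`
}

// KAIPodGroupPolicySource configures the KAI Scheduler integration.
type KAIPodGroupPolicySource struct {
	// QueueName is the optional KAI queue for multi-tenant scheduling.
	// +optional
	QueueName *string `json:"queueName,omitempty"`

	// ScheduleTimeoutSeconds bounds how long a PodGroup may remain
	// unschedulable before it is failed (defaults to 60).
	// +optional
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}
```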

Plugin Implementation (pkg/runtime/framework/plugins/kai/kai.go)

  • EnforcePodGroupPolicy: Sets KAI-specific pod labels (see the sketch after this list)
    • scheduling.kai.io/pod-group: Associates pods with their PodGroup
    • scheduling.kai.io/queue: Assigns pods to KAI queue
  • Build: Creates scheduler-plugins PodGroup resources (compatible with KAI)
    • Aggregates MinMember from all PodSets
    • Calculates total MinResources from pod requests
    • Sets ScheduleTimeoutSeconds (default: 60s)
    • Configures proper owner references
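
Since the framework's plugin interfaces aren't reproduced here, below is a self-contained sketch of the two pieces of logic those bullets describe: stamping the KAI labels and summing MinResources. The helper names (applyKAILabels, totalMinResources) are hypothetical, not the actual plugin code:

```go
package kai

import (
	corev1 "k8s.io/api/core/v1"
)

// Label keys from this PR's original design (revised later in the thread).
const (
	podGroupLabelKey = "scheduling.kai.io/pod-group"
	queueLabelKey    = "scheduling.kai.io/queue"
)

// applyKAILabels is a hypothetical helper mirroring what
// EnforcePodGroupPolicy is described to do: associate each pod with its
// PodGroup and, when configured, with a KAI scheduling queue.
func applyKAILabels(labels map[string]string, podGroup, queue string) {
	labels[podGroupLabelKey] = podGroup
	if queue != "" {
		labels[queueLabelKey] = queue
	}
}

// totalMinResources mirrors the Build step's aggregation: sum the
// container requests of every replica across all pod sets to produce
// the PodGroup's MinResources.
func totalMinResources(specs []corev1.PodSpec, replicas []int32) corev1.ResourceList {
	total := corev1.ResourceList{}
	for i, spec := range specs {
		for _, c := range spec.Containers {
			for name, qty := range c.Resources.Requests {
				sum := total[name]
				for r := int32(0); r < replicas[i]; r++ {
					sum.Add(qty)
				}
				total[name] = sum
			}
		}
	}
	return total
}
```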

Registry (pkg/runtime/framework/plugins/registry.go)

  • Registered kai.New in the plugin registry

Tests (pkg/runtime/framework/plugins/kai/kai_test.go)

  • 8 comprehensive test cases (one is sketched after this list) covering:
    • Nil info/trainJob handling
    • PodGroup creation with proper MinMember/MinResources
    • Queue name label assignment
    • API error handling
    • Existing PodGroup skip logic
    • Default timeout values
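
As a flavor of what these cases check, here is a hypothetical table-driven test for the queue-label behavior, exercising the applyKAILabels sketch above (not the actual test file):

```go
package kai

import "testing"

// Hypothetical test in the spirit of the queue-name cases listed above.
func TestApplyKAILabels(t *testing.T) {
	cases := map[string]struct {
		queue     string
		wantQueue bool
	}{
		"queue set":   {queue: "high-priority", wantQueue: true},
		"queue empty": {queue: "", wantQueue: false},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			labels := map[string]string{}
			applyKAILabels(labels, "my-podgroup", tc.queue)
			if labels[podGroupLabelKey] != "my-podgroup" {
				t.Errorf("pod-group label = %q, want %q", labels[podGroupLabelKey], "my-podgroup")
			}
			if _, ok := labels[queueLabelKey]; ok != tc.wantQueue {
				t.Errorf("queue label present = %v, want %v", ok, tc.wantQueue)
			}
		})
	}
}
```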

Usage Example

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntime
metadata:
  name: kai-distributed-training
spec:
  mlPolicy:
    torch:
      numProcPerNode: "4"
  podGroupPolicy:
    kai:
      queueName: "high-priority"
      scheduleTimeoutSeconds: 120
  template:
    spec:
      containers:
      - name: trainer
        resources:
          limits:
            nvidia.com/gpu: 4

Testing

  • All 8 unit tests pass
  • go build ./... succeeds
  • controller-gen manifests generated

Related Issues

Fixes #2628

Checklist

  • Code follows existing patterns (similar to coscheduling plugin)
  • Unit tests added with comprehensive coverage
  • API types include proper JSON tags and documentation
  • Generated deepcopy functions updated
  • CRD manifests regenerated

Base Branch

master

Files Changed

  • pkg/apis/trainer/v1alpha1/trainingruntime_types.go - Added KAI API types
  • pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go - Generated deepcopy
  • pkg/runtime/framework/plugins/kai/kai.go - KAI plugin implementation
  • pkg/runtime/framework/plugins/kai/kai_test.go - Comprehensive tests
  • pkg/runtime/framework/plugins/registry.go - Plugin registration
  • manifests/base/crds/*.yaml - Updated CRDs

This PR implements support for NVIDIA's KAI Scheduler as a new
gang-scheduling backend in Kubeflow Trainer.

Changes:
- Added KAI field to PodGroupPolicySource struct
- Defined KAIPodGroupPolicySource with QueueName and ScheduleTimeoutSeconds
- Implemented kai plugin with EnforcePodGroupPolicy and Build methods
- Added 8 comprehensive unit tests
- Registered kai.New in the plugin registry

Fixes kubeflow#2628

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

@jaiakash
Member

jaiakash commented Feb 9, 2026

/ok-to-test

@Raakshass
Author

Hi @andreyvelich @jaiakash — gentle ping on this PR. It's been ready for review since Feb 8. The KAI Scheduler plugin adds gang-scheduling support for NVIDIA's GPU scheduler with 8 test cases covering all edge cases. Happy to address any feedback. Thanks!

@andreyvelich
Member

@EkinKarabulut @Ronkahn21 Can you please help review the KAI Scheduler support?
Do we need to introduce such a plugin in the TrainJob controller?

@Raakshass
Author

Thanks for pulling in the KAI team, @andreyvelich.
To answer the design question — yes, KAI needs a dedicated plugin rather than reusing the existing coscheduling plugin, for these reasons:

1. Different pod labeling scheme

  • coscheduling uses scheduling.x-k8s.io/pod-group annotation
  • KAI requires scheduling.kai.io/pod-group + scheduling.kai.io/queue labels
  • These are different label keys, namespaces, and semantics — coscheduling can't produce KAI-compatible labels

2. Queue-based scheduling

  • KAI's core differentiator is hierarchical queue management (queueName field)
  • coscheduling has no concept of queues — it only does gang scheduling
  • The scheduling.kai.io/queue label is mandatory for KAI to place workloads in the correct scheduling queue

3. Consistent plugin architecture

  • This follows the exact same pattern as the existing volcano plugin, which is also a separate plugin despite volcano also using PodGroups
  • volcano has its own Build() + EnforcePodGroupPolicy() because it needs volcano-specific labels/annotations
  • KAI follows the same separation for the same reason

4. No code duplication

  • The plugin reuses the scheduler-plugins PodGroup CRD (same as coscheduling) for the Build() output
  • Only EnforcePodGroupPolicy() differs — setting KAI-specific labels instead of coscheduling annotations (the keys are contrasted in the sketch below)
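
To make the contrast concrete, the divergence boils down to which metadata keys each backend reads (illustrative constants; keys as quoted in the points above):

```go
// Metadata keys each backend reads, per the points above (illustrative).
// Per the discussion, coscheduling's key is set as an annotation and
// KAI's as pod labels.
const (
	coschedulingPodGroupKey = "scheduling.x-k8s.io/pod-group" // coscheduling
	kaiPodGroupKey          = "scheduling.kai.io/pod-group"   // KAI (this PR's original design)
	kaiQueueKey             = "scheduling.kai.io/queue"       // KAI queue assignment
)
```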

Happy to refactor if the team prefers a different architecture, but this mirrors how volcano was integrated. cc @EkinKarabulut @Ronkahn21

@Ronkahn21

Since I’m not on the KAI team, I’ll defer to @omer-dayan and @romanbaron to review this.

@EkinKarabulut

Thanks a lot for working on KAI Scheduler support @Raakshass! First, apologies that this has taken a while. I've been working on this integration from both ends - the KAI Scheduler side and the Trainer side. The initial integration got delayed because of some RBAC concerns raised in the team, which led us to wait for JobSet support to land in KAI's podgrouper first. Once that was merged, we were able to build TrainJob support on top of it cleanly. So it took longer than expected, but the foundation is solid now. Here is some context from the KAI side that's relevant here:

KAI-side changes (merged yesterday):

  • TrainJob support in KAI's podgrouper, built on top of the JobSet support mentioned above (NVIDIA/KAI-Scheduler#1020)

What this means for the Trainer-side plugin

  • Because KAI's podgrouper handles PodGroup creation natively, the Trainer plugin only needs to do one thing: set the kai.scheduler/queue label (not scheduling.kai.io/queue) on pods so KAI knows which scheduling queue to use (see the sketch below).
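
In other words, the entire Trainer-side responsibility shrinks to roughly this (an illustrative sketch, not the branch's actual code; enforceQueueLabel is a hypothetical name):

```go
// kaiQueueLabel is the label KAI's podgrouper reads to pick the
// scheduling queue, per the corrected KAI-side contract.
const kaiQueueLabel = "kai.scheduler/queue"

// enforceQueueLabel sketches the reduced plugin responsibility: stamp
// the queue label and let KAI's podgrouper own the PodGroup lifecycle.
func enforceQueueLabel(podLabels map[string]string, queue string) {
	if queue == "" {
		return // no queue configured; leave the pod template untouched
	}
	podLabels[kaiQueueLabel] = queue
}
```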

I have a working implementation with end-to-end validation on a real cluster (4x T4 GPUs): https://github.com/EkinKarabulut/trainer/tree/pr-kaischeduler

I see two options and I'm happy with either:

  1. Update this PR: Remove the Build(), ReconcilerBuilders(), and RBAC changes, switch the label key to kai.scheduler/queue, and align the API field naming. I can help with the review to make sure it matches the KAI-side contract.
  2. I open a PR from my branch as the base, and you build on top of it with the follow-up work below.

Follow-up work (where collaboration would be really valuable):

  1. Documentation, testing and examples: User-facing docs and tested example ClusterTrainingRuntime YAMLs showing KAI Scheduler usage with TrainJob (on both repositories)
  2. Data cache + KAI topology-aware scheduling: Trainer's distributed data cache currently creates its LeaderWorkerSet without topology awareness. With KAI's topology-aware scheduling and existing LWS support, there's an opportunity to co-locate cache workers with their training pods for optimal data transfer (especially relevant for GB200/InfiniBand setups). Whether this requires additions to the integration still needs to be examined; if so, we should implement it together with full e2e testing. This is a meaningful area to explore and would be a great addition on top of the base integration.
  3. Anything else you might point out that is missing from my branch (additional tests, etc.)

Let me know what you'd prefer and happy to jump on a call if that's easier. Looking forward to getting this landed together either way! :)

@Raakshass
Author

Thanks @EkinKarabulut for the detailed context — and congrats on getting the KAI-side TrainJob support merged (NVIDIA/KAI-Scheduler#1020)! The podgrouper approach is architecturally cleaner — having KAI own the PodGroup lifecycle via the ownership chain (Pod → Job → JobSet → TrainJob) eliminates the redundant CRUD + RBAC that my current implementation carries.

I'll go with Option 1 — updating this PR to align with the actual KAI contract:

  1. Remove Build(), ReconcilerBuilders(), and all PodGroup creation/RBAC logic
  2. Simplify EnforcePodGroupPolicy() to only set the kai.scheduler/queue label on pods
  3. Fix label key: scheduling.kai.io/queue → kai.scheduler/queue
  4. Update API types — align field naming with KAI conventions
  5. Rewrite tests to match the simplified plugin scope
  6. Regenerate CRDs

On the follow-up work — I'd like to actively collaborate on both fronts:

  • Documentation + examples: I can write user-facing docs and tested ClusterTrainingRuntime YAMLs showing KAI Scheduler usage patterns with TrainJob (for both repos)
  • Topology-aware data cache + KAI scheduling: The LWS co-location opportunity with KAI's topology-aware placement for data cache workers sounds like a substantial integration — happy to explore this together, especially the GB200/Infiniband path

I'll push the updated implementation shortly. Would really appreciate your review once it's up to make sure it matches the KAI-side contract exactly.

cc @andreyvelich @Ronkahn21

Copilot review requested due to automatic review settings (February 18, 2026, 13:34)
google-oss-prow bot added size/L and removed size/XL labels (Feb 18, 2026)
@coveralls

coveralls commented Feb 18, 2026

Pull Request Test Coverage Report for Build 22147901393

Details

  • 11 of 18 (61.11%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 56.035%

Changes Missing Coverage                                      Covered  Changed/Added  %
pkg/runtime/framework/plugins/registry.go                     0        1              0.0%
pkg/runtime/framework/plugins/kaischeduler/kaischeduler.go    11       17             64.71%

Totals: change from base Build 22133652847: +0.04% · Covered Lines: 1402 · Relevant Lines: 2502

💛 - Coveralls

@Copilot
Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

- Replace old kai package with new kaischeduler package
- Remove redundant PodGroup CRUD (KAI's podgrouper handles lifecycle)
- Remove Build()/ReconcilerBuilders() methods
- Rename KAIPodGroupPolicySource to KAISchedulerPodGroupPolicySource
- Rename field KAI to KAIScheduler, QueueName to Queue
- Remove ScheduleTimeoutSeconds (managed by KAI natively)
- Use correct label key: kai.scheduler/queue
- Update deepcopy, registry, and framework tests

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
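
Reconstructed from the commit message above, the resulting API surface is roughly as follows (sketch only; the JSON tags are assumptions, and the diff is authoritative):

```go
// KAIScheduler replaces the earlier KAI field on PodGroupPolicySource.
type PodGroupPolicySource struct {
	// ... other sources (coscheduling, volcano) elided ...

	KAIScheduler *KAISchedulerPodGroupPolicySource `json:"kaiScheduler,omitempty"`
}

// KAISchedulerPodGroupPolicySource now carries a single knob: the queue.
type KAISchedulerPodGroupPolicySource struct {
	// Queue is applied to pods as the kai.scheduler/queue label.
	// PodGroup lifecycle and scheduling timeouts are handled natively
	// by KAI's podgrouper, so no ScheduleTimeoutSeconds field remains.
	Queue *string `json:"queue,omitempty"`
}
```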