
feat(runtimes): Add KAI Scheduler plugin for gang-scheduling support #3186

Open

Raakshass wants to merge 2 commits into kubeflow:master from Raakshass:kai-scheduler-2628

Conversation

@Raakshass

Description

This PR implements support for NVIDIA's KAI Scheduler as a new gang-scheduling backend in Kubeflow Trainer, addressing the need for advanced GPU scheduling in AI/ML workloads.

What is KAI Scheduler?

KAI Scheduler is NVIDIA's Kubernetes AI Scheduler that provides:

  • Gang scheduling for distributed training jobs
  • GPU-aware bin-packing for optimal resource utilization
  • Topology-aware placement (NVLink, NVSwitch)
  • Queue-based multi-tenant scheduling
  • Native integration with NVIDIA GPU Operator

Changes Made

API Types (pkg/apis/trainer/v1alpha1/trainingruntime_types.go)

  • Added KAI field to PodGroupPolicySource struct
  • Defined KAIPodGroupPolicySource struct (sketched after this list) with:
    • QueueName: Optional queue for multi-tenant scheduling
    • ScheduleTimeoutSeconds: Timeout before failing unschedulable PodGroups
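
For concreteness, here is a minimal sketch of the API addition these bullets describe. The field names come from this PR's description; the JSON tags and +optional markers are assumptions following common Kubernetes API conventions, and note that this design is revised later in the thread:

```go
// Sketch only: this PR's original API design (later revised in-thread).
// The real PodGroupPolicySource has other sources (e.g. coscheduling);
// only the new field is shown here.
type PodGroupPolicySource struct {
	// KAI selects NVIDIA's KAI Scheduler as the gang-scheduling backend.
	// +optional
	KAI *KAIPodGroupPolicySource `json:"kai,omitempty"`
}

// KAIPodGroupPolicySource configures the KAI Scheduler integration.
type KAIPodGroupPolicySource struct {
	// QueueName is the optional KAI queue for multi-tenant scheduling.
	// +optional
	QueueName *string `json:"queueName,omitempty"`

	// ScheduleTimeoutSeconds bounds how long a PodGroup may remain
	// unschedulable before it is failed (defaults to 60).
	// +optional
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}
```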

Plugin Implementation (pkg/runtime/framework/plugins/kai/kai.go)

  • EnforcePodGroupPolicy: Sets KAI-specific pod labels (see the sketch after this list)
    • scheduling.kai.io/pod-group: Associates pods with their PodGroup
    • scheduling.kai.io/queue: Assigns pods to KAI queue
  • Build: Creates scheduler-plugins PodGroup resources (compatible with KAI)
    • Aggregates MinMember from all PodSets
    • Calculates total MinResources from pod requests
    • Sets ScheduleTimeoutSeconds (default: 60s)
    • Configures proper owner references
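
Since the framework's plugin interfaces aren't reproduced here, below is a self-contained sketch of the two pieces of logic those bullets describe: stamping the KAI labels and summing MinResources. The helper names (applyKAILabels, totalMinResources) are hypothetical, not the actual plugin code:

```go
package kai

import (
	corev1 "k8s.io/api/core/v1"
)

// Label keys from this PR's original design (revised later in the thread).
const (
	podGroupLabelKey = "scheduling.kai.io/pod-group"
	queueLabelKey    = "scheduling.kai.io/queue"
)

// applyKAILabels is a hypothetical helper mirroring what
// EnforcePodGroupPolicy is described to do: associate each pod with its
// PodGroup and, when configured, with a KAI scheduling queue.
func applyKAILabels(labels map[string]string, podGroup, queue string) {
	labels[podGroupLabelKey] = podGroup
	if queue != "" {
		labels[queueLabelKey] = queue
	}
}

// totalMinResources mirrors the Build step's aggregation: sum the
// container requests of every replica across all pod sets to produce
// the PodGroup's MinResources.
func totalMinResources(specs []corev1.PodSpec, replicas []int32) corev1.ResourceList {
	total := corev1.ResourceList{}
	for i, spec := range specs {
		for _, c := range spec.Containers {
			for name, qty := range c.Resources.Requests {
				sum := total[name]
				for r := int32(0); r < replicas[i]; r++ {
					sum.Add(qty)
				}
				total[name] = sum
			}
		}
	}
	return total
}
```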

Registry (pkg/runtime/framework/plugins/registry.go)

  • Registered kai.New in the plugin registry

Tests (pkg/runtime/framework/plugins/kai/kai_test.go)

  • 8 comprehensive test cases (one is sketched after this list) covering:
    • Nil info/trainJob handling
    • PodGroup creation with proper MinMember/MinResources
    • Queue name label assignment
    • API error handling
    • Existing PodGroup skip logic
    • Default timeout values
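
As a flavor of what these cases check, here is a hypothetical table-driven test for the queue-label behavior, exercising the applyKAILabels sketch above (not the actual test file):

```go
package kai

import "testing"

// Hypothetical test in the spirit of the queue-name cases listed above.
func TestApplyKAILabels(t *testing.T) {
	cases := map[string]struct {
		queue     string
		wantQueue bool
	}{
		"queue set":   {queue: "high-priority", wantQueue: true},
		"queue empty": {queue: "", wantQueue: false},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			labels := map[string]string{}
			applyKAILabels(labels, "my-podgroup", tc.queue)
			if labels[podGroupLabelKey] != "my-podgroup" {
				t.Errorf("pod-group label = %q, want %q", labels[podGroupLabelKey], "my-podgroup")
			}
			if _, ok := labels[queueLabelKey]; ok != tc.wantQueue {
				t.Errorf("queue label present = %v, want %v", ok, tc.wantQueue)
			}
		})
	}
}
```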

Usage Example

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntime
metadata:
  name: kai-distributed-training
spec:
  mlPolicy:
    torch:
      numProcPerNode: "4"
  podGroupPolicy:
    kai:
      queueName: "high-priority"
      scheduleTimeoutSeconds: 120
  template:
    spec:
      containers:
      - name: trainer
        resources:
          limits:
            nvidia.com/gpu: 4

Testing

  • All 8 unit tests pass
  • go build ./... succeeds
  • controller-gen manifests generated

Related Issues

Fixes #2628

Checklist

  • Code follows existing patterns (similar to coscheduling plugin)
  • Unit tests added with comprehensive coverage
  • API types include proper JSON tags and documentation
  • Generated deepcopy functions updated
  • CRD manifests regenerated

Base Branch

master

Files Changed

  • pkg/apis/trainer/v1alpha1/trainingruntime_types.go - Added KAI API types
  • pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go - Generated deepcopy
  • pkg/runtime/framework/plugins/kai/kai.go - KAI plugin implementation
  • pkg/runtime/framework/plugins/kai/kai_test.go - Comprehensive tests
  • pkg/runtime/framework/plugins/registry.go - Plugin registration
  • manifests/base/crds/*.yaml - Updated CRDs

This PR implements support for NVIDIA's KAI Scheduler as a new
gang-scheduling backend in Kubeflow Trainer.

Changes:
- Added KAI field to PodGroupPolicySource struct
- Defined KAIPodGroupPolicySource with QueueName and ScheduleTimeoutSeconds
- Implemented kai plugin with EnforcePodGroupPolicy and Build methods
- Added 8 comprehensive unit tests
- Registered kai.New in the plugin registry

Fixes kubeflow#2628

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

@jaiakash
Member

jaiakash commented Feb 9, 2026

/ok-to-test

@Raakshass
Author

Hi @andreyvelich @jaiakash — gentle ping on this PR. It's been ready for review since Feb 8. The KAI Scheduler plugin adds gang-scheduling support for NVIDIA's GPU scheduler with 8 test cases covering all edge cases. Happy to address any feedback. Thanks!

@andreyvelich
Member

@EkinKarabulut @Ronkahn21 Can you please help review the KAI Scheduler support?
Do we need to introduce such a plugin in the TrainJob controller?

@Raakshass
Author

Thanks for pulling in the KAI team, @andreyvelich.
To answer the design question — yes, KAI needs a dedicated plugin rather than reusing the existing coscheduling plugin, for these reasons:

1. Different pod labeling scheme

  • coscheduling uses scheduling.x-k8s.io/pod-group annotation
  • KAI requires scheduling.kai.io/pod-group + scheduling.kai.io/queue labels
  • These are different label keys, namespaces, and semantics — coscheduling can't produce KAI-compatible labels

2. Queue-based scheduling

  • KAI's core differentiator is hierarchical queue management (queueName field)
  • coscheduling has no concept of queues — it only does gang scheduling
  • The scheduling.kai.io/queue label is mandatory for KAI to place workloads in the correct scheduling queue

3. Consistent plugin architecture

  • This follows the exact same pattern as the existing volcano plugin, which is also a separate plugin despite volcano also using PodGroups
  • volcano has its own Build() + EnforcePodGroupPolicy() because it needs volcano-specific labels/annotations
  • KAI follows the same separation for the same reason

4. No code duplication

  • The plugin reuses the scheduler-plugins PodGroup CRD (same as coscheduling) for the Build() output
  • Only EnforcePodGroupPolicy() differs — setting KAI-specific labels instead of coscheduling annotations (the keys are contrasted in the sketch below)
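
To make the contrast concrete, the divergence boils down to which metadata keys each backend reads (illustrative constants; keys as quoted in the points above):

```go
// Metadata keys each backend reads, per the points above (illustrative).
// Per the discussion, coscheduling's key is set as an annotation and
// KAI's as pod labels.
const (
	coschedulingPodGroupKey = "scheduling.x-k8s.io/pod-group" // coscheduling
	kaiPodGroupKey          = "scheduling.kai.io/pod-group"   // KAI (this PR's original design)
	kaiQueueKey             = "scheduling.kai.io/queue"       // KAI queue assignment
)
```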

Happy to refactor if the team prefers a different architecture, but this mirrors how volcano was integrated. cc @EkinKarabulut @Ronkahn21

@Ronkahn21

Since I’m not on the KAI team, I’ll defer to @omer-dayan and @romanbaron to review this.

@EkinKarabulut

Thanks a lot for working on KAI Scheduler support @Raakshass! First, apologies that this has taken a while. I've been working on this integration from both ends - the KAI Scheduler side and the Trainer side. The initial integration got delayed because of some RBAC concerns raised in the team, which led us to wait for JobSet support to land in KAI's podgrouper first. Once that was merged, we were able to build TrainJob support on top of it cleanly. So it took longer than expected, but the foundation is solid now. Here is some context from the KAI side that's relevant here:

KAI-side changes (merged yesterday):

  • TrainJob support in KAI's podgrouper, built on top of the JobSet support mentioned above (NVIDIA/KAI-Scheduler#1020)

What this means for the Trainer-side plugin

  • Because KAI's podgrouper handles PodGroup creation natively, the Trainer plugin only needs to do one thing: set the kai.scheduler/queue label (not scheduling.kai.io/queue) on pods so KAI knows which scheduling queue to use (see the sketch below).
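
In other words, the entire Trainer-side responsibility shrinks to roughly this (an illustrative sketch, not the branch's actual code; enforceQueueLabel is a hypothetical name):

```go
// kaiQueueLabel is the label KAI's podgrouper reads to pick the
// scheduling queue, per the corrected KAI-side contract.
const kaiQueueLabel = "kai.scheduler/queue"

// enforceQueueLabel sketches the reduced plugin responsibility: stamp
// the queue label and let KAI's podgrouper own the PodGroup lifecycle.
func enforceQueueLabel(podLabels map[string]string, queue string) {
	if queue == "" {
		return // no queue configured; leave the pod template untouched
	}
	podLabels[kaiQueueLabel] = queue
}
```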

I have a working implementation with end-to-end validation on a real cluster (4x T4 GPUs): https://github.com/EkinKarabulut/trainer/tree/pr-kaischeduler

I see two options and I'm happy with either:

  1. Update this PR: Remove the Build(), ReconcilerBuilders(), and RBAC changes, switch the label key to kai.scheduler/queue, and align the API field naming. I can help with the review to make sure it matches the KAI-side contract.
  2. I open a PR from my branch as the base, and you build on top of it with the follow-up work below.

Follow-up work (where collaboration would be really valuable):

  1. Documentation, testing and examples: User-facing docs and tested example ClusterTrainingRuntime YAMLs showing KAI Scheduler usage with TrainJob (on both repositories)
  2. Data cache + KAI topology-aware scheduling: Trainer's distributed data cache currently creates its LeaderWorkerSet without topology awareness. With KAI's topology-aware scheduling and existing LWS support, there's an opportunity to co-locate cache workers with their training pods for optimal data transfer (especially relevant for GB200/InfiniBand setups). Whether this requires additions to the integration still needs to be examined; if so, we should implement it together with full e2e testing. This is a meaningful area to explore and would be a great addition on top of the base integration.
  3. Anything else you might point out that is missing from my branch (additional tests, etc.)

Let me know what you'd prefer and happy to jump on a call if that's easier. Looking forward to getting this landed together either way! :)

@Raakshass
Author

Thanks @EkinKarabulut for the detailed context — and congrats on getting the KAI-side TrainJob support merged (NVIDIA/KAI-Scheduler#1020)! The podgrouper approach is architecturally cleaner — having KAI own the PodGroup lifecycle via the ownership chain (Pod → Job → JobSet → TrainJob) eliminates the redundant CRUD + RBAC that my current implementation carries.

I'll go with Option 1 — updating this PR to align with the actual KAI contract:

  1. Remove Build(), ReconcilerBuilders(), and all PodGroup creation/RBAC logic
  2. Simplify EnforcePodGroupPolicy() to only set the kai.scheduler/queue label on pods
  3. Fix label key: scheduling.kai.io/queue → kai.scheduler/queue
  4. Update API types — align field naming with KAI conventions
  5. Rewrite tests to match the simplified plugin scope
  6. Regenerate CRDs

On the follow-up work — I'd like to actively collaborate on both fronts:

  • Documentation + examples: I can write user-facing docs and tested ClusterTrainingRuntime YAMLs showing KAI Scheduler usage patterns with TrainJob (for both repos)
  • Topology-aware data cache + KAI scheduling: The LWS co-location opportunity with KAI's topology-aware placement for data cache workers sounds like a substantial integration — happy to explore this together, especially the GB200/Infiniband path

I'll push the updated implementation shortly. Would really appreciate your review once it's up to make sure it matches the KAI-side contract exactly.

cc @andreyvelich @Ronkahn21

Copilot review requested due to automatic review settings (February 18, 2026, 13:34)
google-oss-prow bot added size/L and removed size/XL labels (Feb 18, 2026)
@coveralls

coveralls commented Feb 18, 2026

Pull Request Test Coverage Report for Build 22147901393

Details

  • 11 of 18 (61.11%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 56.035%

Changes Missing Coverage                                      Covered  Changed/Added  %
pkg/runtime/framework/plugins/registry.go                     0        1              0.0%
pkg/runtime/framework/plugins/kaischeduler/kaischeduler.go    11       17             64.71%

Totals: change from base Build 22133652847: +0.04% · Covered Lines: 1402 · Relevant Lines: 2502

💛 - Coveralls

@Copilot
Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

- Replace old kai package with new kaischeduler package
- Remove redundant PodGroup CRUD (KAI's podgrouper handles lifecycle)
- Remove Build()/ReconcilerBuilders() methods
- Rename KAIPodGroupPolicySource to KAISchedulerPodGroupPolicySource
- Rename field KAI to KAIScheduler, QueueName to Queue
- Remove ScheduleTimeoutSeconds (managed by KAI natively)
- Use correct label key: kai.scheduler/queue
- Update deepcopy, registry, and framework tests

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
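
Reconstructed from the commit message above, the resulting API surface is roughly as follows (sketch only; the JSON tags are assumptions, and the diff is authoritative):

```go
// KAIScheduler replaces the earlier KAI field on PodGroupPolicySource.
type PodGroupPolicySource struct {
	// ... other sources (coscheduling, volcano) elided ...

	KAIScheduler *KAISchedulerPodGroupPolicySource `json:"kaiScheduler,omitempty"`
}

// KAISchedulerPodGroupPolicySource now carries a single knob: the queue.
type KAISchedulerPodGroupPolicySource struct {
	// Queue is applied to pods as the kai.scheduler/queue label.
	// PodGroup lifecycle and scheduling timeouts are handled natively
	// by KAI's podgrouper, so no ScheduleTimeoutSeconds field remains.
	Queue *string `json:"queue,omitempty"`
}
```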