feat(runtimes): Add XGBoost runtime (KEP-2598) #3200
Krishna-kg732 wants to merge 3 commits into kubeflow:master
Conversation
Pull request overview
Adds an initial XGBoost runtime plugin scaffold to the Trainer V2 runtime framework (per KEP-2598), along with the API wiring and constants needed to support a future Rabit env var injection implementation.
Changes:
- Introduces an `xgboost` runtime plugin scaffold implementing `EnforceMLPolicyPlugin` (stubbed behavior for now).
- Extends the TrainingRuntime API (`MLPolicySource`) with an `xgboost` policy source and updates the "only one policy" validation rule.
- Adds XGBoost/Rabit-related env var constants and registers the plugin in the runtime plugin registry (and updates the PlainML fallback guard).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/runtime/framework/plugins/xgboost/xgboost.go | New XGBoost plugin scaffold (EnforceMLPolicy stub + plugin name/factory). |
| pkg/runtime/framework/plugins/registry.go | Registers the XGBoost plugin in the plugin factory registry. |
| pkg/runtime/framework/plugins/plainml/plainml.go | Ensures PlainML no-ops when XGBoost (and JAX) ML policy sources are configured. |
| pkg/constants/constants.go | Adds Rabit/XGBoost env var constants + reserved env name set. |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds XGBoostMLPolicySource + MLPolicySource.XGBoost, and updates ML policy exclusivity validation. |
Force-pushed from 729c8be to 49c768a
Force-pushed from 985eaf4 to e5c552e
Pull Request Test Coverage Report for Build 22090812203 — Coveralls
Force-pushed from 7ec359f to 38e1f5a
/lgtm
```diff
  // MLPolicy represents configuration for the model training with ML-specific parameters.
  // +kubebuilder:validation:XValidation:rule="!(has(self.numNodes) && (has(self.torch) && has(self.torch.elasticPolicy)))", message="numNodes should not be set if torch.elasticPolicy is configured"
- // +kubebuilder:validation:XValidation:rule="!(has(self.torch) && has(self.mpi))", message="Only one of the policy can be configured"
+ // +kubebuilder:validation:XValidation:rule="[has(self.torch), has(self.mpi), has(self.jax), has(self.xgboost)].filter(x, x).size() <= 1", message="Only one of the policy can be configured"
```
was there a bug earlier that did not consider jax?
yes, the old rule on master only checked torch vs mpi — JAX was indeed missing. In this PR, I replaced that rule with the new CEL expression that covers all four policies (torch, mpi, jax, xgboost) at once, so it fixes the existing gap as well.
@Krishna-kg732 Please can you create separate PR to fix the JAX validation bug?
cc @kaisoz
Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
Force-pushed from 38e1f5a to dc135be
New changes are detected. LGTM label has been removed.
andreyvelich left a comment
Thank you for this work @Krishna-kg732!
Overall looks great, I left a few comments.
cc @kubeflow/kubeflow-trainer-team
```go
Name:  ptr.To(constants.XGBoostEnvNumWorker),
Value: ptr.To("2"),
```
Why is the number of workers 2? It should be 8:
DMLC_NUM_WORKER = numNodes (2) × numGPUs (4)
```go
// xgboost defines the configuration for the XGBoost Runtime.
// +optional
XGBoost *XGBoostMLPolicySource `json:"xgboost,omitempty"`
```
Please also add xgboost-distributed runtime in Helm Charts and Kustomize manifests, and install it by default alongside Torch, JAX, MLX, etc: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/kustomization.yaml
```go
// XGBoostMLPolicySource represents an XGBoost runtime configuration.
// The number of workers per node is automatically derived from container GPU resources:
//   - GPU training: 1 worker per GPU (from resourcesPerNode)
//   - CPU training: 1 worker per node
```
Can you clarify that in XGBoost a single worker still consumes all CPU cores?
Ref: #3118 (comment)
cc @trivialfis
```go
if res := runtime.ExtractResourcePerNodeFromRuntime(info); res != nil {
	if gpuCount := runtime.GetNumGPUPerNode(res); gpuCount > 0 {
		numWorkersPerNode = int32(gpuCount)
	}
}
```
This needs to be adjusted: you need to check the runtime resources first, set a temporary resourcesPerNode, and then override it with the value from the TrainJob.
Check here:
https://github.com/Krishna-kg732/trainer/blob/dc135be8b1428ac8145102a4a255826c9490a4e9/pkg/runtime/framework/plugins/torch/torch.go#L114-L118
Please also create these unit tests:
- Resources are not set
- Resources are set in Runtime only
- Resources are set in TrainJob
- Resources are set in Runtime and TrainJob
```go
	}
}

func TestXGBoostValidate(t *testing.T) {
```
Please move this Test to the top of the file.
```go
	utiltesting "github.com/kubeflow/trainer/v2/pkg/util/testing"
)

func TestXGBoostEnforceMLPolicy(t *testing.T) {
```
@tenzen-y @astefanutti @kaisoz Shall we change JAX and Torch unit tests to similar name too?
e.g. TestJAXEnforceMLPolicy: https://github.com/Krishna-kg732/trainer/blob/dc135be8b1428ac8145102a4a255826c9490a4e9/pkg/runtime/framework/plugins/jax/jax_test.go#L40
```diff
@@ -0,0 +1,467 @@
```
Please also add:
- Integration test. Check: https://github.com/Krishna-kg732/trainer/blob/dc135be8b1428ac8145102a4a255826c9490a4e9/test/integration/controller/trainjob_controller_test.go#L1408
- E2E tests. Check: https://github.com/Krishna-kg732/trainer/blob/dc135be8b1428ac8145102a4a255826c9490a4e9/test/e2e/e2e_test.go#L184
- Example Notebook with XGBoost training.
```diff
@@ -0,0 +1,135 @@
```
You also need to create dedicated Trainer Runtime Docker image for XGBoost: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2598-XGboost-runtime-trainer-v2#container-image
What this PR does
Implements the XGBoost runtime plugin for Kubeflow Trainer V2, as proposed in KEP-2598. This plugin enables distributed XGBoost training using Rabit/Collective coordination by automatically injecting DMLC environment variables into trainer containers.
Changes
New Files
- pkg/runtime/framework/plugins/xgboost/xgboost.go — Plugin implementing `EnforceMLPolicyPlugin` and `CustomValidationPlugin`. Injects `DMLC_TRACKER_URI`, `DMLC_TRACKER_PORT`, `DMLC_TASK_ID`, `DMLC_NUM_WORKER` env vars and auto-derives `numWorkersPerNode` from GPU resources (1 worker per GPU, or 1 per node for CPU).
- pkg/runtime/framework/plugins/xgboost/xgboost_test.go — Unit tests covering `EnforceMLPolicy` (nil guards, single/multi-node CPU, GPU resources, numNodes override) and `Validate` (reserved `DMLC_*` env name rejection).

Modified Files
- pkg/apis/trainer/v1alpha1/trainingruntime_types.go — Added `XGBoostMLPolicySource` struct, `XGBoost` field to `MLPolicySource`, and updated CEL mutual-exclusion validation rule.
- pkg/constants/constants.go — Added XGBoost/Rabit constants and `XGBoostReservedEnvNames` set.
- pkg/runtime/framework/plugins/registry.go — Registered the XGBoost plugin.
- pkg/runtime/framework/plugins/plainml/plainml.go — Added XGBoost to the PlainML fallback guard.
- pkg/runtime/framework/core/framework_test.go — Updated `TestNew` to include XGBoost in expected plugin lists.
- pkg/util/testing/wrapper.go — Added `XGBoostPolicy()` test helper.

How was this tested?
- `go test ./pkg/runtime/framework/plugins/xgboost/...` ✅ (9 test cases)
- `go test ./pkg/runtime/framework/core/ -run TestNew` ✅
- `go test ./pkg/runtime/framework/plugins/...` ✅ (all plugins pass)

TODO (follow-up PRs)
/kind feature
/area runtime