feat(runtimes): Add XGBoost runtime(KEP-2598) #3200

Open
Krishna-kg732 wants to merge 3 commits into kubeflow:master from Krishna-kg732:xgboost-runtime-implementation

Conversation

@Krishna-kg732
Contributor

@Krishna-kg732 Krishna-kg732 commented Feb 12, 2026

What this PR does

Implements the XGBoost runtime plugin for Kubeflow Trainer V2, as proposed in KEP-2598. This plugin enables distributed XGBoost training using Rabit/Collective coordination by automatically injecting DMLC environment variables into trainer containers.

Changes

New Files

  • pkg/runtime/framework/plugins/xgboost/xgboost.go — Plugin implementing EnforceMLPolicyPlugin and CustomValidationPlugin. Injects DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_TASK_ID, DMLC_NUM_WORKER env vars and auto-derives numWorkersPerNode from GPU resources (1 worker per GPU, or 1 per node for CPU).
  • pkg/runtime/framework/plugins/xgboost/xgboost_test.go — Unit tests covering EnforceMLPolicy (nil guards, single/multi-node CPU, GPU resources, numNodes override) and Validate (reserved DMLC_* env name rejection).
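
As a rough illustration of the env injection described above, here is a minimal sketch that builds the four DMLC_* variables for one worker. The `EnvVar` type, the `dmlcEnv` helper, the tracker host name, and the port are hypothetical stand-ins for illustration; the actual plugin works on the Trainer runtime types and may derive these values differently.

```go
package main

import (
	"fmt"
	"strconv"
)

// EnvVar is a simplified stand-in for a Kubernetes container env var.
type EnvVar struct {
	Name  string
	Value string
}

// dmlcEnv returns the four DMLC_* variables named in this PR for one worker.
// trackerURI, trackerPort, and taskID are hypothetical inputs; how the real
// plugin derives them is not shown in this conversation.
func dmlcEnv(trackerURI string, trackerPort, taskID int, numWorkers int32) []EnvVar {
	return []EnvVar{
		{Name: "DMLC_TRACKER_URI", Value: trackerURI},
		{Name: "DMLC_TRACKER_PORT", Value: strconv.Itoa(trackerPort)},
		{Name: "DMLC_TASK_ID", Value: strconv.Itoa(taskID)},
		{Name: "DMLC_NUM_WORKER", Value: strconv.FormatInt(int64(numWorkers), 10)},
	}
}

func main() {
	// Example values only: a made-up tracker host and port.
	for _, e := range dmlcEnv("example-tracker-host", 9091, 0, 2) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```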

Modified Files

  • pkg/apis/trainer/v1alpha1/trainingruntime_types.go — Added XGBoostMLPolicySource struct, XGBoost field to MLPolicySource, and updated CEL mutual exclusion validation rule.
  • pkg/constants/constants.go — Added XGBoost/Rabit constants and XGBoostReservedEnvNames set.
  • pkg/runtime/framework/plugins/registry.go — Registered the XGBoost plugin.
  • pkg/runtime/framework/plugins/plainml/plainml.go — Added XGBoost to the PlainML fallback guard.
  • pkg/runtime/framework/core/framework_test.go — Updated TestNew to include XGBoost in expected plugin lists.
  • pkg/util/testing/wrapper.go — Added XGBoostPolicy() test helper.
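
The reserved-name validation mentioned above can be pictured roughly as follows. This is a sketch under assumptions: `reservedEnvNames` only mirrors the idea of `XGBoostReservedEnvNames` (the actual set in pkg/constants may contain different entries), and `validateUserEnv` is a hypothetical helper, not the plugin's `Validate` method.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// reservedEnvNames is an assumed set based on the variables listed in the PR
// description; the real XGBoostReservedEnvNames may differ.
var reservedEnvNames = map[string]struct{}{
	"DMLC_TRACKER_URI":  {},
	"DMLC_TRACKER_PORT": {},
	"DMLC_TASK_ID":      {},
	"DMLC_NUM_WORKER":   {},
}

// validateUserEnv returns an error listing every user-supplied env name that
// collides with a reserved name, or nil when there is no collision.
func validateUserEnv(names []string) error {
	var clashes []string
	for _, n := range names {
		if _, reserved := reservedEnvNames[n]; reserved {
			clashes = append(clashes, n)
		}
	}
	if len(clashes) == 0 {
		return nil
	}
	sort.Strings(clashes)
	return fmt.Errorf("reserved env names must not be set: %s", strings.Join(clashes, ", "))
}

func main() {
	fmt.Println(validateUserEnv([]string{"LEARNING_RATE"}))
	fmt.Println(validateUserEnv([]string{"DMLC_TASK_ID", "LEARNING_RATE"}))
}
```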

How was this tested?

  • go test ./pkg/runtime/framework/plugins/xgboost/... ✅ (9 test cases)
  • go test ./pkg/runtime/framework/core/ -run TestNew
  • go test ./pkg/runtime/framework/plugins/... ✅ (all plugins pass)

TODO (follow-up PRs)

  • Add E2E tests
  • Add ClusterTrainingRuntime YAML manifests
  • Add example notebook

/kind feature
/area runtime

Copilot AI review requested due to automatic review settings February 12, 2026 04:24
@google-oss-prow

@Krishna-kg732: The label(s) area/runtime cannot be applied, because the repository doesn't have them.

Details

In response to this:

What this PR does

Adds the XGBoost runtime plugin scaffold to the Trainer V2 framework. This is the foundational PR for KEP-2598: XGBoost Runtime — it introduces the plugin structure and API types without the full implementation, which will follow in a subsequent PR.

Changes

New Files

  • pkg/runtime/framework/plugins/xgboost/xgboost.go — Plugin scaffold implementing EnforceMLPolicyPlugin with a stub EnforceMLPolicy (Rabit env injection TODO)

Modified Files

  • pkg/apis/trainer/v1alpha1/trainingruntime_types.go — Added XGBoostMLPolicySource struct and XGBoost field to MLPolicySource
  • pkg/constants/constants.go — Added XGBoost/Rabit constants (DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_TASK_ID, DMLC_NUM_WORKER) and reserved env set
  • pkg/runtime/framework/plugins/registry.go — Registered the XGBoost plugin
  • pkg/runtime/framework/plugins/plainml/plainml.go — Added XGBoost to the PlainML fallback guard

What's NOT in this PR (intentionally)

  • EnforceMLPolicy implementation (Rabit env var injection) — will be in a follow-up PR
  • Unit tests and E2E tests — will accompany the implementation PR
  • ClusterTrainingRuntime YAML manifests

How was this tested?

  • go build ./pkg/runtime/framework/plugins/...
  • go vet ./pkg/runtime/framework/plugins/xgboost/...

/kind feature
/area runtime

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@Krishna-kg732 Krishna-kg732 changed the title feat(runtime): Add XGBoost runtime plugin scaffold (KEP-2598) feat(runtime): Add XGBoost runtime(KEP-2598) Feb 12, 2026
@Krishna-kg732 Krishna-kg732 changed the title feat(runtime): Add XGBoost runtime(KEP-2598) feat(runtimes): Add XGBoost runtime(KEP-2598) Feb 12, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds an initial XGBoost runtime plugin scaffold to the Trainer V2 runtime framework (per KEP-2598), along with the API wiring and constants needed to support a future Rabit env var injection implementation.

Changes:

  • Introduces an xgboost runtime plugin scaffold implementing EnforceMLPolicyPlugin (stubbed behavior for now).
  • Extends the TrainingRuntime API (MLPolicySource) with an xgboost policy source and updates the “only one policy” validation rule.
  • Adds XGBoost/Rabit-related env var constants and registers the plugin in the runtime plugin registry (and updates PlainML fallback guard).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/runtime/framework/plugins/xgboost/xgboost.go | New XGBoost plugin scaffold (EnforceMLPolicy stub + plugin name/factory). |
| pkg/runtime/framework/plugins/registry.go | Registers the XGBoost plugin in the plugin factory registry. |
| pkg/runtime/framework/plugins/plainml/plainml.go | Ensures PlainML no-ops when XGBoost (and JAX) ML policy sources are configured. |
| pkg/constants/constants.go | Adds Rabit/XGBoost env var constants + reserved env name set. |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds XGBoostMLPolicySource + MLPolicySource.XGBoost, and updates ML policy exclusivity validation. |

@Krishna-kg732 Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 729c8be to 49c768a Compare February 12, 2026 04:33
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Feb 14, 2026
@Krishna-kg732 Krishna-kg732 force-pushed the xgboost-runtime-implementation branch 2 times, most recently from 985eaf4 to e5c552e Compare February 14, 2026 05:10
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Feb 16, 2026
@coveralls

coveralls commented Feb 16, 2026

Pull Request Test Coverage Report for Build 22090812203

Details

  • 77 of 84 (91.67%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.2%) to 57.148%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/runtime/framework/plugins/registry.go | 0 | 1 | 0.0% |
| pkg/runtime/framework/plugins/xgboost/xgboost.go | 72 | 78 | 92.31% |

Totals
Change from base Build 22081023611: 1.2%
Covered Lines: 1467
Relevant Lines: 2567

💛 - Coveralls

@Krishna-kg732 Krishna-kg732 force-pushed the xgboost-runtime-implementation branch 2 times, most recently from 7ec359f to 38e1f5a Compare February 16, 2026 13:29
@akshaychitneni
Contributor

akshaychitneni commented Feb 17, 2026

/lgtm
Thanks @Krishna-kg732

  // MLPolicy represents configuration for the model training with ML-specific parameters.
  // +kubebuilder:validation:XValidation:rule="!(has(self.numNodes) && (has(self.torch) && has(self.torch.elasticPolicy)))", message="numNodes should not be set if torch.elasticPolicy is configured"
- // +kubebuilder:validation:XValidation:rule="!(has(self.torch) && has(self.mpi))", message="Only one of the policy can be configured"
+ // +kubebuilder:validation:XValidation:rule="[has(self.torch), has(self.mpi), has(self.jax), has(self.xgboost)].filter(x, x).size() <= 1", message="Only one of the policy can be configured"
Contributor

was there a bug earlier that did not consider jax?

Contributor Author

yes, the old rule on master only checked torch vs mpi — JAX was indeed missing. In this PR, I replaced that rule with the new CEL expression that covers all four policies (torch, mpi, jax, xgboost) at once, so it fixes the existing gap as well.
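
To make the difference between the two rules concrete, here is a small Go sketch that mirrors both CEL expressions as plain functions. The `policySet` type and both function names are hypothetical and exist only for this illustration; the actual validation runs as CEL in the API server, not as Go code.

```go
package main

import "fmt"

// policySet records which ML policy sources are configured.
// It is a hypothetical stand-in for MLPolicySource.
type policySet struct {
	Torch, MPI, JAX, XGBoost bool
}

// oldRule mirrors !(has(self.torch) && has(self.mpi)).
func oldRule(p policySet) bool {
	return !(p.Torch && p.MPI)
}

// newRule mirrors
// [has(self.torch), has(self.mpi), has(self.jax), has(self.xgboost)].filter(x, x).size() <= 1.
func newRule(p policySet) bool {
	configured := 0
	for _, set := range []bool{p.Torch, p.MPI, p.JAX, p.XGBoost} {
		if set {
			configured++
		}
	}
	return configured <= 1
}

func main() {
	// Torch and JAX together: accepted by the old rule, rejected by the new one.
	p := policySet{Torch: true, JAX: true}
	fmt.Println(oldRule(p), newRule(p)) // prints "true false"
}
```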

Member

@Krishna-kg732 Can you please create a separate PR to fix the JAX validation bug?
cc @kaisoz

Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
@Krishna-kg732 Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 38e1f5a to dc135be Compare February 17, 2026 08:15
@google-oss-prow google-oss-prow bot removed the lgtm label Feb 17, 2026
@google-oss-prow

New changes are detected. LGTM label has been removed.

Member

@andreyvelich andreyvelich left a comment

Thank you for this work @Krishna-kg732!
Overall looks great, I left a few comments.
cc @kubeflow/kubeflow-trainer-team

Comment on lines +266 to +267
Name: ptr.To(constants.XGBoostEnvNumWorker),
Value: ptr.To("2"),
Member

Why is the number of workers 2 here? Shouldn't it be 8?
DMLC_NUM_WORKER = numNodes (2) × numGPUs (4)
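
The arithmetic in this comment can be written out as a short sketch. `totalWorkers` is a hypothetical helper, and it assumes the rule stated in the PR description: one worker per GPU, or one worker per node when no GPU is requested.

```go
package main

import "fmt"

// totalWorkers computes the value expected for DMLC_NUM_WORKER, assuming one
// worker per GPU and one worker per node for CPU-only training.
func totalWorkers(numNodes, gpusPerNode int32) int32 {
	workersPerNode := int32(1)
	if gpusPerNode > 0 {
		workersPerNode = gpusPerNode
	}
	return numNodes * workersPerNode
}

func main() {
	fmt.Println(totalWorkers(2, 4)) // 2 nodes x 4 GPUs: prints 8
	fmt.Println(totalWorkers(2, 0)) // 2 CPU-only nodes: prints 2
}
```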


// xgboost defines the configuration for the XGBoost Runtime.
// +optional
XGBoost *XGBoostMLPolicySource `json:"xgboost,omitempty"`
Member

Please also add the xgboost-distributed runtime to the Helm charts and Kustomize manifests, and install it by default alongside Torch, JAX, MLX, etc.: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/kustomization.yaml

// XGBoostMLPolicySource represents an XGBoost runtime configuration.
// The number of workers per node is automatically derived from container GPU resources:
// - GPU training: 1 worker per GPU (from resourcesPerNode)
// - CPU training: 1 worker per node
Member

Can you clarify here that, with XGBoost, a single worker still consumes all CPU cores?
Ref: #3118 (comment)
cc @trivialfis

Comment on lines +94 to +98
if res := runtime.ExtractResourcePerNodeFromRuntime(info); res != nil {
	if gpuCount := runtime.GetNumGPUPerNode(res); gpuCount > 0 {
		numWorkersPerNode = int32(gpuCount)
	}
}
Member

This needs to be adjusted: you need to check the runtime resources first, set a temporary resourcesPerNode, and then override it with the value from the TrainJob.
Check here:
https://github.com/Krishna-kg732/trainer/blob/dc135be8b1428ac8145102a4a255826c9490a4e9/pkg/runtime/framework/plugins/torch/torch.go#L114-L118

Please also create these unit tests:

  1. Resources are not set
  2. Resources are set in Runtime only
  3. Resources are set in TrainJob
  4. Resources are set in Runtime and TrainJob
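
The precedence requested here, together with the four cases listed above, can be sketched as follows. All names (`resolveGPUs`, `workersPerNode`, `ptr`) are hypothetical; GPU counts are modelled as optional integers for brevity, whereas the real code works on Kubernetes resource requirements.

```go
package main

import "fmt"

// resolveGPUs starts from the runtime's GPU count and lets the TrainJob value
// override it when set. A nil pointer means "resources not set".
func resolveGPUs(runtimeGPUs, trainJobGPUs *int32) int32 {
	var gpus int32
	if runtimeGPUs != nil {
		gpus = *runtimeGPUs
	}
	if trainJobGPUs != nil {
		gpus = *trainJobGPUs
	}
	return gpus
}

// workersPerNode assumes one worker per GPU, or one worker for CPU-only nodes.
func workersPerNode(gpus int32) int32 {
	if gpus > 0 {
		return gpus
	}
	return 1
}

// ptr is a small helper for building optional values.
func ptr(v int32) *int32 { return &v }

func main() {
	cases := []struct {
		name              string
		runtime, trainJob *int32
	}{
		{"resources are not set", nil, nil},
		{"resources are set in Runtime only", ptr(2), nil},
		{"resources are set in TrainJob", nil, ptr(4)},
		{"resources are set in Runtime and TrainJob", ptr(2), ptr(4)},
	}
	for _, c := range cases {
		fmt.Printf("%s: %d worker(s) per node\n", c.name, workersPerNode(resolveGPUs(c.runtime, c.trainJob)))
	}
}
```

In this sketch the four cases yield 1, 2, 4, and 4 workers per node respectively, since the TrainJob value wins whenever it is present.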

}
}

func TestXGBoostValidate(t *testing.T) {
Member

Please move this Test to the top of the file.

utiltesting "github.com/kubeflow/trainer/v2/pkg/util/testing"
)

func TestXGBoostEnforceMLPolicy(t *testing.T) {
Member

@tenzen-y @astefanutti @kaisoz Shall we rename the JAX and Torch unit tests to follow a similar naming scheme too?
e.g. TestJAXEnforceMLPolicy: https://github.com/Krishna-kg732/trainer/blob/dc135be8b1428ac8145102a4a255826c9490a4e9/pkg/runtime/framework/plugins/jax/jax_test.go#L40

@@ -0,0 +1,467 @@
/*
Member

@@ -0,0 +1,135 @@
/*
Member

You also need to create a dedicated Trainer Runtime Docker image for XGBoost: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2598-XGboost-runtime-trainer-v2#container-image
