fix: Enforce single ML policy constraint with CEL validation for Torch, MPI, and JAX by Krishna-kg732 · Pull Request #3225 · kubeflow/trainer

Krishna-kg732 · 2026-02-19T04:37:52Z

What this PR solves

This PR fixes a validation issue where multiple ML runtime policies (Torch, MPI, JAX) could be configured simultaneously in a TrainingRuntime, leading to conflicting runtime configurations.

The previous validation logic : !(has(self.torch) && has(self.mpi)) which only prevented Torch and MPI from being set together, but didn't account for:

JAX runtime policy
Scenarios where all three policies could be partially configured
Future extensibility for additional runtime policies

This allowed invalid configurations where users could set multiple incompatible runtime policies.

Solution

Added comprehensive CEL validation: Updated the validation rule to [has(self.torch), has(self.mpi), has(self.jax)].filter(x, x).size() <= 1 which:
- Creates a list of boolean values for each policy field
- Filters for truthy values (policies that are set)
- Ensures at most one policy is configured
Updated PlainML plugin: Modified the EnforceMLPolicy function to check for JAX policy alongside Torch and MPI, ensuring PlainML only applies when no other runtime policy is active

Testing

Validation occurs at the CRD level via CEL expressions
Runtime enforcement in PlainML plugin ensures correct fallback behavior

mentioned in PR#3200

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

google-oss-prow · 2026-02-19T04:37:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-02-19T04:38:00Z

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Slack: Join our #kubeflow-trainer Slack channel.
Meetings: Attend the Kubeflow AutoML and Training Working Group bi-weekly meetings.

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

This PR tightens TrainingRuntime ML policy validation to prevent configuring multiple incompatible runtime policies (Torch/MPI/JAX) at the same time, and aligns PlainML fallback behavior with that constraint.

Changes:

Updated CRD CEL validation to enforce “at most one of torch/mpi/jax is set”.
Updated PlainML’s EnforceMLPolicy to no-op when a JAX policy is configured (matching existing Torch/MPI behavior).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`pkg/runtime/framework/plugins/plainml/plainml.go`	Extends PlainML’s fallback guard to treat JAX as an explicitly selected runtime policy (so PlainML won’t apply).
`pkg/apis/trainer/v1alpha1/trainingruntime_types.go`	Replaces pairwise Torch/MPI exclusion with a single CEL rule that limits the number of configured ML policies to 1 across Torch/MPI/JAX.

pkg/runtime/framework/plugins/plainml/plainml.go

pkg/apis/trainer/v1alpha1/trainingruntime_types.go

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

andreyvelich · 2026-02-19T14:31:19Z

pkg/apis/trainer/v1alpha1/trainingruntime_types.go

 // MLPolicy represents configuration for the model training with ML-specific parameters.
 // +kubebuilder:validation:XValidation:rule="!(has(self.numNodes) && (has(self.torch) && has(self.torch.elasticPolicy)))", message="numNodes should not be set if torch.elasticPolicy is configured"
-// +kubebuilder:validation:XValidation:rule="!(has(self.torch) && has(self.mpi))", message="Only one of the policy can be configured"
+// +kubebuilder:validation:XValidation:rule="[has(self.torch), has(self.mpi), has(self.jax)].filter(x, x).size() <= 1", message="Only one of the policy can be configured"


@astefanutti @tenzen-y Does this CEL policy look good to you?

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

fix: enforce single ML policy constraint with CEL validation

c24496e

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

Copilot AI review requested due to automatic review settings February 19, 2026 04:37

google-oss-prow bot requested review from akshaychitneni and jinchihe February 19, 2026 04:37

google-oss-prow bot added the size/XS label Feb 19, 2026

Copilot started reviewing on behalf of Krishna-kg732 February 19, 2026 04:38 View session

Copilot AI reviewed Feb 19, 2026

View reviewed changes

pkg/runtime/framework/plugins/plainml/plainml.go Show resolved Hide resolved

pkg/apis/trainer/v1alpha1/trainingruntime_types.go Show resolved Hide resolved

added plainML fallback test case

69037dc

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

google-oss-prow bot added size/S and removed size/XS labels Feb 19, 2026

Krishna-kg732 changed the title ~~fix(JAX): Enforce single ML policy constraint with CEL validation for Torch, MPI, and JAX~~ fix: Enforce single ML policy constraint with CEL validation for Torch, MPI, and JAX Feb 19, 2026

andreyvelich reviewed Feb 19, 2026

View reviewed changes

google-oss-prow bot added size/M and removed size/S labels Feb 19, 2026

added autogenerated files

c22b56c

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

Krishna-kg732 force-pushed the fix/jax-validation branch from aa6bd0c to c22b56c Compare February 19, 2026 15:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Enforce single ML policy constraint with CEL validation for Torch, MPI, and JAX#3225

fix: Enforce single ML policy constraint with CEL validation for Torch, MPI, and JAX#3225
Krishna-kg732 wants to merge 3 commits intokubeflow:masterfrom
Krishna-kg732:fix/jax-validation

Krishna-kg732 commented Feb 19, 2026

Uh oh!

google-oss-prow bot commented Feb 19, 2026

Uh oh!

github-actions bot commented Feb 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

andreyvelich Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

Krishna-kg732 commented Feb 19, 2026

What this PR solves

Solution

Testing

Uh oh!

google-oss-prow bot commented Feb 19, 2026

Uh oh!

github-actions bot commented Feb 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

andreyvelich Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments