Open
23 commits
1ebe37e
refactor: migrate ClusterTopology validation to CRD and enable config…
Ronkahn21 Dec 22, 2025
cde8816
feat: implement ClusterTopology management with KAI integration
Ronkahn21 Dec 24, 2025
a225df1
feat: add Kubernetes client creation for ClusterTopology management
Ronkahn21 Dec 24, 2025
87093fa
fix type
Ronkahn21 Dec 24, 2025
62ba62a
chore: downgrade Go version to 1.24.0 in go.mod
Ronkahn21 Dec 24, 2025
0fffc60
feat: update ClusterTopology resource documentation and validation pa…
Ronkahn21 Dec 24, 2025
ff31dc3
feat: enhance ClusterTopology configuration with levels and update ro…
Ronkahn21 Dec 24, 2025
1d2d50d
feat: improve validation error messages for ClusterTopology levels
Ronkahn21 Dec 24, 2025
d5aefdb
feat: update API group for ClusterRole in clusterrole.yaml
Ronkahn21 Dec 24, 2025
633e9a1
feat: refactor ClusterTopology management and add unit tests
Ronkahn21 Dec 24, 2025
17d8d8f
feat: enhance EnsureTopology tests with comprehensive scenarios and v…
Ronkahn21 Dec 24, 2025
9c3353b
feat: reorganize imports in topology.go and topology_test.go for cons…
Ronkahn21 Dec 24, 2025
2813ddf
feat: implement EnsureDeleteClusterTopology function and enhance main…
Ronkahn21 Dec 25, 2025
75d6858
feat: update Go version to 1.24.5 in go.mod (bc KAI)
Ronkahn21 Dec 25, 2025
d0eb65f
Update values.yaml
Ronkahn21 Dec 25, 2025
3be0dbe
Refactoring of the PR containing the following changes:
unmarshall Dec 28, 2025
36e7af4
fixed check errors
unmarshall Dec 28, 2025
6d9f9fa
Fixed formatting and check errors
unmarshall Dec 29, 2025
4a07073
LeaderElection never worked as it was never enabled due to missing JSON
unmarshall Dec 29, 2025
da9bf11
reverted change manager.go->RegisterControllersAndWebhooks to make th…
unmarshall Dec 29, 2025
0d66cc7
removed LD_FLAGS from init container and removed programName as its n…
unmarshall Dec 29, 2025
5a4569f
fixed ld-flags.sh
unmarshall Dec 29, 2025
c63e4d2
feat: add timeout to skaffold command and define lease role and role …
Ronkahn21 Dec 29, 2025
13 changes: 7 additions & 6 deletions docs/api-reference/operator-api.md
@@ -608,9 +608,7 @@ _Appears in:_

_Underlying type:_ _string_

TopologyDomain represents a predefined topology level in the hierarchy.
Topology ordering (broadest to narrowest):
Region > Zone > DataCenter > Block > Rack > Host > Numa
TopologyDomain represents a level in the cluster topology hierarchy.



@@ -634,16 +632,19 @@ _Appears in:_


TopologyLevel defines a single level in the topology hierarchy.
Maps a platform-agnostic domain to a platform-specific node label key,
giving workload operators a consistent way to reference topology levels when defining TopologyConstraints.



_Appears in:_
- [ClusterTopologyConfiguration](#clustertopologyconfiguration)
- [ClusterTopologySpec](#clustertopologyspec)

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `domain` _[TopologyDomain](#topologydomain)_ | Domain is the predefined level identifier used in TopologyConstraint references.<br />Must be one of: region, zone, datacenter, block, rack, host, numa | | Enum: [region zone datacenter block rack host numa] <br />Required: \{\} <br /> |
| `key` _string_ | Key is the node label key that identifies this topology domain.<br />Must be a valid Kubernetes label key (qualified name).<br />Examples: "topology.kubernetes.io/zone", "kubernetes.io/hostname" | | MaxLength: 63 <br />MinLength: 1 <br />Required: \{\} <br /> |
| `domain` _[TopologyDomain](#topologydomain)_ | Domain is a platform provider-agnostic level identifier.<br />Must be one of: region, zone, datacenter, block, rack, host, numa | | Enum: [region zone datacenter block rack host numa] <br />Required: \{\} <br /> |
| `key` _string_ | Key is the node label key that identifies this topology domain.<br />Must be a valid Kubernetes label key (qualified name).<br />Examples: "topology.kubernetes.io/zone", "kubernetes.io/hostname" | | MaxLength: 63 <br />MinLength: 1 <br />Pattern: `^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]/)?([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$` <br />Required: \{\} <br /> |



@@ -702,7 +703,7 @@ _Appears in:_
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `enabled` _boolean_ | Enabled indicates whether topology-aware scheduling is enabled. | | |
| `name` _string_ | Name is the ClusterTopology resource name to use.<br />Defaults to "grove-topology" if not specified when topology is enabled. | | |
| `levels` _[TopologyLevel](#topologylevel) array_ | Levels is an ordered list of topology levels from broadest to narrowest scope.<br />Used to create/update the ClusterTopology CR at operator startup. | | |


#### ControllerConfiguration
1 change: 1 addition & 0 deletions docs/designs/topology.md
@@ -172,6 +172,7 @@ Domain TopologyDomain `json:"domain"`
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=63
// +kubebuilder:validation:Pattern=`^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]/)?([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$`
Key string `json:"key"`
}

3 changes: 1 addition & 2 deletions hack/ld-flags.sh
@@ -39,8 +39,7 @@ function build_ld_flags() {

FLAGS="-X $PACKAGE_PATH/version.gitCommit=$(git rev-parse --verify HEAD)
-X $PACKAGE_PATH/version.gitTreeState=$tree_state
-X $PACKAGE_PATH/version.buildDate=$build_date
-X $PACKAGE_PATH/version.programName=$PROGRAM_NAME"
-X $PACKAGE_PATH/version.buildDate=$build_date"

# The k8s.component-base/version.gitVersion can not be set to the version of grove
# due to the error: "emulation version 1.33 is not between [1.31, 0.1.0-dev]".
6 changes: 0 additions & 6 deletions operator/Makefile
@@ -86,12 +86,6 @@ cover-html: test-cover
@go tool cover -html=coverage.out -o coverage.html
@echo "Coverage report generated at coverage.html"

# Run envtest tests (requires envtest binaries)
.PHONY: test-envtest
test-envtest: $(SETUP_ENVTEST)
@echo "Running envtest with CRD validation..."
@KUBEBUILDER_ASSETS=$$($(SETUP_ENVTEST) use -p path) go test ./internal/webhook/admission/clustertopology/validation -v -run TestClusterTopologyCRDValidation

# Run e2e tests
.PHONY: test-e2e
test-e2e:
11 changes: 11 additions & 0 deletions operator/api/common/constants/constants.go
@@ -16,6 +16,15 @@

package constants

const (
// OperatorName is the name of the Grove operator.
OperatorName = "grove-operator"
// OperatorConfigGroupName is the name of the group for Grove operator configuration.
OperatorConfigGroupName = "operator.config.grove.io"
// OperatorGroupName is the name of the group for all Grove custom resources.
OperatorGroupName = "grove.io"
)

// Constants for finalizers.
const (
// FinalizerPodCliqueSet is the finalizer for PodCliqueSet that is added to `.metadata.finalizers[]` slice. This will be placed on all PodCliqueSet resources
@@ -107,4 +116,6 @@ const (
KindPodClique = "PodClique"
// KindPodCliqueScalingGroup is the kind for a PodCliqueScalingGroup resource.
KindPodCliqueScalingGroup = "PodCliqueScalingGroup"
// KindClusterTopology is the kind for a ClusterTopology resource.
KindClusterTopology = "ClusterTopology"
)
8 changes: 0 additions & 8 deletions operator/api/config/v1alpha1/defaults.go
@@ -27,7 +27,6 @@ const (
defaultLeaderElectionResourceLock = "leases"
defaultLeaderElectionResourceName = "grove-operator-leader-election"
defaultWebhookServerTLSServerCertDir = "/etc/grove-operator/webhook-certs"
defaultTopologyName = "grove-topology"
)

// SetDefaults_ClientConnectionConfiguration sets defaults for the k8s client connection.
@@ -115,10 +114,3 @@ func SetDefaults_PodCliqueScalingGroupControllerConfiguration(obj *PodCliqueScal
obj.ConcurrentSyncs = ptr.To(1)
}
}

// SetDefaults_ClusterTopologyConfiguration sets defaults for the ClusterTopologyConfiguration.
func SetDefaults_ClusterTopologyConfiguration(obj *ClusterTopologyConfiguration) {
if obj.Enabled && obj.Name == "" {
obj.Name = defaultTopologyName
}
}
85 changes: 0 additions & 85 deletions operator/api/config/v1alpha1/defaults_test.go

This file was deleted.

7 changes: 3 additions & 4 deletions operator/api/config/v1alpha1/register.go
@@ -17,16 +17,15 @@
package v1alpha1

import (
"github.com/ai-dynamo/grove/operator/api/common/constants"

"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/runtime/schema"
)

// GroupName is the group name used in this package
const GroupName = "operator.config.grove.io"

var (
// SchemeGroupVersion is group version used to register these objects
SchemeGroupVersion = schema.GroupVersion{Group: GroupName, Version: "v1alpha1"}
SchemeGroupVersion = schema.GroupVersion{Group: constants.OperatorConfigGroupName, Version: "v1alpha1"}
// SchemeBuilder is used to add go types to the GroupVersionKind scheme
SchemeBuilder = runtime.NewSchemeBuilder(addKnownTypes, addDefaultingFuncs)
// AddToScheme is a reference to the Scheme Builder's AddToScheme function.
24 changes: 13 additions & 11 deletions operator/api/config/v1alpha1/types.go
@@ -17,6 +17,8 @@
package v1alpha1

import (
corev1alpha1 "github.com/ai-dynamo/grove/operator/api/core/v1alpha1"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

@@ -55,14 +57,14 @@ var (
type OperatorConfiguration struct {
metav1.TypeMeta `json:",inline"`
ClientConnection ClientConnectionConfiguration `json:"runtimeClientConnection"`
LeaderElection LeaderElectionConfiguration
Server ServerConfiguration `json:"server"`
Debugging *DebuggingConfiguration `json:"debugging,omitempty"`
Controllers ControllerConfiguration `json:"controllers"`
LogLevel LogLevel `json:"logLevel"`
LogFormat LogFormat `json:"logFormat"`
Authorizer AuthorizerConfig `json:"authorizer"`
ClusterTopology ClusterTopologyConfiguration `json:"clusterTopology"`
LeaderElection LeaderElectionConfiguration `json:"leaderElection"`
Server ServerConfiguration `json:"server"`
Debugging *DebuggingConfiguration `json:"debugging,omitempty"`
Controllers ControllerConfiguration `json:"controllers"`
LogLevel LogLevel `json:"logLevel"`
LogFormat LogFormat `json:"logFormat"`
Authorizer AuthorizerConfig `json:"authorizer"`
ClusterTopology ClusterTopologyConfiguration `json:"clusterTopology"`
}

// LeaderElectionConfiguration defines the configuration for the leader election.
@@ -193,8 +195,8 @@ type AuthorizerConfig struct {
type ClusterTopologyConfiguration struct {
// Enabled indicates whether topology-aware scheduling is enabled.
Enabled bool `json:"enabled"`
// Name is the ClusterTopology resource name to use.
// Defaults to "grove-topology" if not specified when topology is enabled.
// Levels is an ordered list of topology levels from broadest to narrowest scope.
// Used to create/update the ClusterTopology CR at operator startup.
// +optional
Name string `json:"name,omitempty"`
Levels []corev1alpha1.TopologyLevel `json:"levels,omitempty"`
}
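
The `OperatorConfiguration` change above adds a `json:"leaderElection"` tag to a field that previously had none. The following is a minimal sketch of how the tag changes the serialized key name, using only the standard library's `encoding/json`. The types `leaderElection`, `untagged`, and `tagged` and the helper `encode` are hypothetical stand-ins, not the operator's types, and the sketch makes no claim about which decoder the operator uses to load its configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// leaderElection is a hypothetical stand-in for LeaderElectionConfiguration.
type leaderElection struct {
	Enabled bool `json:"enabled"`
}

// untagged mirrors the field before the change: no JSON tag, so the
// Go field name is used as the key.
type untagged struct {
	LeaderElection leaderElection
}

// tagged mirrors the field after the change: the key is lower camel case.
type tagged struct {
	LeaderElection leaderElection `json:"leaderElection"`
}

// encode marshals v and returns the JSON text. It panics on error because
// these fixed structs always marshal successfully.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	le := leaderElection{Enabled: true}
	fmt.Println(encode(untagged{LeaderElection: le})) // key is "LeaderElection"
	fmt.Println(encode(tagged{LeaderElection: le}))   // key is "leaderElection"
}
```

Without the tag, the key is `LeaderElection`, which differs in case from the lower-camel-case keys used by the other tagged fields in the struct.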
8 changes: 7 additions & 1 deletion operator/api/config/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

1 change: 0 additions & 1 deletion operator/api/config/v1alpha1/zz_generated.defaults.go

Some generated files are not rendered by default.

49 changes: 46 additions & 3 deletions operator/api/config/validation/validation.go
@@ -17,9 +17,12 @@
package validation

import (
"fmt"
"slices"
"strings"

configv1alpha1 "github.com/ai-dynamo/grove/operator/api/config/v1alpha1"
corev1alpha1 "github.com/ai-dynamo/grove/operator/api/core/v1alpha1"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/util/sets"
@@ -114,11 +117,51 @@ func mustBeGreaterThanZeroDuration(duration metav1.Duration, fldPath *field.Path
}

// validateClusterTopologyConfiguration validates the cluster topology configuration.
// When cluster topology is enabled, it ensures the topology name is provided.
// When cluster topology is enabled, it ensures that levels are provided,
// that every domain is supported, and that domains and keys are unique.
func validateClusterTopologyConfiguration(clusterTopologyCfg configv1alpha1.ClusterTopologyConfiguration, fldPath *field.Path) field.ErrorList {
allErrs := field.ErrorList{}
if clusterTopologyCfg.Enabled && len(strings.TrimSpace(clusterTopologyCfg.Name)) == 0 {
allErrs = append(allErrs, field.Required(fldPath.Child("name"), "clusterTopology name is required"))
if !clusterTopologyCfg.Enabled {
return allErrs
}
allErrs = validateClusterTopologyLevels(clusterTopologyCfg.Levels, fldPath.Child("levels"))
return allErrs
}

func validateClusterTopologyLevels(levels []corev1alpha1.TopologyLevel, fldPath *field.Path) field.ErrorList {
allErrs := field.ErrorList{}
if len(levels) == 0 {
allErrs = append(allErrs, field.Required(fldPath, "levels are required when topology is enabled"))
}
allErrs = append(allErrs, mustHaveSupportedTopologyDomains(levels, fldPath)...)
allErrs = append(allErrs, mustHaveUniqueTopologyLevels(levels, fldPath)...)
return allErrs
}

func mustHaveSupportedTopologyDomains(levels []corev1alpha1.TopologyLevel, fldPath *field.Path) field.ErrorList {
allErrs := field.ErrorList{}
supportedDomains := corev1alpha1.SupportedTopologyDomains()
for i, level := range levels {
if !slices.Contains(supportedDomains, level.Domain) {
allErrs = append(allErrs, field.Invalid(fldPath.Index(i).Child("domain"), level.Domain, fmt.Sprintf("must be one of %v", supportedDomains)))
}
}
return allErrs
}

func mustHaveUniqueTopologyLevels(levels []corev1alpha1.TopologyLevel, fldPath *field.Path) field.ErrorList {
allErrs := field.ErrorList{}
seenDomains := make(map[corev1alpha1.TopologyDomain]struct{})
seenKeys := make(map[string]struct{})
for i, level := range levels {
if _, exists := seenDomains[level.Domain]; exists {
allErrs = append(allErrs, field.Duplicate(fldPath.Index(i).Child("domain"), level.Domain))
}
if _, exists := seenKeys[level.Key]; exists {
allErrs = append(allErrs, field.Duplicate(fldPath.Index(i).Child("key"), level.Key))
}
seenDomains[level.Domain] = struct{}{}
seenKeys[level.Key] = struct{}{}
}
return allErrs
}
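
The duplicate check in `mustHaveUniqueTopologyLevels` can be exercised in isolation. The following is a minimal sketch, assuming simplified stand-ins: `topologyLevel` replaces `corev1alpha1.TopologyLevel`, and the hypothetical `findDuplicateLevels` returns plain strings instead of a `field.ErrorList`, so the message format is illustrative only.

```go
package main

import "fmt"

// topologyLevel is a simplified stand-in for corev1alpha1.TopologyLevel.
type topologyLevel struct {
	Domain string
	Key    string
}

// findDuplicateLevels reports every level whose domain or key repeats an
// earlier entry. The first occurrence is never reported, only the repeats.
func findDuplicateLevels(levels []topologyLevel) []string {
	var problems []string
	seenDomains := make(map[string]struct{})
	seenKeys := make(map[string]struct{})
	for i, level := range levels {
		if _, exists := seenDomains[level.Domain]; exists {
			problems = append(problems, fmt.Sprintf("levels[%d].domain: duplicate %q", i, level.Domain))
		}
		if _, exists := seenKeys[level.Key]; exists {
			problems = append(problems, fmt.Sprintf("levels[%d].key: duplicate %q", i, level.Key))
		}
		seenDomains[level.Domain] = struct{}{}
		seenKeys[level.Key] = struct{}{}
	}
	return problems
}

func main() {
	levels := []topologyLevel{
		{Domain: "zone", Key: "topology.kubernetes.io/zone"},
		{Domain: "host", Key: "kubernetes.io/hostname"},
		{Domain: "zone", Key: "example.com/zone"},       // repeats domain "zone"
		{Domain: "rack", Key: "kubernetes.io/hostname"}, // repeats the hostname key
	}
	for _, p := range findDuplicateLevels(levels) {
		fmt.Println(p)
	}
}
```

Because both maps are updated on every iteration, a single pass reports each repeat at the index where it occurs, which matches the `fldPath.Index(i)` paths used in the diff.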