OCPBUGS-69923: ensure deterministic zone ordering for control plane machines #10188
Control plane machines were intermittently being created in different availability zones than specified in their machine specs. This occurred because the zone list returned from FilterZonesBasedOnInstanceType used a set's UnsortedList() func, which has a non-deterministic order.
When CAPI and MAPI manifest generation independently called this func, they could receive zones in different orders, causing a mismatch in machine zone placements between CAPI and MAPI manifests.
This commit ensures that we sort the zone slices before further processing.
|
@tthvo: This pull request references Jira Issue OCPBUGS-69923, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. |
|
Notes: The new sorting only applies when defaulting to the available AZs in the region (i.e. not BYO subnets and no specific zones defined in the machine pool).
I didn't add any sorting for worker machine manifests because this problem only occurs with control plane machines, whose manifests are generated by two separate systems, MAPI and CAPI. |
|
/test e2e-aws-default-config e2e-aws-ovn-shared-vpc-custom-security-groups e2e-aws-byo-subnet-role-security-groups e2e-aws-ovn-shared-vpc-edge-zones e2e-aws-ovn-edge-zones |
|
/lgtm |
|
/jira refresh |
|
@tthvo: This pull request references Jira Issue OCPBUGS-69923, which is valid. 3 validation(s) were run on this bug
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: patrickdillon |
|
PR verified:
anan@think:~/works/openshift-versions/421beta4$ ./openshift-install version
./openshift-install 4.21.0-0-2026-01-13-132000-test-ci-ln-x92lq5b-latest
built from commit e1039636fe8f0c073ebb9b728503995c002f05a0
release image registry.build10.ci.openshift.org/ci-ln-x92lq5b/release@sha256:28d37d5e8123cde6d70b02b0f081f9e45859d1880ccc763acaac548353bce436
release architecture amd64
anan@think:~/works/openshift-versions/421beta4$ for file in openshift/99_openshift-cluster-api_master-machines-*.yaml; do echo "$(basename $file): $(yq eval '.spec.providerSpec.value.placement.availabilityZone' "$file")"; done | sort
99_openshift-cluster-api_master-machines-0.yaml: us-east-1a
99_openshift-cluster-api_master-machines-1.yaml: us-east-1b
99_openshift-cluster-api_master-machines-2.yaml: us-east-1c
anan@think:~/works/openshift-versions/421beta4$ echo "=== MAPI Zones (first 3) ===" && yq eval '.spec.template.machines_v1beta1_machine_openshift_io.failureDomains.aws[].placement.availabilityZone' openshift/99_openshift-machine-api_master-control-plane-machine-set.yaml | head -3 | nl -v0 -w1 -s': '
=== MAPI Zones (first 3) ===
0: us-east-1a
1: us-east-1b
2: us-east-1c
---
anan@think:~/works/openshift-versions/421beta4$ export KUBECONFIG=/home/anan/works/openshift-versions/421beta4/auth/kubeconfig
anan@think:~/works/openshift-versions/421beta4$ echo "=== Zone Labels ==="
for machine in $(oc get machine -n openshift-machine-api -l machine.openshift.io/cluster-api-machine-role=master -o jsonpath='{.items[*].metadata.name}'); do
echo "$machine: $(oc get machine "$machine" -n openshift-machine-api -o jsonpath='{.metadata.labels.machine\.openshift\.io/zone}')"
done
=== Zone Labels ===
weli-test5-rwt9m-master-0: us-east-1a
weli-test5-rwt9m-master-1: us-east-1b
weli-test5-rwt9m-master-2: us-east-1c
anan@think:~/works/openshift-versions/421beta4$ echo "=== ProviderID Zones ==="
for machine in $(oc get machine -n openshift-machine-api -l machine.openshift.io/cluster-api-machine-role=master -o jsonpath='{.items[*].metadata.name}'); do
provider_id=$(oc get machine "$machine" -n openshift-machine-api -o jsonpath='{.spec.providerID}')
provider_zone=$(echo "$provider_id" | grep -oP 'aws:///\K[^/]+')
echo "$machine: $provider_zone"
done
=== ProviderID Zones ===
weli-test5-rwt9m-master-0: us-east-1a
weli-test5-rwt9m-master-1: us-east-1b
weli-test5-rwt9m-master-2: us-east-1c
anan@think:~/works/openshift-versions/421beta4$ echo "=== Spec Zones ==="
for machine in $(oc get machine -n openshift-machine-api -l machine.openshift.io/cluster-api-machine-role=master -o jsonpath='{.items[*].metadata.name}'); do
echo "$machine: $(oc get machine "$machine" -n openshift-machine-api -o jsonpath='{.spec.providerSpec.value.placement.availabilityZone}')"
done
=== Spec Zones ===
weli-test5-rwt9m-master-0: us-east-1a
weli-test5-rwt9m-master-1: us-east-1b
weli-test5-rwt9m-master-2: us-east-1c |
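The manual checks above compare the zone reported for each control plane machine across several sources (zone label, provider ID, spec). A rough sketch of the same comparison in Go is shown below. `zoneMismatches` is a hypothetical helper, and the maps are filled in by hand from the output above rather than read from a cluster.

```go
package main

import (
	"fmt"
	"slices"
)

// zoneMismatches (hypothetical) compares the zone recorded per machine in
// two sources and returns the sorted names of machines whose zones differ
// or that appear in only one of the sources.
func zoneMismatches(a, b map[string]string) []string {
	var bad []string
	for name, zoneA := range a {
		if zoneB, ok := b[name]; !ok || zoneA != zoneB {
			bad = append(bad, name)
		}
	}
	for name := range b {
		if _, ok := a[name]; !ok {
			bad = append(bad, name)
		}
	}
	slices.Sort(bad)
	return bad
}

func main() {
	// Values copied from the "Spec Zones" output above.
	spec := map[string]string{
		"weli-test5-rwt9m-master-0": "us-east-1a",
		"weli-test5-rwt9m-master-1": "us-east-1b",
		"weli-test5-rwt9m-master-2": "us-east-1c",
	}
	// Values copied from the "ProviderID Zones" output above.
	providerID := map[string]string{
		"weli-test5-rwt9m-master-0": "us-east-1a",
		"weli-test5-rwt9m-master-1": "us-east-1b",
		"weli-test5-rwt9m-master-2": "us-east-1c",
	}

	mismatched := zoneMismatches(spec, providerID)
	fmt.Println("mismatched machines:", len(mismatched), mismatched)
}
```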
|
/verified by liweinan |
|
@liweinan: This PR has been marked as verified by |
|
/label acknowledge-critical-fixes-only |
|
@tthvo: all tests passed! |
|
/tide refresh |
|
@tthvo: Jira Issue Verification Checks: Jira Issue OCPBUGS-69923. Jira Issue OCPBUGS-69923 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. |
|
/cherry-pick release-4.21 |
|
@tthvo: new pull request created: #10214 |
|
CI job work in progress: openshift/release#73722 |
Add comprehensive analysis of OCPBUGS-69923 (zone inconsistency bug) and test tools for reproduction:
- docs/dev/OCPBUGS-69923-analysis.md: Detailed root cause analysis explaining why FilterZonesBasedOnInstanceType() returns different zone orders for Master.Generate and ClusterAPI.Generate due to sets.UnsortedList() non-determinism.
- hack/zone-test/debug-zone-check.sh: Shell script to generate manifests and compare zones between real CAPI files (cluster-api/machines/10_inframachine_*) and MAPI files (ControlPlaneMachineSet failureDomains).
- hack/zone-test/map_order.go: Go test program demonstrating ~85% mismatch rate when simulating independent UnsortedList() calls, confirming the bug's root cause.
Key finding: The file 99_openshift-cluster-api_master-machines-*.yaml is misleadingly named - it contains MAPI Machine objects, not CAPI. Real CAPI files are in cluster-api/machines/ directory.
Reference: OCPBUGS-69923, PR openshift#10188
Control plane machines were intermittently being created in different availability zones than specified in their machine specs. This occurred because the zone list returned from FilterZonesBasedOnInstanceType used a set's UnsortedList() func, which has a non-deterministic order.
When CAPI and MAPI manifest generation independently called this func, they could receive zones in different orders, causing a mismatch in machine zone placements between CAPI and MAPI manifests.
This PR ensures that we sort the zone slices before further processing.