[release-4.19] OCPBUGS-73950: ensure deterministic zone ordering for control plane machines #10230
…achines Control plane machines were intermittently being created in different availability zones than specified in their machine specs. This occurred because the zone list returned from FilterZonesBasedOnInstanceType used a set's UnsortedList() func, which has a non-deterministic order. When CAPI and MAPI manifest generation independently called this func, they could receive zones in different orders, causing a mismatch in machine zone placements between CAPI and MAPI manifests. This commit ensures that we sort the zone slices before further processing.
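The ordering problem described above can be illustrated with a minimal, standard-library-only sketch. It does not reproduce the installer's code: the map-backed set, `sortedZones`, and `zoneForReplica` are hypothetical names, and the round-robin placement by replica index is an assumption made for illustration. The point is only that sorting the zone slice gives every caller the same order, so index-based placement agrees between independent callers.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedZones returns the members of a zone set in lexical order.
// Go randomizes map iteration order, so without the sort two calls
// on the same set may yield the zones in different orders.
func sortedZones(set map[string]struct{}) []string {
	zones := make([]string, 0, len(set))
	for z := range set {
		zones = append(zones, z)
	}
	sort.Strings(zones)
	return zones
}

// zoneForReplica is a hypothetical helper that assumes round-robin
// placement: replica i goes to zones[i % len(zones)].
// It returns "" when no zones are available.
func zoneForReplica(zones []string, i int) string {
	if len(zones) == 0 {
		return ""
	}
	return zones[i%len(zones)]
}

func main() {
	set := map[string]struct{}{
		"us-east-1c": {},
		"us-east-1a": {},
		"us-east-1b": {},
	}

	// Two independent callers (standing in for CAPI and MAPI manifest
	// generation) each derive their own zone slice from the same set.
	capiZones := sortedZones(set)
	mapiZones := sortedZones(set)

	for i := 0; i < 3; i++ {
		fmt.Printf("master-%d: CAPI=%s MAPI=%s\n",
			i, zoneForReplica(capiZones, i), zoneForReplica(mapiZones, i))
	}
}
```

Because both slices are sorted before indexing, each replica index resolves to the same zone for both callers, regardless of how the underlying set iterates.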
@openshift-cherrypick-robot: Jira Issue OCPBUGS-73785 has been cloned as Jira Issue OCPBUGS-73950. Will retitle bug to link to clone.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-73950, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker.
tthvo
left a comment
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: tthvo
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@openshift-cherrypick-robot: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/jira refresh
@tthvo: This pull request references Jira Issue OCPBUGS-73950, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
Requesting review from QA contact:
I'm working on it.
Verified:
anan@think:~/works/openshift-versions/419nightly$ ./openshift-install version
./openshift-install 4.19.0-0-2026-01-19-075027-test-ci-ln-8hdqhxk-latest
built from commit f5a064ca85b3f762894a7c7f9424068fa2bf8ae8
release image registry.build10.ci.openshift.org/ci-ln-8hdqhxk/release@sha256:7943fde9ebbe2f2079d27a232ebd396a7dc1d513605df89d37f25ac17057cae4
release architecture amd64
anan@think:~/works/openshift-versions/419nightly$ ../../my-openshift-workspace/OCPBUGS-69923/verify-manifests.sh
==========================================
Verify CAPI and MAPI Manifest Zone Consistency
==========================================
Installation directory: .
CAPI Machine Zones:
master-0 (99_openshift-cluster-api_master-machines-0.yaml): us-east-1a
master-1 (99_openshift-cluster-api_master-machines-1.yaml): us-east-1b
master-2 (99_openshift-cluster-api_master-machines-2.yaml): us-east-1c
MAPI Machine Zones:
master-0 (from 99_openshift-machine-api_master-control-plane-machine-set.yaml): us-east-1a
master-1 (from 99_openshift-machine-api_master-control-plane-machine-set.yaml): us-east-1b
master-2 (from 99_openshift-machine-api_master-control-plane-machine-set.yaml): us-east-1c
==========================================
Consistency Check
==========================================
✓ Match: master-0 - Zone: us-east-1a
✓ Match: master-1 - Zone: us-east-1b
✓ Match: master-2 - Zone: us-east-1c
✅ Verification PASSED: All machines have consistent zone allocation!
anan@think:~/works/openshift-versions/419nightly$ ../../my-openshift-workspace/OCPBUGS-69923/verify-cluster.sh
==========================================
Verify Machine Zone Consistency in Cluster
==========================================
Kubeconfig: /home/anan/works/openshift-versions/419nightly/auth/kubeconfig
✓ Successfully connected to cluster
Found 3 master machine(s)
==========================================
Check Zone Consistency for Each Machine
==========================================
--- Machine: weli-test-pv77c-master-0 ---
Zone Label: us-east-1a
ProviderID Zone: us-east-1a
Spec Zone: us-east-1a
Subnet Filter: weli-test-pv77c-subnet-private-us-east-1a
✅ Zone consistent
--- Machine: weli-test-pv77c-master-1 ---
Zone Label: us-east-1b
ProviderID Zone: us-east-1b
Spec Zone: us-east-1b
Subnet Filter: weli-test-pv77c-subnet-private-us-east-1b
✅ Zone consistent
--- Machine: weli-test-pv77c-master-2 ---
Zone Label: us-east-1c
ProviderID Zone: us-east-1c
Spec Zone: us-east-1c
Subnet Filter: weli-test-pv77c-subnet-private-us-east-1c
✅ Zone consistent
==========================================
Verification Summary
==========================================
Checked 3 master machine(s)
✅ Verification PASSED: All machines have consistent zones!
Cluster verification: PASS ✓
Fix verification successful:
- Zone label, ProviderID zone, and Spec zone are all consistent
- Machines are created in the correct availability zones
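The per-machine comparison reported in the output above can be sketched in Go. This is not the `verify-manifests.sh` script used in the verification, whose contents are not shown here. The function name `zoneMismatches` and the idea of passing the zones as two index-aligned slices are assumptions for illustration only.

```go
package main

import "fmt"

// zoneMismatches compares two index-aligned zone lists (one entry per
// control plane machine) and describes every index where they disagree.
// A length difference is reported as a single mismatch.
func zoneMismatches(capi, mapi []string) []string {
	if len(capi) != len(mapi) {
		return []string{fmt.Sprintf("machine count differs: CAPI=%d MAPI=%d", len(capi), len(mapi))}
	}
	var out []string
	for i := range capi {
		if capi[i] != mapi[i] {
			out = append(out, fmt.Sprintf("master-%d: CAPI=%s MAPI=%s", i, capi[i], mapi[i]))
		}
	}
	return out
}

func main() {
	// Zones taken from the verification output above.
	capi := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	mapi := []string{"us-east-1a", "us-east-1b", "us-east-1c"}

	if diffs := zoneMismatches(capi, mapi); len(diffs) == 0 {
		fmt.Println("all machines have consistent zone allocation")
	} else {
		for _, d := range diffs {
			fmt.Println("mismatch:", d)
		}
	}
}
```

If the two lists were produced from an unsorted set, a reordering such as `us-east-1b, us-east-1a, us-east-1c` on one side would be reported as mismatches for master-0 and master-1.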
/verified by liweinan
@liweinan: This PR has been marked as verified by
/hold
From @liweinan's analysis openshift/release#73935 (comment), I think this bug was introduced when we migrated to AWS SDK v2 (i.e. since 4.20). Thus,
The commit 4068682841 has additional changes to replace deprecated use of
We finished backporting to 4.20.z so it should be all good. Pending verification 👀
/close
See https://gist.github.com/liweinan/1c16388052387032931c1846035d8052 for analysis. We confirmed 4.19 and older releases do not contain the bug.
@tthvo: Closed this PR.
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-73950. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
This is an automated cherry-pick of #10219
/assign tthvo