OCPBUGS-69923: Add static zone consistency validation test for AWS #73935
base: master
|
@liweinan: This pull request references Jira Issue OCPBUGS-69923, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: liweinan
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
956014e to a609dac
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-aws-ipi-zone-consistency-f14 |
|
@liweinan: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
a609dac to 7859348
Add a simplified job that validates CAPI/MAPI zone allocation consistency in generated manifests without requiring actual cluster installation. This is a static validation test using openshift-install create manifests.
|
Related PR: openshift/installer#10214 |
|
@tthvo Could you please help to review this root cause analysis:
Root Cause: openshift/installer@4068682841#r175655773

OCPBUGS-69923 Bug Root Cause Analysis

Summary
The zone inconsistency bug was introduced by commit

Root Cause
The problematic change in
- return found[instanceType].Intersection(sets.NewString(zones...)).List(), nil
+ return found[instanceType].Intersection(sets.New(zones...)).UnsortedList(), nil
When CAPI and MAPI generation code independently call

Git Forensics
Step 1: Find commits that modified the file
git log --all --oneline --source -- pkg/asset/machines/aws/instance_types.go
Step 2: Inspect the suspicious commit
git show 4068682841 --stat
Step 3: Confirm the code change
git show 4068682841 -- pkg/asset/machines/aws/instance_types.go | grep -A2 -B2 "UnsortedList"

Timeline
Correct Testing Method

Version Requirements
Important: Testing with versions before

Build a Buggy Version
cd installer
# Find a commit in the buggy window (after Aug 14, before PR #10188: https://github.com/openshift/installer/pull/10188)
git log --oneline --after="2025-08-14" --before="2025-12-23" -- pkg/asset/machines/aws/instance_types.go
git checkout <commit-in-buggy-window>
hack/build.sh

Expected Mismatch Rate
Based on Go map iteration simulation with

Related PRs |
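The non-deterministic ordering described in this analysis can be illustrated with a small, self-contained Go sketch. It does not use the installer code or the `sets` package; `unsortedZones`, `sameOrder`, and `countMismatches` are hypothetical helpers that stand in for two independent `UnsortedList()`-style calls over a map-backed set. The zone names are sample values, and the printed count varies from run to run.

```go
package main

import "fmt"

// unsortedZones mimics an UnsortedList-style call: it walks a map-backed
// set, so the order of the result is unspecified.
// Hypothetical helper for illustration; it is not installer code.
func unsortedZones(set map[string]struct{}) []string {
	out := make([]string, 0, len(set))
	for z := range set {
		out = append(out, z)
	}
	return out
}

// sameOrder reports whether two slices hold the same zones in the same order.
func sameOrder(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// countMismatches lists the set twice per trial, as two independent callers
// would, and counts the trials where the first n zones differ.
// n must not exceed the number of zones in the set.
func countMismatches(set map[string]struct{}, n, trials int) int {
	mismatches := 0
	for i := 0; i < trials; i++ {
		a, b := unsortedZones(set), unsortedZones(set)
		if !sameOrder(a[:n], b[:n]) {
			mismatches++
		}
	}
	return mismatches
}

func main() {
	// Sample zones; three control plane machines take the first three.
	zones := map[string]struct{}{
		"us-east-1a": {}, "us-east-1b": {}, "us-east-1c": {},
		"us-east-1d": {}, "us-east-1f": {},
	}
	const trials = 1000
	m := countMismatches(zones, 3, trials)
	fmt.Printf("mismatched first-3 zones in %d of %d trials\n", m, trials)
}
```

The measured rate depends on the Go runtime's map iteration behaviour and on the number of zones, so it should be read as an indication rather than a fixed figure.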
OCPBUGS-69923: Zone Consistency Bug Analysis

Analysis Branch: https://github.com/openshift/installer/compare/main...liweinan:installer:ocpbugs-69923-analysis?expand=1

Executive Summary
This document analyzes OCPBUGS-69923, a bug where control plane machines created by the OpenShift installer have inconsistent availability zone assignments between Cluster API (CAPI) and Machine API (MAPI) resources. The root cause is the non-deterministic ordering returned by

Bug Description
When deploying an OpenShift cluster on AWS without explicitly specifying availability zones in the install-config, the installer generates manifests where:

This inconsistency can cause issues during cluster operation when the control plane machine set controller attempts to reconcile machine states.

Root Cause Analysis

Code Path

The Problem
In
func FilterZonesBasedOnInstanceType(ctx context.Context, meta *awsconfig.Metadata, instanceType string, zones []string) ([]string, error) {
	// ... EC2 API call to get zone info ...
	// BUG: UnsortedList() returns zones in non-deterministic order
	return found[instanceType].Intersection(sets.New(zones...)).UnsortedList(), nil
}
The

Why Two Independent Calls?

Reproduction

Test Results
Using a Go test program that simulates
This confirms that two independent calls to

Manual Verification
Running the installer multiple times shows:
The generated manifests confirm the mismatch:
# CAPI AWSMachine (cluster-api/machines/10_inframachine_*-master-0.yaml)
spec:
  subnet:
    filters:
    - name: tag:Name
      values:
      - cluster-subnet-private-us-east-1a # Zone: a
# MAPI Machine (openshift/99_openshift-cluster-api_master-machines-0.yaml)
spec:
  providerSpec:
    value:
      placement:
        availabilityZone: us-east-1d # Zone: d

File Naming Clarification
Important: The file
Fix
The fix (implemented in PR #10188) is to sort the zones after calling
func FilterZonesBasedOnInstanceType(...) ([]string, error) {
	// ...
	result := found[instanceType].Intersection(sets.New(zones...)).UnsortedList()
	slices.Sort(result) // FIX: Sort to ensure deterministic order
	return result, nil
}
Alternatively, use

Testing Scripts

Local Test Script
See

Go Simulation Test
See

Key Findings

References

Appendix: Affected Code Locations
|
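The sort-based fix described in this analysis can be sketched in isolation as follows. This is a minimal illustration and not the installer's implementation: `sortedZones` is a hypothetical helper, a plain Go map stands in for the set type, and the `slices` package assumes Go 1.21 or later.

```go
package main

import (
	"fmt"
	"slices"
)

// sortedZones lists a map-backed set and sorts the result, so every caller
// sees the same order regardless of map iteration order.
// Hypothetical helper; the real fix lives in FilterZonesBasedOnInstanceType.
func sortedZones(set map[string]struct{}) []string {
	out := make([]string, 0, len(set))
	for z := range set {
		out = append(out, z)
	}
	slices.Sort(out) // deterministic order for all callers
	return out
}

func main() {
	// Sample zones, deliberately inserted out of order.
	zones := map[string]struct{}{"us-east-1d": {}, "us-east-1a": {}, "us-east-1f": {}}
	fmt.Println(sortedZones(zones)) // prints [us-east-1a us-east-1d us-east-1f]
	fmt.Println(sortedZones(zones)) // prints [us-east-1a us-east-1d us-east-1f]
}
```

Sorting at the point where the list is produced means independent callers, such as the CAPI and MAPI generation paths, do not each need to remember to sort.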
Part of CORS-3819, this focuses on helper funcs to query instance types for machines. The changes also replaced the deprecated use of sets.String with the generic sets.Set[string] to satisfy golint.
|
The bug can be reproduced with a local build of this commit: openshift/installer@4068682841
#!/bin/bash
# Verify zone consistency between CAPI and MAPI in locally generated manifests
# Used to validate CI script logic
set -e
WORK_DIR="${1:-/Users/weli/works/oc-swarm/installer/bin}"
echo "=========================================="
echo "OCPBUGS-69923 Local Manifests Verification"
echo "=========================================="
echo "Directory: $WORK_DIR"
echo ""
# Extract CAPI zones (from cluster-api/machines/10_inframachine_*-master-*.yaml)
echo "=========================================="
echo "CAPI Zones (from cluster-api/machines/10_inframachine_*-master-*.yaml)"
echo "=========================================="
capi_zones=""
for file in $(find "$WORK_DIR"/cluster-api/machines -name "10_inframachine_*-master-*.yaml" -type f 2>/dev/null | sort); do
    echo "File: $(basename "$file")"
    # Extract subnet filter name
    subnet_name=$(yq eval '.spec.subnet.filters[0].values[0]' "$file" 2>/dev/null || echo "null")
    echo " subnet filter: $subnet_name"
    # Extract zone from subnet name (e.g., "xxx-us-east-1a" -> "us-east-1a")
    zone=$(echo "$subnet_name" | grep -oE '[a-z]+-[a-z]+-[0-9][a-z]$' || echo "null")
    echo " zone: $zone"
    if [[ -n "$zone" && "$zone" != "null" ]]; then
        capi_zones="$capi_zones $zone"
    fi
done
capi_zones=$(echo "$capi_zones" | xargs)
capi_count=$(echo "$capi_zones" | wc -w | tr -d ' ')
echo ""
echo "CAPI Zones result: [$capi_zones] (total: $capi_count)"
# Extract MAPI zones (from ControlPlaneMachineSet)
echo ""
echo "=========================================="
echo "MAPI Zones (from ControlPlaneMachineSet failureDomains)"
echo "=========================================="
mapi_zones=""
cpms_file="$WORK_DIR/openshift/99_openshift-machine-api_master-control-plane-machine-set.yaml"
if [[ -f "$cpms_file" ]]; then
    echo "File: $(basename "$cpms_file")"
    echo "All failureDomains zones:"
    all_zones=$(yq eval '.spec.template.machines_v1beta1_machine_openshift_io.failureDomains.aws[].placement.availabilityZone' "$cpms_file" 2>/dev/null)
    idx=0
    for zone in $all_zones; do
        echo " [$idx] $zone"
        # Keep only as many MAPI zones as there are CAPI machines
        if [[ "$zone" != "null" && -n "$zone" && $idx -lt $capi_count ]]; then
            mapi_zones="$mapi_zones $zone"
        fi
        idx=$((idx + 1))
    done
else
    echo "ERROR: ControlPlaneMachineSet file not found!"
fi
mapi_zones=$(echo "$mapi_zones" | xargs)
echo ""
echo "MAPI Zones result (first $capi_count): [$mapi_zones]"
# Compare
echo ""
echo "=========================================="
echo "Comparison Result"
echo "=========================================="
echo "CAPI: [$capi_zones]"
echo "MAPI: [$mapi_zones]"
echo ""
if [[ "$capi_zones" == "$mapi_zones" ]]; then
    echo "✅ CONSISTENT - CAPI and MAPI zones match"
    echo ""
    echo "Note: Current installer version did not trigger the bug, or this run happened to be consistent"
    exit 0
else
    echo "❌ INCONSISTENT - Zone mismatch detected! (OCPBUGS-69923)"
    echo ""
    echo "This indicates the bug is present in the current installer version!"
    echo "Fix: PR #10188 adds slices.Sort() after UnsortedList() calls"
    exit 1
fi
==========================================
OCPBUGS-69923 Local Manifests Verification
==========================================
Directory: /Users/weli/works/oc-swarm/installer/bin
==========================================
CAPI Zones (from cluster-api/machines/10_inframachine_*-master-*.yaml)
==========================================
File: 10_inframachine_weli-test-trft6-master-0.yaml
subnet filter: weli-test-trft6-subnet-private-us-east-1f
zone: us-east-1f
File: 10_inframachine_weli-test-trft6-master-1.yaml
subnet filter: weli-test-trft6-subnet-private-us-east-1a
zone: us-east-1a
File: 10_inframachine_weli-test-trft6-master-2.yaml
subnet filter: weli-test-trft6-subnet-private-us-east-1b
zone: us-east-1b
CAPI Zones result: [us-east-1f us-east-1a us-east-1b] (total: 3)
==========================================
MAPI Zones (from ControlPlaneMachineSet failureDomains)
==========================================
File: 99_openshift-machine-api_master-control-plane-machine-set.yaml
All failureDomains zones:
[0] us-east-1d
[1] us-east-1f
[2] us-east-1a
[3] us-east-1b
[4] us-east-1c
MAPI Zones result (first 3): [us-east-1d us-east-1f us-east-1a]
==========================================
Comparison Result
==========================================
CAPI: [us-east-1f us-east-1a us-east-1b]
MAPI: [us-east-1d us-east-1f us-east-1a]
❌ INCONSISTENT - Zone mismatch detected! (OCPBUGS-69923)
This indicates the bug is present in the current installer version!
Fix: PR #10188 adds slices.Sort() after UnsortedList() calls
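
The comparison the script performs can also be written as a small Go sketch, which makes the rule explicit: the zones taken from the CAPI subnet filters must equal the first N failure-domain zones of the ControlPlaneMachineSet, in order. `zoneFromSubnet` and `consistent` are hypothetical names, the regular expression copies the script's grep pattern, and the input values are taken from the output above rather than read from manifest files.

```go
package main

import (
	"fmt"
	"regexp"
)

// zoneSuffix matches a trailing zone such as "us-east-1a". It mirrors the
// grep pattern in the script and shares its limits: a zone name with more
// than two name parts before the number would not be captured in full.
var zoneSuffix = regexp.MustCompile(`[a-z]+-[a-z]+-[0-9][a-z]$`)

// zoneFromSubnet returns the zone at the end of a subnet name, or "" if
// the name does not end in a zone.
func zoneFromSubnet(name string) string {
	return zoneSuffix.FindString(name)
}

// consistent reports whether the CAPI zones equal the first len(capi)
// MAPI failure-domain zones, in the same order.
func consistent(capi, mapi []string) bool {
	if len(mapi) < len(capi) {
		return false
	}
	for i, z := range capi {
		if mapi[i] != z {
			return false
		}
	}
	return true
}

func main() {
	// Subnet filter values and failure domains copied from the run above.
	subnets := []string{
		"weli-test-trft6-subnet-private-us-east-1f",
		"weli-test-trft6-subnet-private-us-east-1a",
		"weli-test-trft6-subnet-private-us-east-1b",
	}
	var capi []string
	for _, s := range subnets {
		capi = append(capi, zoneFromSubnet(s))
	}
	mapi := []string{"us-east-1d", "us-east-1f", "us-east-1a", "us-east-1b", "us-east-1c"}
	fmt.Println(capi, consistent(capi, mapi)) // prints [us-east-1f us-east-1a us-east-1b] false
}
```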
|
…I files
Previous script compared MAPI files to MAPI files (wrong). Now correctly compares:
- CAPI: cluster-api/machines/10_inframachine_*-master-*.yaml (subnet filter)
- MAPI: openshift/99_openshift-machine-api_master-control-plane-machine-set.yaml (failureDomains)
Note: openshift/99_openshift-cluster-api_master-machines-*.yaml is MAPI despite the name!
7859348 to ea76f19
|
[REHEARSALNOTIFIER]
Interacting with pj-rehearse
Comment:
Once you are satisfied with the results of the rehearsals, comment: |
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-aws-ipi-zone-consistency-f14 |
|
@liweinan: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@yunjiang29, I reproduced the problem on the older related installer commit described above. The occurrence rate is around 80%, so I set the CI job to run 10 rounds, and the test job ran without any problems. Could you please help to review this PR? Thanks! |
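As a rough check on the choice of 10 rounds: if each round independently exposes the mismatch with probability of about 0.8, the chance that a buggy installer passes every round is about 0.2^10, roughly 1 in 10 million. The sketch below computes this; the independence of rounds is an assumption, 0.8 is the approximate rate quoted above, and `missProbability` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// missProbability returns the chance that every one of the given rounds
// looks consistent when a single round exposes the mismatch with rate p.
// Assumes the rounds are independent of each other.
func missProbability(p float64, rounds int) float64 {
	return math.Pow(1-p, float64(rounds))
}

func main() {
	// 0.8 is the approximate occurrence rate quoted in the comment above.
	fmt.Printf("chance of missing the bug in 10 rounds: %.3g\n", missProbability(0.8, 10))
}
```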
|
@liweinan: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/assign @yunjiang29 |
|
Verification process for |