Skip to content

Conversation

@liweinan
Copy link
Contributor

Add a simplified job that validates CAPI/MAPI zone allocation consistency in generated manifests without requiring actual cluster installation. This is a static validation test using openshift-install create manifests.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 26, 2026
@openshift-ci-robot
Copy link
Contributor

@liweinan: This pull request references Jira Issue OCPBUGS-69923, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Verified instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Add a simplified job that validates CAPI/MAPI zone allocation consistency in generated manifests without requiring actual cluster installation. This is a static validation test using openshift-install create manifests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liweinan
Once this PR has been reviewed and has the lgtm label, please assign liangxia for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liweinan
Copy link
Contributor Author

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-aws-ipi-zone-consistency-f14

@openshift-ci-robot
Copy link
Contributor

@liweinan: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@liweinan liweinan force-pushed the OCPBUGS-69923-remake branch from a609dac to 7859348 Compare January 26, 2026 09:38
Add a simplified job that validates CAPI/MAPI zone allocation consistency
in generated manifests without requiring actual cluster installation.
This is a static validation test using openshift-install create manifests.
@liweinan
Copy link
Contributor Author

Relative PR: openshift/installer#10214

@liweinan
Copy link
Contributor Author

liweinan commented Jan 26, 2026

@tthvo Could you please help to review this root cause analysis:

Root Cause : openshift/installer@4068682841#r175655773

OCPBUGS-69923 Bug Root Cause Analysis

Summary

The zone inconsistency bug was introduced by commit 4068682841 during AWS SDK v2 migration, where .List() (sorted) was changed to .UnsortedList() (random order).

Root Cause

The problematic change in pkg/asset/machines/aws/instance_types.go:

- return found[instanceType].Intersection(sets.NewString(zones...)).List(), nil
+ return found[instanceType].Intersection(sets.New(zones...)).UnsortedList(), nil
  • .List(): Returns a sorted slice (deterministic)
  • .UnsortedList(): Returns a random-order slice (non-deterministic)

When CAPI and MAPI generation code independently call FilterZonesBasedOnInstanceType, they may receive zones in different orders, causing zone allocation mismatch.

Git Forensics

Step 1: Find commits that modified the file

git log --all --oneline --source -- pkg/asset/machines/aws/instance_types.go

Step 2: Inspect the suspicious commit

git show 4068682841 --stat
Author: Thuan Vo <thvo@redhat.com>
Date:   Thu Aug 14 14:34:15 2025 -0700

    CORS-4056: migrate AWS SDK to v2 in asset/machines/aws
    
    The changes also replaced the deprecated use of sets.String with the
    generic sets.Set[string] to satisfy golint.

Step 3: Confirm the code change

git show 4068682841 -- pkg/asset/machines/aws/instance_types.go | grep -A2 -B2 "UnsortedList"

Timeline

Date Event
Before 2025-08-14 ✅ Using .List() - correct (sorted)
2025-08-14 ❌ Commit 4068682841 changed to .UnsortedList() - bug introduced
2025-12-23 PR #10188 added slices.Sort() after the call - bug fixed

Correct Testing Method

Version Requirements

Version Date Bug Status
Before 4068682841 < 2025-08-14 ✅ No bug (uses .List())
4068682841 ~ PR #10188 2025-08-14 ~ 2025-12-23 Has bug (uses .UnsortedList())
After PR #10188 > 2025-12-23 ✅ Bug fixed (adds slices.Sort())

Important: Testing with versions before 4068682841 will show 0% mismatch because those versions use the sorted .List() method.

Build a Buggy Version

cd installer
# Find a commit in the buggy window (after Aug 14, before [**PR #10188**](https://github.com/openshift/installer/pull/10188))
git log --oneline --after="2025-08-14" --before="2025-12-23" -- pkg/asset/machines/aws/instance_types.go
git checkout <commit-in-buggy-window>
hack/build.sh

Expected Mismatch Rate

Based on Go map iteration simulation with UnsortedList():

Scenario Mismatch Rate
3 zones ~41%
6 zones (select 3) ~78%

Related PRs

  • PR #9662: Fixed byo-subnets scenario (map iteration over subnets)
  • PR #10188: Fixed default zone selection scenario (UnsortedList in FilterZonesBasedOnInstanceType)

@liweinan
Copy link
Contributor Author

liweinan commented Jan 26, 2026

OCPBUGS-69923: Zone Consistency Bug Analysis

Analyse Branch: https://github.com/openshift/installer/compare/main...liweinan:installer:ocpbugs-69923-analysis?expand=1

Executive Summary

This document analyzes OCPBUGS-69923, a bug where control plane machines created by the OpenShift installer have inconsistent availability zone assignments between Cluster API (CAPI) and Machine API (MAPI) resources. The root cause is the non-deterministic ordering returned by sets.UnsortedList() when called independently by Master.Generate and ClusterAPI.Generate.

Bug Description

When deploying an OpenShift cluster on AWS without explicitly specifying availability zones in the install-config, the installer generates manifests where:

  • CAPI AWSMachine resources are assigned to one set of zones (e.g., a, b, c)
  • MAPI Machine resources and ControlPlaneMachineSet failureDomains are assigned to a different set of zones (e.g., d, f, a)

This inconsistency can cause issues during cluster operation when the control plane machine set controller attempts to reconcile machine states.

Root Cause Analysis

Code Path

  1. Master.Generate (pkg/asset/machines/master.go)

    • Calls aws.FilterZonesBasedOnInstanceType() to get filtered zones
    • Uses result to generate MAPI Machines (99_openshift-cluster-api_master-machines-*.yaml)
    • Uses result to generate ControlPlaneMachineSet (99_openshift-machine-api_master-control-plane-machine-set.yaml)
  2. ClusterAPI.Generate (pkg/asset/machines/clusterapi.go)

    • Independently calls aws.FilterZonesBasedOnInstanceType() to get filtered zones
    • Uses result to generate CAPI AWSMachines (cluster-api/machines/10_inframachine_*-master-*.yaml)

The Problem

In pkg/asset/machines/aws/instance_types.go:

func FilterZonesBasedOnInstanceType(ctx context.Context, meta *awsconfig.Metadata, instanceType string, zones []string) ([]string, error) {
    // ... EC2 API call to get zone info ...
    
    // BUG: UnsortedList() returns zones in non-deterministic order
    return found[instanceType].Intersection(sets.New(zones...)).UnsortedList(), nil
}

The sets.UnsortedList() function iterates over a Go map, which has intentionally randomized iteration order. Each independent call can return zones in a different order.

Why Two Independent Calls?

  • Master and ClusterAPI are independent Assets with no direct dependency on each other
  • Both depend on InstallConfig, but generate their manifests separately
  • Each calls FilterZonesBasedOnInstanceType() in its own Generate() method
  • The Asset store executes them sequentially, but they don't share the filtered zones result

Reproduction

Test Results

Using a Go test program that simulates sets.UnsortedList() behavior:

100 simulations: 85% mismatch rate

This confirms that two independent calls to UnsortedList() with identical inputs will produce different orderings approximately 85% of the time.

Manual Verification

Running the installer multiple times shows:

  • Master zones: [d, f, a, b, c] → first 3: d, f, a
  • ClusterAPI zones: [a, b, c, d, f] → first 3: a, b, c

The generated manifests confirm the mismatch:

# CAPI AWSMachine (cluster-api/machines/10_inframachine_*-master-0.yaml)
spec:
  subnet:
    filters:
    - name: tag:Name
      values:
      - cluster-subnet-private-us-east-1a  # Zone: a

# MAPI Machine (openshift/99_openshift-cluster-api_master-machines-0.yaml)
spec:
  providerSpec:
    value:
      placement:
        availabilityZone: us-east-1d  # Zone: d

File Naming Clarification

Important: The file 99_openshift-cluster-api_master-machines-*.yaml is misleading - despite having "cluster-api" in the name, it contains MAPI Machine objects (apiVersion: machine.openshift.io/v1beta1), not CAPI objects.

File Pattern Generated By Contains
openshift/99_openshift-cluster-api_master-machines-*.yaml Master.Generate MAPI Machine
openshift/99_openshift-machine-api_master-control-plane-machine-set.yaml Master.Generate ControlPlaneMachineSet
cluster-api/machines/10_inframachine_*-master-*.yaml ClusterAPI.Generate CAPI AWSMachine
cluster-api/machines/10_machine_*-master-*.yaml ClusterAPI.Generate CAPI Machine

Fix

The fix (implemented in PR #10188) is to sort the zones after calling UnsortedList():

func FilterZonesBasedOnInstanceType(...) ([]string, error) {
    // ...
    result := found[instanceType].Intersection(sets.New(zones...)).UnsortedList()
    slices.Sort(result)  // FIX: Sort to ensure deterministic order
    return result, nil
}

Alternatively, use sets.List() which returns a sorted slice.

Testing Scripts

Local Test Script

See hack/zone-test/debug-zone-check.sh - generates manifests and compares zones between:

  • CAPI: cluster-api/machines/10_inframachine_*-master-*.yaml (subnet filter zones)
  • MAPI: openshift/99_openshift-machine-api_master-control-plane-machine-set.yaml (failureDomains)

Go Simulation Test

See hack/zone-test/map_order.go - demonstrates that sets.UnsortedList() produces non-deterministic results:

  • Same Set, multiple iterations: different orders
  • Two independent Sets with same content: different orders
  • Simulated installer behavior: ~85% mismatch rate

Key Findings

  1. Go map iteration is intentionally non-deterministic - even within the same process, multiple iterations can produce different orders

  2. sets.UnsortedList() inherits this non-determinism - it simply iterates over the underlying map

  3. The bug is reproducible locally - no special AWS conditions needed, just multiple manifest generations

  4. Previous test scripts checked wrong files - comparing MAPI files to MAPI files always showed consistency; must compare CAPI files to MAPI files

References

Appendix: Affected Code Locations

  • pkg/asset/machines/master.go:251 - calls FilterZonesBasedOnInstanceType
  • pkg/asset/machines/clusterapi.go:156 - calls FilterZonesBasedOnInstanceType
  • pkg/asset/machines/worker.go:518 - calls FilterZonesBasedOnInstanceType
  • pkg/asset/machines/aws/instance_types.go:62 - UnsortedList() call

liweinan referenced this pull request in openshift/installer Jan 26, 2026
Part of CORS-3819, this focuses on helper funcs to query instance types
for machines.

The changes also replaced the deprecated use of sets.String with the
generic sets.Set[string] to satisfy golint.
@liweinan
Copy link
Contributor Author

The bug can be reproduced by the commit with local build: openshift/installer@4068682841

#!/bin/bash
# Verify zone consistency between CAPI and MAPI in locally generated manifests
# Used to validate CI script logic

set -e

WORK_DIR="${1:-/Users/weli/works/oc-swarm/installer/bin}"

echo "=========================================="
echo "OCPBUGS-69923 Local Manifests Verification"
echo "=========================================="
echo "Directory: $WORK_DIR"
echo ""

# Extract CAPI zones (from cluster-api/machines/10_inframachine_*-master-*.yaml)
echo "=========================================="
echo "CAPI Zones (from cluster-api/machines/10_inframachine_*-master-*.yaml)"
echo "=========================================="
capi_zones=""
for file in $(find "$WORK_DIR"/cluster-api/machines -name "10_inframachine_*-master-*.yaml" -type f 2>/dev/null | sort); do
    echo "File: $(basename $file)"
    # Extract subnet filter name
    subnet_name=$(yq eval '.spec.subnet.filters[0].values[0]' "$file" 2>/dev/null || echo "null")
    echo "  subnet filter: $subnet_name"
    # Extract zone from subnet name (e.g., "xxx-us-east-1a" -> "us-east-1a")
    zone=$(echo "$subnet_name" | grep -oE '[a-z]+-[a-z]+-[0-9][a-z]$' || echo "null")
    echo "  zone: $zone"
    if [[ -n "$zone" && "$zone" != "null" ]]; then
        capi_zones="$capi_zones $zone"
    fi
done
capi_zones=$(echo "$capi_zones" | xargs)
capi_count=$(echo "$capi_zones" | wc -w | tr -d ' ')
echo ""
echo "CAPI Zones result: [$capi_zones] (total: $capi_count)"

# Extract MAPI zones (from ControlPlaneMachineSet)
echo ""
echo "=========================================="
echo "MAPI Zones (from ControlPlaneMachineSet failureDomains)"
echo "=========================================="
mapi_zones=""
cpms_file="$WORK_DIR/openshift/99_openshift-machine-api_master-control-plane-machine-set.yaml"
if [[ -f "$cpms_file" ]]; then
    echo "File: $(basename $cpms_file)"
    echo "All failureDomains zones:"
    all_zones=$(yq eval '.spec.template.machines_v1beta1_machine_openshift_io.failureDomains.aws[].placement.availabilityZone' "$cpms_file" 2>/dev/null)
    idx=0
    for zone in $all_zones; do
        echo "  [$idx] $zone"
        if [[ "$zone" != "null" && -n "$zone" && $idx -lt $capi_count ]]; then
            mapi_zones="$mapi_zones $zone"
        fi
        idx=$((idx + 1))
    done
else
    echo "ERROR: ControlPlaneMachineSet file not found!"
fi
mapi_zones=$(echo "$mapi_zones" | xargs)
echo ""
echo "MAPI Zones result (first $capi_count): [$mapi_zones]"

# Compare
echo ""
echo "=========================================="
echo "Comparison Result"
echo "=========================================="
echo "CAPI: [$capi_zones]"
echo "MAPI: [$mapi_zones]"
echo ""

if [[ "$capi_zones" == "$mapi_zones" ]]; then
    echo "✅ CONSISTENT - CAPI and MAPI zones match"
    echo ""
    echo "Note: Current installer version did not trigger the bug, or this run happened to be consistent"
    exit 0
else
    echo "❌ INCONSISTENT - Zone mismatch detected! (OCPBUGS-69923)"
    echo ""
    echo "This indicates the bug is present in the current installer version!"
    echo "Fix: PR #10188 adds slices.Sort() after UnsortedList() calls"
    exit 1
fi
==========================================
OCPBUGS-69923 Local Manifests Verification
==========================================
Directory: /Users/weli/works/oc-swarm/installer/bin

==========================================
CAPI Zones (from cluster-api/machines/10_inframachine_*-master-*.yaml)
==========================================
File: 10_inframachine_weli-test-trft6-master-0.yaml
  subnet filter: weli-test-trft6-subnet-private-us-east-1f
  zone: us-east-1f
File: 10_inframachine_weli-test-trft6-master-1.yaml
  subnet filter: weli-test-trft6-subnet-private-us-east-1a
  zone: us-east-1a
File: 10_inframachine_weli-test-trft6-master-2.yaml
  subnet filter: weli-test-trft6-subnet-private-us-east-1b
  zone: us-east-1b

CAPI Zones result: [us-east-1f us-east-1a us-east-1b] (total: 3)

==========================================
MAPI Zones (from ControlPlaneMachineSet failureDomains)
==========================================
File: 99_openshift-machine-api_master-control-plane-machine-set.yaml
All failureDomains zones:
  [0] us-east-1d
  [1] us-east-1f
  [2] us-east-1a
  [3] us-east-1b
  [4] us-east-1c

MAPI Zones result (first 3): [us-east-1d us-east-1f us-east-1a]

==========================================
Comparison Result
==========================================
CAPI: [us-east-1f us-east-1a us-east-1b]
MAPI: [us-east-1d us-east-1f us-east-1a]

❌ INCONSISTENT - Zone mismatch detected! (OCPBUGS-69923)

This indicates the bug is present in the current installer version!
Fix: PR #10188 adds slices.Sort() after UnsortedList() calls

…I files

Previous script compared MAPI files to MAPI files (wrong).
Now correctly compares:
- CAPI: cluster-api/machines/10_inframachine_*-master-*.yaml (subnet filter)
- MAPI: openshift/99_openshift-machine-api_master-control-plane-machine-set.yaml (failureDomains)

Note: openshift/99_openshift-cluster-api_master-machines-*.yaml is MAPI despite the name!
@liweinan liweinan force-pushed the OCPBUGS-69923-remake branch from 7859348 to ea76f19 Compare January 26, 2026 15:53
@openshift-ci-robot
Copy link
Contributor

[REHEARSALNOTIFIER]
@liweinan: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-aws-ipi-zone-consistency-f14 N/A periodic Periodic changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@liweinan
Copy link
Contributor Author

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-aws-ipi-zone-consistency-f14

@openshift-ci-robot
Copy link
Contributor

@liweinan: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@liweinan
Copy link
Contributor Author

liweinan commented Jan 26, 2026

@yunjiang29, I have reproduced the problem on the older relative installer commit as described above, and the occurrence rate of the problem is around 80%, so I set 10 rounds for testing in the CI job, and run the test job without any problem.

Could you please help to review this PR? Thanks!

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 26, 2026

@liweinan: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@yunjiang29
Copy link
Contributor

/assign @yunjiang29

@liweinan
Copy link
Contributor Author

Verification process for 4.19 and older releases: https://gist.github.com/liweinan/1c16388052387032931c1846035d8052

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants