Conversation


@cardil cardil commented Feb 7, 2026

Problem

The deployment-upgrade-failure test occasionally leaves Route resources stuck with finalizers because the metadata-webhook service is deleted before the Route can be processed by the webhook.

When the webhook service is unavailable, the MutatingWebhookConfiguration still intercepts Route deletion requests, causing a timeout and leaving the Route stuck.

Root Cause

Race condition during namespace cleanup:

  1. Tests complete and cleanup runs kubectl delete ns serving-tests
  2. Namespace deletion starts, triggering Route finalizer removal
  3. Route finalizer removal triggers the MutatingWebhookConfiguration (cluster-scoped)
  4. Webhook service (in serving-tests namespace) is already being deleted
  5. Webhook call times out → namespace deletion hangs

The MutatingWebhookConfiguration is cluster-scoped and persists even after namespace deletion starts.
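A stuck Route from this race can be spotted by its lingering finalizers. A minimal diagnostic sketch (the helper name is illustrative; `KUBECTL` is overridable so the sketch can be exercised without a cluster):

```shell
# KUBECTL is overridable for dry runs; defaults to the real binary.
KUBECTL="${KUBECTL:-kubectl}"

stuck_routes() {
  # Print each Route's name and its finalizers in the given namespace.
  # A non-empty finalizer list on a terminating Route indicates it is
  # blocked waiting on the (unavailable) webhook.
  "$KUBECTL" get route -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.finalizers}{"\n"}{end}'
}
```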

Solution

Delete the metadata-webhook resources (including the cluster-scoped MutatingWebhookConfiguration) before deleting the serving-tests namespace. This mirrors the installation order:

  • Install: oc apply -f serving/metadata-webhook/config → create namespace
  • Cleanup: delete webhook resources → delete namespace

Changes:

  • Added webhook cleanup in test/lib.bash before namespace deletion
  • Cleanup uses oc delete -f serving/metadata-webhook/config to mirror installation
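The ordering above can be sketched as a bash function (the function name is an illustrative assumption, not the actual name used in test/lib.bash; `OC` is overridable for dry runs):

```shell
# OC is overridable so the sketch can be exercised without a cluster.
OC="${OC:-oc}"

cleanup_metadata_webhook() {
  # 1. Remove the webhook resources first, including the cluster-scoped
  #    MutatingWebhookConfiguration, so deletions are no longer
  #    intercepted by a webhook whose backing service is going away.
  "$OC" delete --ignore-not-found -f serving/metadata-webhook/config
  # 2. Only then delete the test namespace; Route finalizer removal now
  #    proceeds without calling the webhook.
  "$OC" delete --ignore-not-found ns serving-tests
}
```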

Testing

This fix ensures the webhook configuration is removed before any Route finalizers are processed during namespace deletion, preventing the timeout.


Assisted-by: 🤖 Claude Opus/Sonnet 4.5

Add namespaceSelector to the MutatingWebhookConfiguration to limit
the webhook's scope to namespaces with the samples.knative.dev/release
label. This prevents the webhook from blocking resource deletions in
other namespaces when the serving-tests namespace is torn down.

The issue occurred during upgrade test cleanup where the Route resource
for deployment-upgrade-failure could not be deleted because the webhook
service was unavailable after namespace cleanup started.

Assisted-by: 🤖 Claude Opus/Sonnet 4.5
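The namespaceSelector described in the commit message above could look roughly like this. The configuration and webhook names are illustrative assumptions, as is the use of the `Exists` operator; only the namespaceSelector stanza reflects the described change:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: metadata-webhook  # assumed name
webhooks:
  - name: metadata-webhook.samples.knative.dev  # assumed name
    namespaceSelector:
      matchExpressions:
        - key: samples.knative.dev/release
          operator: Exists
```

With this selector in place, the webhook only intercepts requests in labeled test namespaces, so teardown in other namespaces cannot be blocked even if the webhook service is gone.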

cardil commented Feb 7, 2026

/cherrypick main
/cherrypick release-1.38

@openshift-cherrypick-robot

@cardil: once the present PR merges, I will cherry-pick it on top of main, release-1.38 in new PRs and assign them to you.


In response to this:

/cherrypick main
/cherrypick release-1.38

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


cardil commented Feb 7, 2026

/retest

Delete webhook resources before namespace deletion to prevent blocking
Route finalizer removal when webhook service is unavailable.

The issue occurs when:
1. Tests complete and cleanup starts
2. 'kubectl delete ns serving-tests' begins namespace deletion
3. Routes have finalizers that need removal
4. Finalizer removal triggers the MutatingWebhookConfiguration
5. Webhook service (in serving-tests) is already being deleted
6. Webhook call times out, blocking namespace deletion

Solution: Delete the webhook resources (including the cluster-scoped
MutatingWebhookConfiguration) before deleting the serving-tests namespace.
This mirrors the installation order and prevents the race condition.
The webhook config directory includes 100-namespace.yaml, so deleting
the directory's resources also deletes the serving-tests namespace.
Adding --ignore-not-found prevents an error when the namespace is
already gone by the time the explicit namespace deletion runs.

cardil commented Feb 9, 2026

I'll rerun this one to increase confidence that the fix is real and not a pass by chance:

/test 420-mesh-upgrade


cardil commented Feb 9, 2026

This was an unrelated infra failure: 2020862032168882176

/test 420-mesh-upgrade


cardil commented Feb 9, 2026

/assign @maschmid


maschmid commented Feb 9, 2026

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm label Feb 9, 2026

openshift-ci bot commented Feb 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cardil, maschmid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Feb 9, 2026
@openshift-merge-bot openshift-merge-bot bot merged commit ccf08fd into openshift-knative:release-1.37 Feb 9, 2026
20 checks passed
@openshift-cherrypick-robot

@cardil: new pull request created: #3974


In response to this:

/cherrypick main
/cherrypick release-1.38


@openshift-cherrypick-robot

@cardil: cannot checkout release-1.38: error checking out "release-1.38": exit status 1 error: pathspec 'release-1.38' did not match any file(s) known to git


In response to this:

/cherrypick main
/cherrypick release-1.38


@cardil cardil deleted the bugfix/1.37/metadata-webhook-cleanup branch February 9, 2026 17:54