Fix metadata-webhook cleanup race condition #3973
Add namespaceSelector to the MutatingWebhookConfiguration to limit the webhook's scope to namespaces with the samples.knative.dev/release label. This prevents the webhook from blocking resource deletions in other namespaces when the serving-tests namespace is torn down. The issue occurred during upgrade test cleanup where the Route resource for deployment-upgrade-failure could not be deleted because the webhook service was unavailable after namespace cleanup started. Assisted-by: 🤖 Claude Opus/Sonnet 4.5
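A minimal sketch of what such a selector could look like, written as a JSON Patch that `kubectl patch` can apply. The webhook configuration name, the webhook index `0`, and the choice of an `Exists` match on the label are assumptions for illustration; the actual manifest in this PR may express the selector differently.

```shell
# Hypothetical name: the real MutatingWebhookConfiguration name is not given here.
WEBHOOK_CONFIG="${WEBHOOK_CONFIG:-example-metadata-webhook}"

namespace_selector_patch() {
  # JSON Patch that adds a namespaceSelector to the first webhook entry, so the
  # webhook is only called for namespaces carrying the
  # samples.knative.dev/release label (any value; "Exists" is an assumption).
  printf '%s' '[{"op":"add","path":"/webhooks/0/namespaceSelector","value":{"matchExpressions":[{"key":"samples.knative.dev/release","operator":"Exists"}]}}]'
}

# Usage (requires cluster access, so it is left commented out):
# kubectl patch mutatingwebhookconfiguration "$WEBHOOK_CONFIG" \
#   --type=json -p "$(namespace_selector_patch)"
```

With a selector like this, requests for resources in unlabeled namespaces never reach the webhook, so an unavailable webhook service cannot block them.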
/cherrypick main
@cardil: once the present PR merges, I will cherry-pick it on top of
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest
Delete webhook resources before namespace deletion to prevent blocking Route finalizer removal when webhook service is unavailable.
The issue occurs when:
1. Tests complete and cleanup starts
2. 'kubectl delete ns serving-tests' begins namespace deletion
3. Routes have finalizers that need removal
4. Finalizer removal triggers the MutatingWebhookConfiguration
5. Webhook service (in serving-tests) is already being deleted
6. Webhook call times out, blocking namespace deletion
Solution: Delete the webhook resources (including the cluster-scoped MutatingWebhookConfiguration) before deleting the serving-tests namespace. This mirrors the installation order and prevents the race condition.
The webhook config directory includes 100-namespace.yaml, which defines the serving-tests namespace, so deleting the resources in that directory also deletes the namespace. Adding --ignore-not-found prevents the error when the namespace is already deleted.
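The cleanup order described in these two commits can be sketched as follows. This is an illustration under stated assumptions, not the actual code in test/lib.bash: the function name `teardown_serving_tests` and the `OC` override variable are hypothetical, and placing `--ignore-not-found` on both calls is a guess at the simplest safe form.

```shell
# Hypothetical override so the CLI binary can be swapped (for example in a dry run).
OC="${OC:-oc}"

teardown_serving_tests() {
  # 1. Remove the webhook resources first, including the cluster-scoped
  #    MutatingWebhookConfiguration, so nothing intercepts Route updates
  #    while finalizers are being removed.
  "$OC" delete -f serving/metadata-webhook/config --ignore-not-found || return 1

  # 2. Delete the namespace afterwards. The config directory contains
  #    100-namespace.yaml, so the namespace may already be gone;
  #    --ignore-not-found keeps this call from failing in that case.
  "$OC" delete namespace serving-tests --ignore-not-found
}
```

Returning early when the first delete fails keeps the namespace in place, which avoids recreating the original situation where the webhook configuration outlives its service.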
I'll rerun this one to raise confidence that the fix is real and not a fix-by-chance:
/test 420-mesh-upgrade
This was an unrelated infra failure: 2020862032168882176
/test 420-mesh-upgrade
/assign @maschmid
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cardil, maschmid
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Merged commit ccf08fd into openshift-knative:release-1.37
@cardil: new pull request created: #3974
@cardil: cannot checkout
Problem
The deployment-upgrade-failure test occasionally leaves Route resources stuck with finalizers because the metadata-webhook service is deleted before the webhook can process the Route. When the webhook service is unavailable, the MutatingWebhookConfiguration still intercepts Route deletion requests, causing a timeout and leaving the Route stuck.
Root Cause
Race condition during namespace cleanup: kubectl delete ns serving-tests starts tearing down the namespace, including the webhook service, while Routes still have finalizers to remove. The MutatingWebhookConfiguration is cluster-scoped and persists even after namespace deletion starts.
Solution
Delete the metadata-webhook resources (including the cluster-scoped MutatingWebhookConfiguration) before deleting the serving-tests namespace. This mirrors the installation order:
oc apply -f serving/metadata-webhook/config → create namespace
Changes:
- test/lib.bash: delete the webhook resources before namespace deletion
- oc delete -f serving/metadata-webhook/config to mirror installation
Testing
This fix ensures the webhook configuration is removed before any Route finalizers are processed during namespace deletion, preventing the timeout.
Assisted-by: 🤖 Claude Opus/Sonnet 4.5