fix: [AAP-66287] add queue affinity guard to prevent cross-node monitor race #1485
Merged
hsong-rh merged 1 commit into ansible:main on Feb 24, 2026
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| | main | #1485 | +/- |
|---|---|---|---|
| Coverage | 91.51% | 91.52% | |
| Files | 235 | 235 | |
| Lines | 10135 | 10145 | +10 |
| Hits | 9275 | 9285 | +10 |
| Misses | 860 | 860 | |
/run-e2e

Green e2e test: https://github.com/ansible/eda-server/actions/runs/22308309865
…or race

In multi-node deployments, stale _manage tasks could dequeue on a different node than where the activation's container was running. The monitor would check the local podman, fail to find the container, and falsely mark the activation as FAILED.

Add a guard in _manage_no_lock() that compares the activation's RulebookProcessQueue.queue_name to the local RULEBOOK_QUEUE_NAME before running monitor(). If they differ, the monitor is skipped.

Also add pod_id to the "Missing container" log message for easier debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
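The guard described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Activation` dataclass, `should_monitor()`, and `manage_no_lock()` are hypothetical stand-ins, the queue names are made up, and the handling of an activation with no recorded queue is an assumption.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger(__name__)

# Assumption: stands in for the node-local RULEBOOK_QUEUE_NAME setting.
RULEBOOK_QUEUE_NAME = "activation-node-1"


@dataclass
class Activation:
    """Hypothetical stand-in for an activation record."""

    id: int
    # Mirrors RulebookProcessQueue.queue_name; None if no queue is recorded.
    queue_name: Optional[str]


def should_monitor(
    activation: Activation, local_queue: str = RULEBOOK_QUEUE_NAME
) -> bool:
    """Return True only if the activation belongs to this node's queue."""
    if activation.queue_name is None:
        # Assumption: with no recorded queue there is no affinity to enforce.
        return True
    return activation.queue_name == local_queue


def manage_no_lock(
    activation: Activation,
    monitor: Callable[[Activation], None],
    local_queue: str = RULEBOOK_QUEUE_NAME,
) -> bool:
    """Run monitor() only on the owning node; report whether it ran."""
    if not should_monitor(activation, local_queue):
        logger.info(
            "Skipping monitor for activation %s: queue %s is not local queue %s",
            activation.id,
            activation.queue_name,
            local_queue,
        )
        return False
    monitor(activation)
    return True
```

With this shape, a stale task dequeued on the wrong node returns early instead of inspecting a podman instance that never hosted the container.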
ttuffin approved these changes on Feb 23, 2026
kaiokmo approved these changes on Feb 24, 2026
https://issues.redhat.com/browse/AAP-66287
Summary
- Add a guard in `_manage_no_lock()` that compares the activation's `RulebookProcessQueue.queue_name` to the local `RULEBOOK_QUEUE_NAME` before calling `monitor()`, preventing cross-node container checks that caused false activation kills
- Update the "Missing container" log message in `activation_manager.py` to include `pod_id` with null safety for easier debugging

Problem
In a 2-node EDA deployment, stale `_manage` tasks could be dequeued on the wrong node. The monitor would check the local podman for a container running on a different node, fail to find it, and trigger `_missing_container_policy()` to falsely kill the activation. Production logs showed 431 false "Missing container for running activation" events, 100% of them cross-node false positives.

Root Cause
- `_manage_no_lock()` lacked a queue affinity check before the `monitor()` call
- `unique_enqueue()` has no dedup guarantee across nodes
- `_missing_container_policy()` kills activations when the container is not found locally

Verification

Test plan
- `test_manage_monitor_called_with_no_requests` updated to mock queue affinity components

🤖 Generated with Claude Code
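To make the failure mode and the test plan concrete, here is a small self-contained simulation. It is a sketch under stated assumptions, not the project's test suite: `Activation`, `monitor()`, `manage()`, and `missing_container_message()` are hypothetical simplifications, the node and pod names are invented, and the exact log format is assumed. It uses `unittest.mock.Mock` to check that the monitor is not invoked on the wrong node.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set
from unittest.mock import Mock


@dataclass
class Activation:
    """Hypothetical stand-in for an activation and its queue binding."""

    id: int
    queue_name: str
    pod_id: Optional[str] = None
    status: str = "running"


def missing_container_message(activation: Activation) -> str:
    """Build the log line; pod_id may be None, so use a placeholder."""
    pod_id = activation.pod_id or "unknown"  # assumed null-safe fallback
    return (
        f"Missing container for running activation "
        f"id={activation.id} pod_id={pod_id}"
    )


def monitor(activation: Activation, local_containers: Set[str]) -> None:
    """Simplified monitor: fail the activation if its pod is not local."""
    if activation.pod_id not in local_containers:
        activation.status = "failed"


def manage(
    activation: Activation,
    local_queue: str,
    local_containers: Set[str],
    guard: bool = True,
    monitor_fn: Callable[[Activation, Set[str]], None] = monitor,
) -> None:
    """Run the monitor, optionally skipping it when the queue is not local."""
    if guard and activation.queue_name != local_queue:
        return  # task was dequeued on the wrong node; leave the activation alone
    monitor_fn(activation, local_containers)


def test_unguarded_cross_node_check_fails_activation() -> None:
    # Container runs on node-a, but the stale task is processed on node-b.
    activation = Activation(1, "node-a", pod_id="pod-1")
    manage(activation, "node-b", local_containers=set(), guard=False)
    assert activation.status == "failed"  # the false positive


def test_guard_skips_monitor_on_wrong_node() -> None:
    activation = Activation(2, "node-a", pod_id="pod-2")
    fake_monitor = Mock()
    manage(activation, "node-b", set(), guard=True, monitor_fn=fake_monitor)
    fake_monitor.assert_not_called()
    assert activation.status == "running"


def test_guard_allows_monitor_on_owning_node() -> None:
    activation = Activation(3, "node-a", pod_id="pod-3")
    fake_monitor = Mock()
    manage(activation, "node-a", {"pod-3"}, guard=True, monitor_fn=fake_monitor)
    fake_monitor.assert_called_once_with(activation, {"pod-3"})
```

The three test functions follow pytest naming, so they could be collected by pytest or called directly.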