fix: [AAP-66287] add queue affinity guard to prevent cross-node monitor race by hsong-rh · Pull Request #1485 · ansible/eda-server

hsong-rh · 2026-02-23T13:18:26Z

https://issues.redhat.com/browse/AAP-66287

Summary

Adds a queue affinity guard in _manage_no_lock() that compares the activation's RulebookProcessQueue.queue_name to the local RULEBOOK_QUEUE_NAME before calling monitor(), preventing cross-node container checks that caused false activation kills
Enhances "Missing container" log message in activation_manager.py to include pod_id with null safety for easier debugging
Adds 4 new unit tests covering wrong queue skip, matching queue proceed, ValueError fallback, and None fallback scenarios
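
For readers without the diff open, the guard described above might look roughly like the sketch below. It is not the merged code: `FakeProcessQueue`, `should_monitor_locally`, and `LOCAL_QUEUE_NAME` are hypothetical stand-ins, and the fallback behavior (proceed with the monitor when the queue lookup raises ValueError or returns None) is an assumption inferred from the test names in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for the node-local RULEBOOK_QUEUE_NAME setting.
LOCAL_QUEUE_NAME = "activation-node1"


@dataclass
class FakeProcessQueue:
    """Stand-in for a RulebookProcessQueue row; only queue_name is modeled."""

    queue_name: str


def should_monitor_locally(
    lookup_queue: Callable[[], Optional[FakeProcessQueue]],
    local_queue_name: str = LOCAL_QUEUE_NAME,
) -> bool:
    """Return False only when the activation is known to belong to another queue."""
    try:
        process_queue = lookup_queue()
    except ValueError:
        # Assumed fallback: the lookup failed, so keep the pre-guard behavior.
        return True
    if process_queue is None:
        # Assumed fallback: no queue is recorded, so keep the pre-guard behavior.
        return True
    # Monitor only when the activation's queue matches this node's queue.
    return process_queue.queue_name == local_queue_name
```

Under these assumptions, a caller would run monitor() only when `should_monitor_locally(...)` returns True, so an unknown queue never blocks monitoring and only a confirmed mismatch skips it.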

Problem

In a 2-node EDA deployment, stale _manage tasks could be dequeued on the wrong node. The monitor would check the local podman for a container running on a different node, fail to find it, and trigger _missing_container_policy() to falsely kill the activation. Production logs showed 431 false "Missing container for running activation" events with 100% cross-node false positives.

Root Cause

_manage_no_lock() lacked a queue affinity check before the monitor() call
unique_enqueue() has no dedup guarantee across nodes
Scheduler re-dispatches every 5s to random queues
_missing_container_policy() kills activations when the container is not found locally
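
The interaction of these four causes can be illustrated with a toy model. This is only a sketch: `Node`, `monitor`, `run`, and the queue names are hypothetical and do not correspond to eda-server code, and the dispatch order is fixed rather than random so the outcome is reproducible.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy node: a queue name plus the set of containers running locally."""

    queue_name: str
    containers: set = field(default_factory=set)


def monitor(node: Node, activation_id: str, owner_queue: str, guard: bool) -> str:
    """Return 'skipped', 'ok', or 'missing' for one monitor check on a node."""
    if guard and owner_queue != node.queue_name:
        # Queue affinity guard: this node does not own the activation.
        return "skipped"
    # Without the guard, a node can only see its own local containers.
    return "ok" if activation_id in node.containers else "missing"


node_a = Node("queue-a", {"act-1"})  # the container actually runs here
node_b = Node("queue-b")             # no container for act-1 on this node

# Fixed dispatch order standing in for the scheduler's random re-dispatch.
dispatch = [node_a, node_b, node_b, node_a, node_b]


def run(guard: bool) -> list:
    """Run one monitor check per dispatched task and collect the outcomes."""
    return [monitor(node, "act-1", "queue-a", guard) for node in dispatch]
```

In this model, each "missing" result corresponds to a check that would have reached _missing_container_policy(). With the guard disabled, every task dispatched to node_b reports "missing"; with it enabled, those same tasks report "skipped".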

Verification

Local 2-node docker-compose environment with 100 concurrent activations (max_workers=1)
36 cross-node monitors correctly skipped, 0 false "Missing container" events
Long-running rulebook test also passed with zero cross-node errors

Test plan

4 new unit tests added and passing
Existing test_manage_monitor_called_with_no_requests updated to mock queue affinity components
Local 2-node verification with concurrent activations
CI pipeline validation
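
As a rough illustration of the four scenarios in the test plan, the sketch below uses unittest.mock against a self-contained stand-in. `manage_no_lock`, `LOCAL_QUEUE`, and the test names are hypothetical; the real tests mock eda-server internals that are not shown in this PR description, and the fallback behavior is an assumption.

```python
import unittest
from unittest.mock import Mock

LOCAL_QUEUE = "queue-a"  # hypothetical local RULEBOOK_QUEUE_NAME


def manage_no_lock(activation_id, get_queue_name, monitor):
    """Hypothetical stand-in for the guarded part of _manage_no_lock()."""
    try:
        owner = get_queue_name(activation_id)
    except ValueError:
        owner = None  # assumed fallback: treat a failed lookup as unknown
    if owner is not None and owner != LOCAL_QUEUE:
        return  # wrong queue: skip the monitor
    monitor(activation_id)


class QueueAffinityGuardTests(unittest.TestCase):
    def _run(self, **lookup_kwargs):
        """Call the stand-in with a mocked lookup and return the monitor mock."""
        monitor = Mock()
        manage_no_lock(1, Mock(**lookup_kwargs), monitor)
        return monitor

    def test_wrong_queue_skips_monitor(self):
        self._run(return_value="queue-b").assert_not_called()

    def test_matching_queue_runs_monitor(self):
        self._run(return_value="queue-a").assert_called_once_with(1)

    def test_value_error_falls_back_to_monitor(self):
        self._run(side_effect=ValueError("no queue")).assert_called_once_with(1)

    def test_none_falls_back_to_monitor(self):
        self._run(return_value=None).assert_called_once_with(1)
```

Passing the lookup and the monitor in as arguments keeps the sketch free of patching; the real tests presumably patch module-level names instead.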

🤖 Generated with Claude Code

codecov-commenter · 2026-02-23T13:31:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.52%. Comparing base (34a0afa) to head (73a28d2).

@@           Coverage Diff           @@
##             main    #1485   +/-   ##
=======================================
  Coverage   91.51%   91.52%           
=======================================
  Files         235      235           
  Lines       10135    10145   +10     
=======================================
+ Hits         9275     9285   +10     
  Misses        860      860

Flag	Coverage Δ
unit-int-tests-3.11	`91.52% <100.00%> (+<0.01%)`	⬆️
unit-int-tests-3.12	`91.52% <100.00%> (+<0.01%)`	⬆️

Files with missing lines	Coverage Δ
.../aap_eda/services/activation/activation_manager.py	`62.34% <100.00%> (ø)`
src/aap_eda/tasks/orchestrator.py	`74.75% <100.00%> (+1.31%)`	⬆️

hsong-rh · 2026-02-23T13:33:01Z

/run-e2e

hsong-rh · 2026-02-23T15:16:47Z

Green e2e test: https://github.com/ansible/eda-server/actions/runs/22308309865

…or race

In multi-node deployments, stale _manage tasks could dequeue on a different node than where the activation's container was running. The monitor would check the local podman, fail to find the container, and falsely mark the activation as FAILED.

Add a guard in _manage_no_lock() that compares the activation's RulebookProcessQueue.queue_name to the local RULEBOOK_QUEUE_NAME before running monitor(). If they differ, the monitor is skipped.

Also add pod_id to the "Missing container" log message for easier debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-02-23T15:30:07Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

hsong-rh requested a review from a team as a code owner February 23, 2026 13:18

hsong-rh force-pushed the aap-66287 branch from cb0bb50 to 73a28d2 Compare February 23, 2026 15:17

hsong-rh requested review from andresberejnoi, kaiokmo, mkanoor and ttuffin February 23, 2026 15:18

ttuffin approved these changes Feb 23, 2026

kaiokmo approved these changes Feb 24, 2026

hsong-rh merged commit 1da857b into ansible:main Feb 24, 2026
6 checks passed

hsong-rh merged 1 commit into ansible:main from hsong-rh:aap-66287
