Skip to content

Comments

fix: [AAP-66287] add queue affinity guard to prevent cross-node monitor race#1485

Merged
hsong-rh merged 1 commit intoansible:mainfrom
hsong-rh:aap-66287
Feb 24, 2026
Merged

fix: [AAP-66287] add queue affinity guard to prevent cross-node monitor race#1485
hsong-rh merged 1 commit intoansible:mainfrom
hsong-rh:aap-66287

Conversation

@hsong-rh
Copy link
Contributor

@hsong-rh hsong-rh commented Feb 23, 2026

https://issues.redhat.com/browse/AAP-66287

Summary

  • Adds a queue affinity guard in _manage_no_lock() that compares the activation's RulebookProcessQueue.queue_name to the local RULEBOOK_QUEUE_NAME before calling monitor(), preventing cross-node container checks that caused false activation kills
  • Enhances "Missing container" log message in activation_manager.py to include pod_id with null safety for easier debugging
  • Adds 4 new unit tests covering wrong queue skip, matching queue proceed, ValueError fallback, and None fallback scenarios

Problem

In a 2-node EDA deployment, stale _manage tasks could be dequeued on the wrong node. The monitor would check the local podman for a container running on a different node, fail to find it, and trigger _missing_container_policy() to falsely kill the activation. Production logs showed 431 false "Missing container for running activation" events with 100% cross-node false positives.

Root Cause

  • _manage_no_lock() lacked queue affinity check before monitor() call
  • unique_enqueue() has no dedup guarantee across nodes
  • Scheduler re-dispatches every 5s to random queues
  • _missing_container_policy() kills activations when container not found locally

Verification

  • Local 2-node docker-compose environment with 100 concurrent activations (max_workers=1)
  • 36 cross-node monitors correctly skipped, 0 false "Missing container" events
  • Long-running rulebook test also passed with zero cross-node errors

Test plan

  • 4 new unit tests added and passing
  • Existing test_manage_monitor_called_with_no_requests updated to mock queue affinity components
  • Local 2-node verification with concurrent activations
  • CI pipeline validation

🤖 Generated with Claude Code

@hsong-rh hsong-rh requested a review from a team as a code owner February 23, 2026 13:18
@codecov-commenter
Copy link

codecov-commenter commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.52%. Comparing base (34a0afa) to head (73a28d2).

@@           Coverage Diff           @@
##             main    #1485   +/-   ##
=======================================
  Coverage   91.51%   91.52%           
=======================================
  Files         235      235           
  Lines       10135    10145   +10     
=======================================
+ Hits         9275     9285   +10     
  Misses        860      860           
Flag Coverage Δ
unit-int-tests-3.11 91.52% <100.00%> (+<0.01%) ⬆️
unit-int-tests-3.12 91.52% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../aap_eda/services/activation/activation_manager.py 62.34% <100.00%> (ø)
src/aap_eda/tasks/orchestrator.py 74.75% <100.00%> (+1.31%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hsong-rh
Copy link
Contributor Author

/run-e2e

@hsong-rh
Copy link
Contributor Author

…or race

In multi-node deployments, stale _manage tasks could dequeue on a
different node than where the activation's container was running.
The monitor would check the local podman, fail to find the container,
and falsely mark the activation as FAILED.

Add a guard in _manage_no_lock() that compares the activation's
RulebookProcessQueue.queue_name to the local RULEBOOK_QUEUE_NAME
before running monitor(). If they differ, the monitor is skipped.

Also add pod_id to the "Missing container" log message for easier
debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sonarqubecloud
Copy link

@hsong-rh hsong-rh merged commit 1da857b into ansible:main Feb 24, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants