[FEATURE] Add CloudWatch Monitoring for Repeated Container Creation in Python version of ror-reconciler #362

@adambuttrick

Description

Summary

Add CloudWatch monitoring to detect and alert on abnormal container lifecycle patterns for the Python version of the ROR Reconciler service, specifically repeated container creation events that indicate crash loops, OOM kills, or deployment instability. The current deployment has no container-level observability beyond basic Docker logs.

Acceptance criteria

  • Rules capture ECS task stop events for the reconciler service
  • CloudWatch metrics published for container restarts, OOM kills, and task creation rate
  • CloudWatch alarms notify Slack when predefined error thresholds are reached
  • Alarms validated in dev/staging environment before production deployment
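
As a starting point for the first two criteria, here is a minimal sketch, assuming the task stop events are captured with an EventBridge rule and then turned into metrics by a small handler. The service group `service:ror-reconciler`, the metric names, and both function names are hypothetical placeholders, not existing code in the reconciler. The event fields used (`lastStatus`, `group`, `clusterArn`, `containers[].exitCode`, `containers[].reason`) follow the ECS "Task State Change" event shape.

```python
import json

# Hypothetical: the real ECS service group is not stated in this issue.
SERVICE_GROUP = "service:ror-reconciler"


def task_stopped_pattern(cluster_arn: str, group: str = SERVICE_GROUP) -> str:
    """Return an EventBridge event pattern (as JSON) that matches
    STOPPED tasks belonging to one ECS service in one cluster."""
    pattern = {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {
            "clusterArn": [cluster_arn],
            "group": [group],
            "lastStatus": ["STOPPED"],
        },
    }
    return json.dumps(pattern)


def classify_stop(event: dict) -> list[str]:
    """Map one task-stop event to the metric names to increment.

    The metric names are illustrative assumptions. A container whose
    stop reason mentions OutOfMemory counts as an OOM kill; any other
    non-zero exit code counts as a crash.
    """
    detail = event.get("detail", {})
    metrics = ["TaskStopped"]
    for container in detail.get("containers", []):
        reason = container.get("reason") or ""
        exit_code = container.get("exitCode")
        if "OutOfMemory" in reason:
            metrics.append("ContainerOOMKilled")
        elif exit_code not in (None, 0):
            metrics.append("ContainerCrashed")
    return metrics
```

The returned metric names could then be published as CloudWatch custom metrics, with alarms on their rate covering the repeated-creation case. Thresholds are left out here because the issue does not define them yet.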

Metadata

Assignees

No one assigned

    Labels

    No labels

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
