feat(connectivity_check): add automated monitoring with CloudWatch metrics and alarms #79

sai-praveen-os · 2026-01-19T11:15:35Z

Add Monitoring Features to connectivity_check Module

Problem

The connectivity_check module currently supports on-demand connectivity testing via script invocation. We need automated scheduled monitoring with CloudWatch metrics and alarms for proactive alerting.

Solution

Extend the module with optional monitoring capabilities while maintaining backward compatibility.

Changes

monitoring.tf (new):

EventBridge rule for scheduled Lambda invocation (configurable, default: 5 min)
CloudWatch alarms for endpoint failures (per-critical-endpoint + aggregate)
CloudWatch alarm for Lambda execution errors
IAM permissions for CloudWatch metrics and EventBridge

lambda/handler.ts:

Publish CloudWatch metrics: EndpointConnectivity (1=up, 0=down), EndpointLatency (ms)
Metrics include dimensions: FunctionName, Endpoint, Critical
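As a rough illustration of the metric shape described above, here is a minimal sketch of how check results could be mapped to metric data. It is not the handler's actual code: the `CheckResult` shape, the `buildMetricData` name, and the `host:port` format for the Endpoint dimension are assumptions, and the `Dimension`/`MetricDatum` interfaces are local stand-ins for the types that @aws-sdk/client-cloudwatch would provide.

```typescript
// Local stand-ins for the CloudWatch metric types (assumption: the real
// handler would import these from @aws-sdk/client-cloudwatch instead).
interface Dimension {
  Name: string;
  Value: string;
}

interface MetricDatum {
  MetricName: string;
  Dimensions: Dimension[];
  Value: number;
  Unit: "Count" | "Milliseconds";
}

// Hypothetical result shape; the real handler's fields may differ.
interface CheckResult {
  host: string;
  port: number;
  success: boolean;
  latencyMs: number;
  critical: boolean;
}

// Builds two data points per endpoint: EndpointConnectivity (1=up, 0=down)
// and EndpointLatency (ms), both carrying the same three dimensions.
function buildMetricData(functionName: string, results: CheckResult[]): MetricDatum[] {
  const data: MetricDatum[] = [];
  for (const r of results) {
    const dimensions: Dimension[] = [
      { Name: "FunctionName", Value: functionName },
      // Assumption: the endpoint is identified as "host:port".
      { Name: "Endpoint", Value: `${r.host}:${r.port}` },
      { Name: "Critical", Value: String(r.critical) },
    ];
    data.push({
      MetricName: "EndpointConnectivity",
      Dimensions: dimensions,
      Value: r.success ? 1 : 0,
      Unit: "Count",
    });
    data.push({
      MetricName: "EndpointLatency",
      Dimensions: dimensions,
      Value: r.latencyMs,
      Unit: "Milliseconds",
    });
  }
  return data;
}
```

In the real handler, the resulting array would presumably be sent with a PutMetricData call under the configured namespace.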

lambda/package.json:

Added @aws-sdk/client-cloudwatch dependency

variables.tf:

Added monitoring configuration variables: enable_monitoring, monitoring_schedule, monitoring_targets, cloudwatch_namespace, alarm_sns_topic_arns, alarm_evaluation_periods

Usage Example

module "connectivity_monitor" {
  source = "path/to/connectivity_check"
  
  enable_monitoring    = true
  monitoring_schedule  = "rate(5 minutes)"
  cloudwatch_namespace = "myapp/connectivity/dev"
  
  monitoring_targets = [
    { host = "api.example.com", port = 443, protocol = "https", critical = true }
  ]
  
  alarm_sns_topic_arns = [aws_sns_topic.alerts.arn]
}

Backward Compatibility

All monitoring features are optional. Existing module usage continues to work unchanged.

…trics and alarms

- Add monitoring.tf with EventBridge scheduling and CloudWatch alarms
- Extend Lambda handler to publish EndpointConnectivity and EndpointLatency metrics
- Add @aws-sdk/client-cloudwatch dependency to package.json
- Add monitoring configuration variables (enable_monitoring, monitoring_schedule, monitoring_targets, alarm_sns_topic_arns)
- Create per-critical-endpoint alarms, aggregate alarm, and Lambda error alarm
- All monitoring features are optional and backward compatible

Task: Enable proactive monitoring for Janus external dependencies after identity service outage incident

smayberry · 2026-01-20T14:19:53Z

modules/connectivity_check/lambda/handler.ts

    }
+
+    // Include critical flag in result
+    result.critical = target.critical;


Since critical is included in target, and testTcp() and testHttp() already copy fields from target into their result, they could also include critical in their results and avoid having to add it after the fact.
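A minimal sketch of what this suggestion could look like, assuming both test functions build their result from a shared helper. The `Target` and `CheckResult` shapes and the `resultFromTarget` name are hypothetical; only `critical`, `testTcp()` and `testHttp()` come from the PR, and defaulting a missing `critical` to `false` is an assumption.

```typescript
// Hypothetical shapes; the real handler's types may differ.
interface Target {
  host: string;
  port: number;
  protocol: string;
  critical?: boolean;
}

interface CheckResult {
  host: string;
  port: number;
  protocol: string;
  critical: boolean;
  success: boolean;
  latencyMs: number;
}

// Copies the identifying fields from the target, including the critical flag,
// so callers never need to patch `result.critical` after the fact.
function resultFromTarget(target: Target, success: boolean, latencyMs: number): CheckResult {
  return {
    host: target.host,
    port: target.port,
    protocol: target.protocol,
    // Assumption: targets that omit the flag are treated as non-critical.
    critical: target.critical ?? false,
    success,
    latencyMs,
  };
}
```

With a helper like this, testTcp() and testHttp() would each end with something like `return resultFromTarget(target, ok, elapsedMs)`, and the post-hoc assignment in the diff above would no longer be needed.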

kevin-secrist · 2026-01-20T15:08:05Z

modules/connectivity_check/variables.tf

+variable "monitoring_schedule" {
+  description = "EventBridge schedule expression for monitoring (e.g., 'rate(5 minutes)')"
+  type        = string
+  default     = "rate(5 minutes)"


More granularity might be a good idea. As configured, it'll take 10 minutes for the alarm to fire (a 5-minute schedule with 2 evaluation periods). Running it every minute and requiring 3 or 4 evaluation periods would trigger the alarm much sooner and with a lower false-positive rate.
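The arithmetic behind this comment can be sketched as follows. This is only an approximation: it assumes the alarm period equals the schedule interval and ignores any delay in metric delivery or alarm evaluation. Both function names are hypothetical.

```typescript
// Converts an EventBridge rate expression such as "rate(5 minutes)" to minutes.
// Cron expressions are not handled in this sketch.
function parseRateMinutes(expression: string): number {
  const match = /^rate\((\d+) (minute|minutes|hour|hours|day|days)\)$/.exec(expression);
  if (match === null) {
    throw new Error(`Unsupported schedule expression: ${expression}`);
  }
  const value = Number(match[1]);
  const unit = match[2];
  if (unit.startsWith("minute")) return value;
  if (unit.startsWith("hour")) return value * 60;
  return value * 60 * 24; // days
}

// Approximate time until the alarm fires: every evaluation period must be
// breaching, so it is the schedule interval times the number of periods.
function minutesUntilAlarm(schedule: string, evaluationPeriods: number): number {
  return parseRateMinutes(schedule) * evaluationPeriods;
}

console.log(minutesUntilAlarm("rate(5 minutes)", 2)); // 10 (current defaults)
console.log(minutesUntilAlarm("rate(1 minute)", 3)); // 3
console.log(minutesUntilAlarm("rate(1 minute)", 4)); // 4
```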

kevin-secrist · 2026-01-20T16:30:28Z

modules/connectivity_check/variables.tf

+variable "cloudwatch_namespace" {
+  description = "CloudWatch namespace for custom metrics"
+  type        = string
+  default     = "janus/connectivity"


Suggested change

-  default     = "janus/connectivity"
+  default     = "connectivity"

Or make the default empty string and it can be the switch to turn metrics on/off

Another option is hard-coding it so that all consumers share the same namespace. Then the metrics will be consistent when they get exported to Datadog, which might be useful for making cross-account/cross-team dashboards, since at that point they'll also (I think) be tagged with the account number, environment, etc.

kevin-secrist · 2026-01-20T16:59:49Z

modules/connectivity_check/monitoring.tf

+  description = "Number of periods to evaluate for alarms"
+  type        = number
+  default     = 2
+}


The variables in this file can be removed

kevin-secrist · 2026-01-20T17:01:27Z

modules/connectivity_check/monitoring.tf

+    targets                = var.monitoring_targets
+    publishMetrics         = true
+    cloudwatchNamespace    = var.cloudwatch_namespace
+    failOnConnectivityLoss = false  # Don't fail Lambda, just publish metrics


If this is always false we don't need the code for it. For now I think it makes sense to just remove it, if it's not being used. We're generally going to be using datadog for monitoring the metrics anyway, since those will be available.

kevin-secrist · 2026-01-20T17:01:49Z

modules/connectivity_check/monitoring.tf

+  arn   = module.lambda.lambda_function_arn
+  input = jsonencode({
+    targets                = var.monitoring_targets
+    publishMetrics         = true


This should be configurable, or it should depend on if cloudwatch_namespace is empty.
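One way the handler side of this could look, as a sketch only: treat an empty namespace as "metrics off", and otherwise honor an explicit flag. The `MonitorEvent` shape mirrors the keys in the `jsonencode` input above, but the `shouldPublishMetrics` name and the precedence between the two settings are assumptions.

```typescript
// Mirrors the event keys from the EventBridge target input; all optional here
// so that on-demand invocations without monitoring config still type-check.
interface MonitorEvent {
  targets: unknown[];
  publishMetrics?: boolean;
  cloudwatchNamespace?: string;
}

// Decides whether metrics should be published for this invocation.
// Assumption: an empty or missing namespace always disables publishing;
// otherwise publishMetrics is honored and defaults to true.
function shouldPublishMetrics(event: MonitorEvent): boolean {
  const namespace = (event.cloudwatchNamespace ?? "").trim();
  if (namespace === "") {
    return false;
  }
  return event.publishMetrics ?? true;
}
```

On the Terraform side, the equivalent would be to derive publishMetrics from whether var.cloudwatch_namespace is empty rather than hard-coding true.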

kevin-secrist · 2026-01-20T17:20:12Z

modules/connectivity_check/monitoring.tf

+
+# CloudWatch alarms for critical endpoint failures
+# Creates one alarm per critical target
+resource "aws_cloudwatch_metric_alarm" "critical_endpoint_failure" {


Might be good to try out what it'd look like to consume these in datadog as a PoC. I think we'd probably want monitors on all the hosts, but have different severities based on if the host is critical or not. Inevitably we may end up making a dashboard that aggregates all of these across all accounts.

Also might be worth looking into whether these are necessary at all. We export metrics to Datadog, so we could do the alerting on that side. Maybe skip making the CloudWatch alarms if var.alarm_sns_topic_arns is empty? That gives consumers a choice if they want to put Datadog into the equation. Without Datadog, I guess you'd put an SNS topic that hits alertops or something here.

smayberry reviewed Jan 20, 2026

kevin-secrist reviewed Jan 20, 2026

4 participants
