Add Application Insights telemetry collection specification document #209

Ritvik-Jayaswal · 2026-01-15T20:06:52Z

This document outlines the telemetry data points collected by Application Insights for the DocumentDB Kubernetes Operator, detailing metrics for operator lifecycle, cluster management, lifecycle operations, backup and restore, failover, error tracking, feature usage, performance, compliance, and deployment context.

xgerman · 2026-01-15T22:51:06Z

docs/designs/appinsights-design.md

+- **Properties**:
+  - `operator_version`: Semantic version of the operator
+  - `kubernetes_version`: K8s cluster version
+  - `cloud_provider`: Detected environment (`aks`, `eks`, `gke`, `unknown`)


openshift, rancher, vmware, etc. as well
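One way to widen the detected set is to look at node labels. The sketch below is an illustration only: the function name `detect_distribution` and the label prefixes in `LABEL_PREFIXES` are assumptions that would need to be verified against real clusters, and Rancher and VMware entries are deliberately left out because no reliable marker is named in this thread.

```python
from typing import Mapping

# Assumed label prefixes per distribution (hypothetical table, verify before use).
# OpenShift is listed first because it can run on top of any cloud provider.
LABEL_PREFIXES = [
    ("node.openshift.io/", "openshift"),
    ("kubernetes.azure.com/", "aks"),
    ("eks.amazonaws.com/", "eks"),
    ("cloud.google.com/gke-", "gke"),
]


def detect_distribution(node_labels: Mapping[str, str]) -> str:
    """Return a coarse distribution tag derived from node label keys."""
    for prefix, name in LABEL_PREFIXES:
        if any(key.startswith(prefix) for key in node_labels):
            return name
    # Anything unrecognized falls into a single catch-all bucket.
    return "other"
```

Keeping the table as data means new distributions can be added without touching the matching logic.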

xgerman · 2026-01-15T22:51:44Z

docs/designs/appinsights-design.md

+
+### Operator Health Checks
+- **Metric**: `operator.health.status`
+- **Value**: `1` (healthy) or `0` (unhealthy)


What does this mean? If the operator is unhealthy, how can we act on this?

xgerman · 2026-01-15T22:52:05Z

docs/designs/appinsights-design.md

+- **Metric**: `operator.health.status`
+- **Value**: `1` (healthy) or `0` (unhealthy)
+- **Frequency**: Every 60 seconds
+- **Dimensions**: `pod_name`, `namespace`


These are PII; better not to collect them.

xgerman · 2026-01-15T22:52:47Z

docs/designs/appinsights-design.md

+### Cluster Size Metrics
+- **Metric**: `documentdb.cluster.configuration`
+- **Properties per cluster**:
+  - `cluster_name`: Name of the DocumentDB cluster


xgerman · 2026-01-15T22:52:58Z

docs/designs/appinsights-design.md

+- **Metric**: `documentdb.cluster.configuration`
+- **Properties per cluster**:
+  - `cluster_name`: Name of the DocumentDB cluster
+  - `namespace`: Kubernetes namespace


xgerman · 2026-01-15T23:03:08Z

docs/designs/appinsights-design.md

+- **Metric**: `documentdb.replication.lag.bytes`
+- **Description**: Replication lag in bytes
+- **Dimensions**: `cluster_name`, `replica_cluster`, `namespace`
+- **Frequency**: Every 30 seconds


let's do a tuple (min, max, avg) over a couple of hours
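A minimal sketch of the suggested (min, max, avg) tuple, assuming lag samples are still taken frequently in-process and only the summary is emitted when the window closes. The class name `LagWindow` and its methods are hypothetical and not part of the specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LagWindow:
    """Running min/max/avg of replication lag samples for one reporting window."""

    count: int = 0
    total: int = 0
    minimum: Optional[int] = None
    maximum: Optional[int] = None

    def record(self, lag_bytes: int) -> None:
        # Update running statistics without storing every sample.
        self.count += 1
        self.total += lag_bytes
        self.minimum = lag_bytes if self.minimum is None else min(self.minimum, lag_bytes)
        self.maximum = lag_bytes if self.maximum is None else max(self.maximum, lag_bytes)

    def flush(self) -> Optional[Tuple[int, int, float]]:
        """Return (min, max, avg) for the window and reset; None if no samples."""
        if self.count == 0:
            return None
        summary = (self.minimum, self.maximum, self.total / self.count)
        self.count, self.total, self.minimum, self.maximum = 0, 0, None, None
        return summary
```

A caller would invoke `record` on every sample and `flush` once per window (for example every couple of hours), sending only the returned tuple.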

xgerman · 2026-01-15T23:08:14Z

docs/designs/appinsights-design.md

+
+### Cluster Environment
+- **Properties** (collected once at startup, attached to all events):
+  - `kubernetes_distribution`: `aks`, `eks`, `gke`, `openshift`, `other`


yeah, openshift

xgerman · 2026-01-15T23:08:49Z

docs/designs/appinsights-design.md

+
+## Data Privacy & Security
+
+- **No PII**: Do not collect usernames, passwords, connection strings, or IP addresses


Cluster name, namespace, and storage class can also contain PII; users would assume they are private.

xgerman · 2026-01-15T23:11:11Z

docs/designs/appinsights-design.md

+3. **Cardinality**: Monitor dimension cardinality to avoid explosion
+4. **Retry logic**: Implement exponential backoff for telemetry submission failures
+5. **Local buffering**: Buffer events locally if Application Insights is unreachable
+6. **Health endpoint**: Expose `/metrics` endpoint for Prometheus scraping


Why would application monitoring need a Prometheus endpoint?
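For reference, items 4 and 5 of the quoted list (exponential backoff and local buffering) could look roughly like the sketch below. The names `backoff_delays` and `TelemetryBuffer`, the default limits, and the `send` callback are assumptions for illustration; this is not the operator's implementation and makes no call to any Application Insights SDK.

```python
from collections import deque
from typing import Callable, Deque, Iterator


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 5) -> Iterator[float]:
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


class TelemetryBuffer:
    """Bounded in-memory buffer; the oldest events are dropped when it is full."""

    def __init__(self, max_events: int = 1000) -> None:
        self._events: Deque[dict] = deque(maxlen=max_events)

    def add(self, event: dict) -> None:
        self._events.append(event)

    def drain(self, send: Callable[[dict], bool]) -> int:
        """Send buffered events in order; stop at the first failure."""
        sent = 0
        while self._events:
            if not send(self._events[0]):
                break  # Keep the event for the next retry.
            self._events.popleft()
            sent += 1
        return sent
```

A production version would likely add random jitter to each delay so that many operators do not retry in lockstep.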

xgerman · 2026-01-15T23:11:49Z

docs/designs/appinsights-design.md

+
+- **No PII**: Do not collect usernames, passwords, connection strings, or IP addresses
+- **Sanitize errors**: Remove sensitive data from error messages
+- **Cluster names**: Use hashed cluster names if privacy required


We need to use a GUID that we generate for that, so we can correlate events properly.
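A minimal sketch of the generated-GUID idea, assuming one random UUID per resource that is reused on every event for that resource. `ResourceIdRegistry` is a hypothetical name, and the in-memory dictionary is a simplification: a real operator would need to persist the ID somewhere that survives restarts.

```python
import uuid
from typing import Dict, Tuple


class ResourceIdRegistry:
    """Maps (namespace, name) to a stable random GUID used in telemetry."""

    def __init__(self) -> None:
        # Assumption: in-memory only; persistence is out of scope for this sketch.
        self._ids: Dict[Tuple[str, str], str] = {}

    def id_for(self, namespace: str, name: str) -> str:
        key = (namespace, name)
        if key not in self._ids:
            # uuid4 is random, so the ID reveals nothing about the names.
            self._ids[key] = str(uuid.uuid4())
        return self._ids[key]
```

Unlike a hash of the cluster name, a random GUID cannot be reversed by guessing likely names, while still letting all events for one resource be joined.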

alaye-ms · 2026-01-22T16:46:33Z

docs/designs/appinsights-design.md

+- **Properties** (collected once at startup, attached to all events):
+  - `kubernetes_distribution`: `aks`, `eks`, `gke`, `openshift`, `other`
+  - `kubernetes_version`: K8s version
+  - `region`: Cloud region (if detectable)


You might be able to pull that from the pod.metadata.label["topology.kubernetes.io/region"]
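A small sketch of that lookup, written against a plain label dictionary so it does not depend on any client library. The function name `region_from_labels` is hypothetical. Note that `topology.kubernetes.io/region` is a well-known label normally set on Node objects, so the operator may need to read the labels of the node its pod is scheduled on rather than the pod's own labels.

```python
from typing import Mapping, Optional

REGION_LABEL = "topology.kubernetes.io/region"
# Deprecated predecessor of the label above, still present on some older clusters.
LEGACY_REGION_LABEL = "failure-domain.beta.kubernetes.io/region"


def region_from_labels(labels: Mapping[str, str]) -> Optional[str]:
    """Return the region from well-known topology labels, or None if absent."""
    for key in (REGION_LABEL, LEGACY_REGION_LABEL):
        value = labels.get(key)
        if value:
            return value
    return None
```

Returning `None` keeps the "if detectable" wording of the spec: the `region` property is simply omitted when no label is present.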

Copilot

Pull request overview

This PR introduces a comprehensive telemetry specification document for the DocumentDB Kubernetes Operator, defining Application Insights metrics and events to be collected for operational monitoring, usage analysis, and error tracking.

Changes:

- Added detailed specification document covering 10 categories of telemetry: operator lifecycle, cluster management, lifecycle operations, backup/restore, failover/HA, error tracking, feature usage, performance, compliance, and deployment context
- Defined data privacy and security guidelines including PII protection through GUID-based identifiers and namespace hashing
- Included implementation notes for sampling, batching, cardinality management, and retry logic
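The namespace hashing mentioned above can be sketched with the standard library. The `salt` parameter is an assumption that is not in the specification, which only names SHA-256; it is included because an unsalted hash of a common namespace such as `default` can be recovered by hashing a list of likely names.

```python
import hashlib


def namespace_hash(namespace: str, salt: bytes = b"") -> str:
    """Return the hex SHA-256 digest of an (optionally salted) namespace name."""
    return hashlib.sha256(salt + namespace.encode("utf-8")).hexdigest()
```

With the default empty salt the result is the plain SHA-256 of the namespace; a per-installation salt would make hashes incomparable across installations, which may or may not be desired for analysis.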

Copilot · 2026-02-02T17:11:16Z

docs/designs/appinsights-metrics.md

+  - `resource_id`: Auto-generated GUID of the resource
+  - `namespace_hash`: SHA-256 hash of the Kubernetes namespace
+  - `error_type`: `cluster-creation`, `backup-failure`, `restore-failure`, `volume-snapshot`, `replication-config`, `tls-cert`
+  - `error_message`: Sanitized error message (no PII)


The specification states that error messages should be "Sanitized" with "no PII" (line 186), but doesn't provide clear guidance on how to sanitize error messages or what constitutes PII in error contexts. This ambiguity could lead to inconsistent implementations. Consider adding specific examples of sanitization patterns, such as removing resource names, IP addresses, file paths, or user identifiers, and possibly referencing the error categorization approach used elsewhere (line 205).

Suggested change

-  - `error_message`: Sanitized error message (no PII)
+  - `error_message`: Sanitized error message (no PII). The message MUST:
+    - avoid including raw Kubernetes resource names, namespaces, node names, IP addresses, hostnames, file paths, usernames, email addresses, cloud account IDs, or any token/secret values
+    - be derived from a stable error category and high-level description (for example, "PVC provisioning failed" or "TLS certificate validation error") rather than raw provider/library error strings
+    - be safe to log in multi-tenant environments
+    - when in doubt, prefer mapping to a coarse-grained description based on `error_type` and `error_code`
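The "map to a coarse-grained description" rule above could be sketched as follows. The `error_type` keys come from the quoted spec; the message strings other than "TLS certificate validation error", the function names, and the regular expressions are assumptions for illustration, and the pattern check is a safety net for tests rather than a complete PII detector.

```python
import re

# Fixed, reviewable messages keyed by error_type; raw error strings are never forwarded.
COARSE_MESSAGES = {
    "cluster-creation": "Cluster creation failed",
    "backup-failure": "Backup failed",
    "restore-failure": "Restore failed",
    "volume-snapshot": "Volume snapshot operation failed",
    "replication-config": "Replication configuration error",
    "tls-cert": "TLS certificate validation error",
}

# Rough patterns for IPv4 addresses, email addresses, and file paths.
_SENSITIVE = [
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    re.compile(r"(?:/[\w.-]+){2,}"),
]


def sanitized_error_message(error_type: str) -> str:
    """Return the fixed message for an error_type, never the raw error text."""
    return COARSE_MESSAGES.get(error_type, "Unclassified error")


def contains_obvious_pii(text: str) -> bool:
    """Heuristic check used to assert that emitted messages look safe."""
    return any(pattern.search(text) for pattern in _SENSITIVE)
```

Because every emitted string comes from a static table, the privacy review only has to cover the table, not every code path that can raise an error.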

Copilot · 2026-02-02T17:11:17Z

docs/designs/appinsights-metrics.md

+  - `namespace_hash`: SHA-256 hash of the Kubernetes namespace
+  - `restore_duration_seconds`: Time to restore from backup
+  - `backup_age_hours`: Age of backup at restore time
+  - `restore_phase`: `starting`, `running`, `completed`, `failed`


The specification lists backup_phase values as starting, running, completed, failed, skipped (line 112), and restore_phase values as starting, running, completed, failed (line 145). The restore phase is missing the skipped state. For consistency and completeness, clarify whether restore operations can be skipped, and if so, include this state in the restore_phase values.

Suggested change

-  - `restore_phase`: `starting`, `running`, `completed`, `failed`
+  - `restore_phase`: `starting`, `running`, `completed`, `failed`, `skipped`

Copilot · 2026-02-02T17:11:17Z

docs/designs/appinsights-metrics.md

+  - `old_primary_index`: Index of the previous primary instance (e.g., 0, 1, 2)
+  - `new_primary_index`: Index of the new primary instance


The specification includes both "old_primary_index" and "new_primary_index" properties with example values "e.g., 0, 1, 2" suggesting zero-based indexing. However, this conflicts with the DocumentDB cluster architecture where instances_per_node ranges from 1-3 (line 44). Clarify whether these indices are zero-based (0-2 for 3 instances) or one-based (1-3), and ensure consistency with how instances are identified elsewhere in the system.

Suggested change

-  - `old_primary_index`: Index of the previous primary instance (e.g., 0, 1, 2)
-  - `new_primary_index`: Index of the new primary instance
+  - `old_primary_index`: Zero-based index (instance ordinal) of the previous primary instance (`0..instances_per_node-1`, e.g., `0, 1, 2` for 3 instances)
+  - `new_primary_index`: Zero-based index (instance ordinal) of the new primary instance (`0..instances_per_node-1`, e.g., `0, 1, 2` for 3 instances)

Copilot · 2026-02-02T17:11:17Z

docs/designs/appinsights-metrics.md

+- **Metric**: `documentdb.replication.lag.bytes`
+- **Description**: Replication lag in bytes (aggregated over 2-hour windows)
+- **Dimensions**: `cluster_id`, `replica_cluster_id`, `namespace_hash`
+- **Statistics**: min, max, avg (reported as tuple)
+- **Frequency**: Every 2 hours (aggregated)


The metric documentdb.replication.lag.bytes specifies "aggregated over 2-hour windows" with statistics (min, max, avg) reported every 2 hours. This aggregation period seems quite long for a critical replication health metric. High replication lag can indicate serious issues, and a 2-hour reporting delay could mask problems. Consider documenting the rationale for this long aggregation window, or whether a shorter interval (e.g., 5-15 minutes) would be more appropriate for operational monitoring.

Copilot · 2026-02-02T17:11:18Z

docs/designs/appinsights-metrics.md

+- **Properties**:
+  - `sidecar_injector_plugin_enabled`: Boolean indicating if plugin is used
+  - `wal_replica_plugin_enabled`: Boolean indicating if plugin is used


The specification includes wal_replica_plugin_enabled (line 230) as a telemetry property, but this plugin is not mentioned anywhere else in the specification. Unlike the sidecar_injector_plugin, which is referenced in cluster creation events (line 77), there's no context for what the WAL replica plugin does or when it's relevant. Consider adding documentation about this plugin's purpose and usage context, or if it's not yet implemented, note it as a future feature.

Suggested change

-- **Properties**:
-  - `sidecar_injector_plugin_enabled`: Boolean indicating if plugin is used
-  - `wal_replica_plugin_enabled`: Boolean indicating if plugin is used
+- **Description**: Tracks usage of optional operator plugins that extend core functionality.
+- **Properties**:
+  - `sidecar_injector_plugin_enabled`: Boolean indicating whether the sidecar injector plugin is enabled for the operator (e.g., for injecting supporting sidecars into DocumentDB pods).
+  - `wal_replica_plugin_enabled`: Boolean indicating whether the WAL replica plugin is enabled. This is reserved for a future/experimental plugin that manages write-ahead-log (WAL) replication behavior; in operator versions where this plugin is not implemented, this flag MUST remain `false`.

Copilot · 2026-02-02T17:11:19Z

docs/designs/appinsights-metrics.md

+- **Dimensions**:
+  - `namespace_hash`: SHA-256 hash of the Kubernetes namespace
+  - `cloud_provider`: `aks`, `eks`, `gke`
+  - `environment`: `aks`, `eks`, `gke` (from spec.environment)


The dimension name environment duplicates the cloud_provider dimension on line 35. According to line 36, "environment" comes from "spec.environment", but line 35 already captures cloud provider information. This creates ambiguity - if both dimensions capture the same values (aks, eks, gke), one should be removed or their distinction should be clarified. If they serve different purposes, the specification should explain what "spec.environment" represents and how it differs from the detected cloud_provider.

Suggested change

-  - `environment`: `aks`, `eks`, `gke` (from spec.environment)
+  - `environment`: Logical deployment environment (e.g., `dev`, `staging`, `prod`) from `spec.environment`, distinct from `cloud_provider`

Copilot · 2026-02-02T17:11:20Z

docs/designs/appinsights-metrics.md

+- **Value**: `1` (enabled) or `0` (disabled)
+- **Properties**:
+  - `cluster_id`: Auto-generated GUID for the DocumentDB cluster
+  - `cross_cloud_networking_strategy`: `AzureFleet`, `Istio`, `None`


The metric name uses "cross_cloud_networking_strategy" which includes "AzureFleet" as a value, but line 59 refers to it as "participating_cluster_count" suggesting multi-cluster coordination. However, "AzureFleet" is Azure-specific terminology. Consider whether "cross_cloud_networking_strategy" is the appropriate name when one of the values is cloud-specific, or if this should be named more generically (e.g., "multi_cluster_networking_strategy").

Suggested change

-  - `cross_cloud_networking_strategy`: `AzureFleet`, `Istio`, `None`
+  - `multi_cluster_networking_strategy`: `AzureFleet`, `Istio`, `None`

Copilot · 2026-02-02T17:11:20Z

docs/designs/appinsights-metrics.md

+- **Properties per cluster**:
+  - `cluster_id`: Auto-generated GUID for the DocumentDB cluster (for correlation without PII)
+  - `namespace_hash`: SHA-256 hash of the Kubernetes namespace
+  - `node_count`: Number of nodes (currently always 1)


The specification states "Number of nodes (currently always 1)" which suggests this is a hard-coded value in the current implementation. Recording a telemetry metric that is always 1 provides no analytical value and wastes storage. Consider either removing this property until multi-node support is implemented, or clarifying the purpose of collecting this constant value.

Suggested change

-  - `node_count`: Number of nodes (currently always 1)
+  - `node_count` (optional): Number of nodes in the cluster; omit this property while the operator only supports a single node

Copilot · 2026-02-02T17:11:20Z

docs/designs/appinsights-metrics.md

+  - `retention_days`: Configured retention period
+  - `backup_phase`: `starting`, `running`, `completed`, `failed`, `skipped`
+  - `cloud_provider`: Environment where backup was taken
+  - `is_primary_cluster`: Boolean indicating if backup from primary


The property is_primary_cluster (line 114) is inconsistent with the naming pattern used elsewhere. Other boolean indicators in the specification use verb-based naming like tls_enabled (line 75), server_tls_enabled (line 217), client_tls_enabled (line 218), or sidecar_injector_plugin_enabled (line 229). For consistency, consider renaming this to primary_cluster_backup or from_primary_cluster.

Suggested change

-  - `is_primary_cluster`: Boolean indicating if backup from primary
+  - `from_primary_cluster`: Boolean indicating if backup was taken from primary cluster

Copilot · 2026-02-02T17:11:21Z

docs/designs/appinsights-metrics.md

+  - `resource_type`: `DocumentDB`, `Backup`, `ScheduledBackup`
+  - `resource_id`: Auto-generated GUID of the resource
+  - `namespace_hash`: SHA-256 hash of the Kubernetes namespace
+  - `error_type`: `cluster-creation`, `backup-failure`, `restore-failure`, `volume-snapshot`, `replication-config`, `tls-cert`


The error type list includes tls-cert but there is no corresponding section documenting TLS certificate-related errors or their handling. Given that TLS is a critical security feature (tracked in lines 75, 212-219), certificate errors deserve explicit documentation. Consider adding details about what types of TLS certificate errors will be tracked (e.g., expiration, validation failures, missing certificates).

added Appinsights doc

db6ad55

Ritvik-Jayaswal requested review from alaye-ms, hossain-rayhan and xgerman as code owners January 15, 2026 20:06

Ritvik-Jayaswal mentioned this pull request Jan 15, 2026

AppInsight metrics #159

Open

xgerman requested changes Jan 15, 2026


alaye-ms reviewed Jan 22, 2026


fixed PII

0165f5c

Copilot AI review requested due to automatic review settings February 2, 2026 17:07

Ritvik-Jayaswal force-pushed the patch-1 branch from 7cdc39b to 0165f5c Compare February 2, 2026 17:07

Copilot started reviewing on behalf of Ritvik-Jayaswal February 2, 2026 17:07

Copilot AI reviewed Feb 2, 2026



