-
Notifications
You must be signed in to change notification settings - Fork 14
Add Application Insights telemetry collection specification document #209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,306 @@ | ||||||||||||||||
| # Application Insights Telemetry Collection Specification | ||||||||||||||||
|
|
||||||||||||||||
| ## Overview | ||||||||||||||||
| This document specifies all telemetry data points to be collected by Application Insights for the DocumentDB Kubernetes Operator. These metrics provide operational insights, usage patterns, and error tracking for operator deployments. | ||||||||||||||||
|
|
||||||||||||||||
| --- | ||||||||||||||||
|
|
||||||||||||||||
| ## 1. Operator Lifecycle Metrics | ||||||||||||||||
|
|
||||||||||||||||
| ### Operator Startup Events | ||||||||||||||||
| - **Event**: `OperatorStartup` | ||||||||||||||||
| - **Properties**: | ||||||||||||||||
| - `operator_version`: Semantic version of the operator | ||||||||||||||||
| - `kubernetes_version`: K8s cluster version | ||||||||||||||||
| - `cloud_provider`: Detected environment (`aks`, `eks`, `gke`, `unknown`) | ||||||||||||||||
| - `startup_timestamp`: ISO 8601 timestamp | ||||||||||||||||
| - `restart_count`: Number of restarts in the last hour | ||||||||||||||||
| - `helm_chart_version`: Version of the Helm chart used (if applicable) | ||||||||||||||||
|
|
||||||||||||||||
| ### Operator Health Checks | ||||||||||||||||
| - **Metric**: `operator.health.status` | ||||||||||||||||
| - **Value**: `1` (healthy) or `0` (unhealthy) | ||||||||||||||||
| - **Frequency**: Every 60 seconds | ||||||||||||||||
| - **Dimensions**: `pod_name`, `namespace` | ||||||||||||||||
|
|
||||||||||||||||
| --- | ||||||||||||||||
|
|
||||||||||||||||
| ## 2. Cluster Management Metrics | ||||||||||||||||
|
|
||||||||||||||||
| ### Cluster Count & Configuration | ||||||||||||||||
| - **Metric**: `documentdb.clusters.active.count` | ||||||||||||||||
| - **Description**: Total number of active DocumentDB clusters managed by the operator | ||||||||||||||||
| - **Dimensions**: | ||||||||||||||||
| - `namespace_hash`: SHA-256 hash of the Kubernetes namespace | ||||||||||||||||
| - `cloud_provider`: `aks`, `eks`, `gke` | ||||||||||||||||
| - `environment`: `aks`, `eks`, `gke` (from spec.environment) | ||||||||||||||||
|
|
||||||||||||||||
| ### Cluster Size Metrics | ||||||||||||||||
| - **Metric**: `documentdb.cluster.configuration` | ||||||||||||||||
| - **Properties per cluster**: | ||||||||||||||||
| - `cluster_id`: Auto-generated GUID for the DocumentDB cluster (for correlation without PII) | ||||||||||||||||
| - `namespace_hash`: SHA-256 hash of the Kubernetes namespace | ||||||||||||||||
| - `node_count`: Number of nodes (currently always 1) | ||||||||||||||||
|
||||||||||||||||
| - `node_count`: Number of nodes (currently always 1) | |
| - `node_count` (optional): Number of nodes in the cluster; omit this property while the operator only supports a single node |
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric name uses "cross_cloud_networking_strategy" which includes "AzureFleet" as a value, but line 59 refers to it as "participating_cluster_count" suggesting multi-cluster coordination. However, "AzureFleet" is Azure-specific terminology. Consider whether "cross_cloud_networking_strategy" is the appropriate name when one of the values is cloud-specific, or if this should be named more generically (e.g., "multi_cluster_networking_strategy").
| - `cross_cloud_networking_strategy`: `AzureFleet`, `Istio`, `None` | |
| - `multi_cluster_networking_strategy`: `AzureFleet`, `Istio`, `None` |
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The property is_primary_cluster (line 114) is inconsistent with the naming pattern used elsewhere. Other boolean indicators in the specification use verb-based naming like tls_enabled (line 75), server_tls_enabled (line 217), client_tls_enabled (line 218), or sidecar_injector_plugin_enabled (line 229). For consistency, consider renaming this to primary_cluster_backup or from_primary_cluster.
| - `is_primary_cluster`: Boolean indicating if backup from primary | |
| - `from_primary_cluster`: Boolean indicating if backup was taken from primary cluster |
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification lists backup_phase values as starting, running, completed, failed, skipped (line 112), and restore_phase values as starting, running, completed, failed (line 145). The restore phase is missing the skipped state. For consistency and completeness, clarify whether restore operations can be skipped, and if so, include this state in the restore_phase values.
| - `restore_phase`: `starting`, `running`, `completed`, `failed` | |
| - `restore_phase`: `starting`, `running`, `completed`, `failed`, `skipped` |
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification includes both "old_primary_index" and "new_primary_index" properties with example values "e.g., 0, 1, 2" suggesting zero-based indexing. However, this conflicts with the DocumentDB cluster architecture where instances_per_node ranges from 1-3 (line 44). Clarify whether these indices are zero-based (0-2 for 3 instances) or one-based (1-3), and ensure consistency with how instances are identified elsewhere in the system.
| - `old_primary_index`: Index of the previous primary instance (e.g., 0, 1, 2) | |
| - `new_primary_index`: Index of the new primary instance | |
| - `old_primary_index`: Zero-based index (instance ordinal) of the previous primary instance (`0..instances_per_node-1`, e.g., `0, 1, 2` for 3 instances) | |
| - `new_primary_index`: Zero-based index (instance ordinal) of the new primary instance (`0..instances_per_node-1`, e.g., `0, 1, 2` for 3 instances) |
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric documentdb.replication.lag.bytes specifies "aggregated over 2-hour windows" with statistics (min, max, avg) reported every 2 hours. This aggregation period seems quite long for a critical replication health metric. High replication lag can indicate serious issues, and a 2-hour reporting delay could mask problems. Consider documenting the rationale for this long aggregation window, or whether a shorter interval (e.g., 5-15 minutes) would be more appropriate for operational monitoring.
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error type list includes tls-cert but there is no corresponding section documenting TLS certificate-related errors or their handling. Given that TLS is a critical security feature (tracked in lines 75, 212-219), certificate errors deserve explicit documentation. Consider adding details about what types of TLS certificate errors will be tracked (e.g., expiration, validation failures, missing certificates).
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification states that error messages should be "Sanitized" with "no PII" (line 186), but doesn't provide clear guidance on how to sanitize error messages or what constitutes PII in error contexts. This ambiguity could lead to inconsistent implementations. Consider adding specific examples of sanitization patterns, such as removing resource names, IP addresses, file paths, or user identifiers, and possibly referencing the error categorization approach used elsewhere (line 205).
| - `error_message`: Sanitized error message (no PII) | |
| - `error_message`: Sanitized error message (no PII). The message MUST: | |
| - avoid including raw Kubernetes resource names, namespaces, node names, IP addresses, hostnames, file paths, usernames, email addresses, cloud account IDs, or any token/secret values | |
| - be derived from a stable error category and high-level description (for example, "PVC provisioning failed" or "TLS certificate validation error") rather than raw provider/library error strings | |
| - be safe to log in multi-tenant environments | |
| - when in doubt, prefer mapping to a coarse-grained description based on `error_type` and `error_code` |
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification includes both wal_replica_plugin_enabled (line 230) as a telemetry property, but this plugin is not mentioned anywhere else in the specification. Unlike the sidecar_injector_plugin which is referenced in cluster creation events (line 77), there's no context for what the WAL replica plugin does or when it's relevant. Consider adding documentation about this plugin's purpose and usage context, or if it's not yet implemented, note it as a future feature.
| - **Properties**: | |
| - `sidecar_injector_plugin_enabled`: Boolean indicating if plugin is used | |
| - `wal_replica_plugin_enabled`: Boolean indicating if plugin is used | |
| - **Description**: Tracks usage of optional operator plugins that extend core functionality. | |
| - **Properties**: | |
| - `sidecar_injector_plugin_enabled`: Boolean indicating whether the sidecar injector plugin is enabled for the operator (e.g., for injecting supporting sidecars into DocumentDB pods). | |
| - `wal_replica_plugin_enabled`: Boolean indicating whether the WAL replica plugin is enabled. This is reserved for a future/experimental plugin that manages write-ahead-log (WAL) replication behavior; in operator versions where this plugin is not implemented, this flag MUST remain `false`. |
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification includes vmware-tanzu in the kubernetes_distribution list, but other distributions use lowercase names like openshift and rancher. For consistency, consider whether this should be vmware-tanzu, vmware_tanzu, or tanzu to match the naming convention used for other distributions.
| - `kubernetes_distribution`: `aks`, `eks`, `gke`, `openshift`, `rancher`, `vmware-tanzu`, `other` | |
| - `kubernetes_distribution`: `aks`, `eks`, `gke`, `openshift`, `rancher`, `tanzu`, `other` |
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification warns to "Monitor dimension cardinality to avoid explosion" (line 293) but doesn't provide specific guidance on acceptable cardinality limits or which dimensions pose the highest risk. Given that cluster_id, backup_id, and namespace_hash are all unique identifiers that could create high cardinality, the specification should provide concrete thresholds or best practices. For example, specify maximum expected values for each dimension or identify which combinations create cardinality concerns.
| 3. **Cardinality**: Monitor dimension cardinality to avoid explosion | |
| 3. **Cardinality**: Monitor dimension cardinality to avoid explosion and apply the following rules: | |
| - Treat `cluster_id`, `backup_id`, and `operator_namespace_hash` as **high-cardinality identifiers**. For any single metric, do not use more than one of these as a dimension; additional identifiers SHOULD be sent as metric properties instead of dimensions. | |
| - For each metric, target **< 1,000** distinct dimension value combinations per hour. If a metric consistently exceeds **5,000** distinct combinations per hour, remove or coarsen one or more dimensions (for example, drop `backup_id` as a dimension and keep only `cluster_id`). | |
| - Do not use raw pod names, node names, or per-request identifiers (such as request IDs or connection IDs) as dimensions; if needed, include them only as sampled properties on diagnostic events. |
Copilot
AI
Feb 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification states that GUIDs should be "generated and persisted in resource annotations" at resource creation time (line 297), but doesn't specify what happens if the annotation is missing or corrupted. For a production system, the specification should address how to handle cases where telemetry annotations are absent (e.g., for resources created before telemetry was implemented, or if annotations are accidentally deleted). Consider adding guidance on whether to generate new GUIDs on-the-fly or skip telemetry for such resources.
| 6. **GUID Generation**: Generate and persist GUIDs in resource annotations (`telemetry.documentdb.io/cluster-id`) at resource creation time | |
| 6. **GUID Generation**: Generate and persist GUIDs in resource annotations (`telemetry.documentdb.io/cluster-id`) at resource creation time. On read, if the annotation is missing or contains an invalid GUID (e.g., legacy resources or manual edits), the operator MUST generate a new GUID, persist it back to the annotation, and use that value for all subsequent telemetry for that resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dimension name
environmentduplicates thecloud_providerdimension on line 35. According to line 36, "environment" comes from "spec.environment", but line 35 already captures cloud provider information. This creates ambiguity - if both dimensions capture the same values (aks, eks, gke), one should be removed or their distinction should be clarified. If they serve different purposes, the specification should explain what "spec.environment" represents and how it differs from the detected cloud_provider.