
Conversation

@WentingWu666666 (Collaborator) commented Jan 7, 2026

This PR adds support for retaining PersistentVolumes after DocumentDB cluster deletion, enables recovery from retained PVs, and applies security-hardening mount options to all DocumentDB-associated PVs.

Key Features:

  • PV Retention Policy: Users can now configure persistentVolumeReclaimPolicy: Retain to preserve PersistentVolumes after cluster deletion, enabling disaster-recovery scenarios where data survives even an accidental cluster deletion (see the manifest sketch after this list).
  • PVC Recovery: A new bootstrap option creates a DocumentDB cluster from an existing PVC that holds retained data. This complements the existing backup-based recovery with a faster, snapshot-less recovery path (also sketched below).
  • Security Hardening: The PV controller automatically applies the security mount options nodev, nosuid, and noexec to all DocumentDB-associated PVs, preventing potential privilege-escalation attacks through the filesystem.
  • Disk Encryption Documentation: Added cloud-specific guidance for encryption at rest, highlighting that EKS requires explicit configuration (encrypted: "true" in the StorageClass, see the example below) while AKS and GKE encrypt by default.
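
For illustration, here is a minimal sketch of how the two new fields might be used together. The field paths are inferred from this PR's description (a PersistentVolumeReclaimPolicy field on the storage spec, a PVC field on the recovery configuration) and should be treated as assumptions rather than the authoritative CRD schema:

# Hypothetical manifest sketch; apiVersion, kind, and field paths are inferred
# from this PR's description and may not match the final CRD exactly.
# 1) Source cluster: retain its PVs when the cluster is deleted.
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: source-documentdb
spec:
  storage:
    persistentVolumeReclaimPolicy: Retain   # preserve PVs after cluster deletion
---
# 2) Replacement cluster: bootstrap from a PVC bound to a retained PV.
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: recovered-documentdb
spec:
  bootstrap:
    recovery:
      pvc: retained-data-pvc                # name of the existing PVC (hypothetical)

For the encryption guidance, a conventional StorageClass that opts an EKS cluster into encryption at rest looks like the following; these are standard AWS EBS CSI driver parameters, not something introduced by this PR:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"   # EBS volumes stay unencrypted unless this is set
reclaimPolicy: Retain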

@WentingWu666666 self-assigned this Jan 7, 2026
@WentingWu666666 force-pushed the users/wentingwu/retain-volume branch from f1375f0 to 20aff96 on January 13, 2026 00:58
@WentingWu666666 changed the title from "Retain PVC" to "Retain PVC and restore from PVC" on Jan 14, 2026
@WentingWu666666 changed the title from "Retain PVC and restore from PVC" to "Retain PVC after DocumentDB Deletion and Restore from PVC" on Jan 14, 2026
@WentingWu666666 force-pushed the users/wentingwu/retain-volume branch from 78c3443 to 3107300 on January 15, 2026 18:24
@WentingWu666666 changed the title from "Retain PVC after DocumentDB Deletion and Restore from PVC" to "Retain PV after DocumentDB Deletion and Restore from PVC" on Jan 20, 2026
@WentingWu666666 marked this pull request as ready for review on January 21, 2026 17:35
Copilot AI review requested due to automatic review settings January 21, 2026 17:35
@WentingWu666666 changed the title from "Retain PV after DocumentDB Deletion and Restore from PVC" to "Add PersistentVolume retention, security mount options, and PVC recovery support" on Jan 21, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds support for retaining PersistentVolumes (PVs) after DocumentDB cluster deletion and enables restoration from retained PVCs. It introduces a new PersistentVolumeReconciler controller that manages PV reclaim policies and security mount options based on DocumentDB configuration.

Changes:

  • Added PersistentVolumeReclaimPolicy field to DocumentDB storage spec (Retain/Delete options)
  • Added PVC field to recovery configuration for bootstrapping from existing PVCs
  • Implemented PersistentVolumeReconciler to automatically update PV reclaim policies and apply the security mount options nodev, nosuid, and noexec (the resulting PV spec is sketched below)
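
For reference, this is roughly what a PV spec fragment would look like once the controller has reconciled it; the surrounding fields are illustrative, and only the three mountOptions values come from this PR:

# Illustrative PV fragment; only the mountOptions values are from this PR.
spec:
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - nodev    # ignore device files on the volume
    - nosuid   # ignore setuid/setgid bits
    - noexec   # refuse to execute binaries from the volume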

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

Summary per file:

  • operator/src/internal/controller/pv_controller.go: New controller that reconciles PVs associated with DocumentDB clusters, updating reclaim policies and mount options
  • operator/src/internal/controller/pv_controller_test.go: Comprehensive unit tests for the PV controller, including predicates, ownership chain traversal, and reconciliation logic
  • operator/src/internal/cnpg/cnpg_cluster.go: Updated bootstrap logic to support PVC recovery in addition to backup recovery
  • operator/src/internal/cnpg/cnpg_cluster_test.go: Tests for PVC recovery bootstrap configuration
  • operator/src/internal/cnpg/suite_test.go: Test suite setup for the CNPG package
  • operator/src/api/preview/documentdb_types.go: API additions for PVC recovery and the PersistentVolumeReclaimPolicy field with CEL validation
  • operator/src/api/preview/zz_generated.deepcopy.go: Auto-generated deepcopy code for the new PVC field
  • operator/src/config/crd/bases/documentdb.io_dbs.yaml: CRD updates with PVC recovery and reclaim policy fields
  • operator/documentdb-helm-chart/crds/documentdb.io_dbs.yaml: Corresponding CRD updates for the Helm chart
  • operator/src/config/rbac/role.yaml: Added RBAC permissions for PV and PVC resources
  • operator/documentdb-helm-chart/templates/05_clusterrole.yaml: Added cluster-level PV permissions
  • operator/src/cmd/main.go: Integrated PersistentVolumeReconciler into the manager
  • docs/operator-public-documentation/preview/backup-and-restore.md: Documentation for the PV retention and recovery workflow
  • docs/operator-public-documentation/preview/advanced-configuration/README.md: Added sections on PV security mount options and disk encryption best practices
  • .github/workflows/test-backup-and-restore.yml: Integration tests for PV retention and PVC-based recovery

- Default to Retain for safer database workloads (data preservation)
- Add comprehensive documentation for when to use each option
- Add WARNING about data loss when using Delete
- Update controller default fallback to Retain
- Update tests to reflect new default behavior

Signed-off-by: wenting <wentingwu@microsoft.com>
- Move 'Verify mount options are set by PV controller' test to E2E
- Update 'Test PV reclaim policy default and explicit Delete' to:
  - Verify default Retain policy on existing cluster
  - Patch cluster to Delete policy and verify PV update
  - Verify PV cleanup after cluster deletion

Signed-off-by: wenting <wentingwu@microsoft.com>
- Optimize findPVsForDocumentDB with CNPG cluster label selector
- Use GitHub Actions outputs instead of temp files in workflow
- Add error scenario tests for PV update and API failures
- Improve CEL validation for RecoveryConfiguration
- Clarify PV deletion chain in documentation
- Add compatibility notes for security mount options

Signed-off-by: wenting <wentingwu@microsoft.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

operator/src/internal/controller/pv_controller_test.go:1

  • These predicate functions are defined inline and always return false. Consider extracting them into a named constant or helper function for better reusability and clarity, especially since the pattern is used in multiple predicates.
// Copyright (c) Microsoft Corporation.

pvcList := &corev1.PersistentVolumeClaimList{}
if err := r.List(ctx, pvcList,
    client.InNamespace(documentdb.Namespace),
    client.MatchingLabels{cnpgClusterLabel: documentdb.Name},
Copilot AI commented Feb 2, 2026

The constant cnpgClusterLabel is used here but is defined in documentdb_controller.go (line 49). Consider defining this constant in a shared location or in this file to improve maintainability and reduce coupling between controllers.

Scheme *runtime.Scheme
Config *rest.Config
Clientset kubernetes.Interface
Recorder record.EventRecorder
Copilot AI commented Feb 2, 2026

The Recorder field should have a documentation comment explaining its purpose for emitting Kubernetes events, especially since it's used for PV retention warnings during deletion.

Suggested change
- Recorder record.EventRecorder
+ // Recorder emits Kubernetes events for this controller, including PV retention warnings during deletion.
+ Recorder record.EventRecorder

Comment on lines +141 to +142
Name: pvcName,
Kind: "PersistentVolumeClaim",
Copilot AI commented Feb 2, 2026

Add a comment explaining why APIGroup is set to an empty string for PersistentVolumeClaim. This is because PVC is a core Kubernetes resource (in the "" API group), and this clarification would help future maintainers.

Suggested change
- Name: pvcName,
- Kind: "PersistentVolumeClaim",
+ Name: pvcName,
+ Kind: "PersistentVolumeClaim",
+ // PersistentVolumeClaim is a core Kubernetes resource and therefore uses the empty ("") API group.

Comment on lines +559 to +560
// Create a new recorder to ensure it's empty
localRecorder := record.NewFakeRecorder(10)
Copilot AI commented Feb 2, 2026

The comment states the recorder is created 'to ensure it's empty', but the actual validation on line 572 uses Consistently which doesn't verify emptiness but rather that no events are received during a time window. Consider clarifying the comment to match the actual test behavior: 'Create a dedicated recorder to verify no events are emitted'.

  - ReadWriteOnce
resources:
  requests:
    storage: 10Gi # Must match the PV capacity
Copilot AI commented Feb 2, 2026

The comment says storage must match PV capacity, but this requirement could be more flexible in practice - Kubernetes allows requesting less storage than the PV provides. Consider clarifying whether this is a hard requirement for this recovery scenario or if it's a best practice recommendation.
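
As a concrete illustration of the flexibility described above, a recovery PVC can statically bind to a retained PV by naming it in volumeName while requesting no more than the PV's capacity (all names below are hypothetical):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: retained-data-pvc
spec:
  volumeName: pvc-0123abcd-example   # name of the retained PV (hypothetical)
  storageClassName: standard         # must match the retained PV's class
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                  # may be less than the PV's capacity; binding grants the whole PV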

Comment on lines +423 to +424
# Store PV name for later steps using GitHub Actions output (more robust than temp files)
echo "pv_name=$pv_name" >> $GITHUB_OUTPUT
Copilot AI commented Feb 2, 2026

Good use of GitHub Actions outputs instead of temporary files. However, consider adding error handling to verify that pv_name is not empty before writing to GITHUB_OUTPUT to prevent silent failures in subsequent steps.

Comment on lines +403 to +414
# Wait a bit for PV to be deleted (the storage class handles actual deletion)
echo "Waiting for PV to be deleted..."
sleep 30

# Verify PV was deleted (because reclaim policy is Delete)
pv_exists=$(kubectl get pv $pv_name --ignore-not-found 2>/dev/null)
if [ -z "$pv_exists" ]; then
  echo "✓ PV $pv_name was deleted as expected (Delete policy)"
else
  pv_status=$(kubectl get pv $pv_name -o jsonpath='{.status.phase}')
  echo "⚠️ PV $pv_name still exists with status: $pv_status"
  echo "Note: PV deletion depends on the storage provisioner. The reclaim policy was correctly set to Delete."
Copilot AI commented Feb 2, 2026

The test treats PV persistence after deletion as a warning rather than a failure. This could mask real issues with the Delete policy implementation. Consider adding a timeout with eventual consistency check or making this a failure if the PV still exists after a reasonable wait period (e.g., 60 seconds).

Suggested change
- # Wait a bit for PV to be deleted (the storage class handles actual deletion)
- echo "Waiting for PV to be deleted..."
- sleep 30
- # Verify PV was deleted (because reclaim policy is Delete)
- pv_exists=$(kubectl get pv $pv_name --ignore-not-found 2>/dev/null)
- if [ -z "$pv_exists" ]; then
-   echo "✓ PV $pv_name was deleted as expected (Delete policy)"
- else
-   pv_status=$(kubectl get pv $pv_name -o jsonpath='{.status.phase}')
-   echo "⚠️ PV $pv_name still exists with status: $pv_status"
-   echo "Note: PV deletion depends on the storage provisioner. The reclaim policy was correctly set to Delete."
+ # Wait (with timeout) for PV to be deleted (the storage class handles actual deletion)
+ echo "Waiting for PV $pv_name to be deleted (up to 60 seconds)..."
+ MAX_PV_WAIT_SECONDS=60
+ PV_POLL_INTERVAL=5
+ ELAPSED=0
+ while [ $ELAPSED -lt $MAX_PV_WAIT_SECONDS ]; do
+   pv_exists=$(kubectl get pv $pv_name --ignore-not-found 2>/dev/null)
+   if [ -z "$pv_exists" ]; then
+     echo "✓ PV $pv_name was deleted as expected (Delete policy)"
+     break
+   fi
+   echo "PV $pv_name still exists. Waiting..."
+   sleep $PV_POLL_INTERVAL
+   ELAPSED=$((ELAPSED + PV_POLL_INTERVAL))
+ done
+ # After waiting, fail the test if the PV still exists
+ pv_exists=$(kubectl get pv $pv_name --ignore-not-found 2>/dev/null)
+ if [ -n "$pv_exists" ]; then
+   pv_status=$(kubectl get pv $pv_name -o jsonpath='{.status.phase}')
+   echo "❌ PV $pv_name still exists after ${MAX_PV_WAIT_SECONDS}s with status: $pv_status"
+   echo "PV should be deleted when reclaim policy is Delete. Investigate storage provisioner or policy configuration."
+   exit 1
