Skip to content

Conversation

@WentingWu666666
Copy link
Collaborator

@WentingWu666666 WentingWu666666 commented Jan 13, 2026

Summary

This PR introduces DocumentDB upgrade capabilities by adding a configurable PostgresImage field to the DocumentDB CRD and enabling ImageVolume-based extension loading through CNPG.

Key Changes

CRD Updates

  • Added postgresImage field to DocumentDB spec for configurable postgres image selection
  • Added documentDBVersion status field to track the current version

CNPG Integration

  • Upgraded CNPG to support ImageVolume extensions
  • Configure LD_LIBRARY_PATH for extension loading
  • Removed hardcoded PostgresUID/GID (105/108) to allow standard postgres images to use their default UID/GID values and avoid initdb: could not look up effective user ID 105: user does not exist errors

CI/CD Updates

  • Upgraded kind to v0.31.0 with Kubernetes v1.35.0 for E2E tests
  • Enabled ImageVolume feature gate in kind cluster configuration
  • Added E2E tests for upgrade scenarios

Testing

  • Unit tests: Added/updated in documentdb_controller_test.go
  • E2E tests: New upgrade test workflow added

@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/upgrade-documentdb branch from 433e80d to 1e6c870 Compare January 13, 2026 02:04
@WentingWu666666 WentingWu666666 marked this pull request as ready for review January 13, 2026 02:04
@WentingWu666666 WentingWu666666 marked this pull request as draft January 13, 2026 02:04
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/upgrade-documentdb branch from 1e6c870 to ed68947 Compare January 13, 2026 02:16
nodeCount: ${{ inputs.node-count }}
instancesPerNode: ${{ inputs.instances-per-node }}
documentDBImage: ghcr.io/microsoft/documentdb/documentdb-local:16
documentDBImage: ghcr.io/guanzhousongmicrosoft/documentdb-pg18-amd64:0.110.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See it's a draft. Just in case, we should not use any personal repo link.

Signed-off-by: wenting <wentingwu@microsoft.com>
Signed-off-by: wenting <wentingwu@microsoft.com>
Signed-off-by: wenting <wentingwu@microsoft.com>
Signed-off-by: wenting <wentingwu@microsoft.com>
Signed-off-by: wenting <wentingwu@microsoft.com>
Signed-off-by: wenting <wentingwu@microsoft.com>
Signed-off-by: wenting <wentingwu@microsoft.com>
Signed-off-by: wenting <wentingwu@microsoft.com>
Signed-off-by: wenting <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/upgrade-documentdb branch from ab04d57 to 1cc45ff Compare January 21, 2026 20:38
Signed-off-by: wenting <wentingwu@microsoft.com>
…stgres images

Signed-off-by: wenting <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/upgrade-documentdb branch from 1ae514c to 35b47ca Compare January 22, 2026 02:07
Signed-off-by: wenting <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/upgrade-documentdb branch from 006724a to a61d520 Compare January 22, 2026 14:29
Signed-off-by: wenting <wentingwu@microsoft.com>
Signed-off-by: wenting <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/upgrade-documentdb branch 5 times, most recently from 34f29f2 to b2836ec Compare January 29, 2026 02:45
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/upgrade-documentdb branch 2 times, most recently from b9d97be to 5144126 Compare January 29, 2026 03:12
Signed-off-by: wenting <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/upgrade-documentdb branch from 5144126 to d6db610 Compare January 29, 2026 15:20
@WentingWu666666 WentingWu666666 changed the title upgrade documentdb extension Add DocumentDB upgrade support with configurable PostgresImage and ImageVolume extensions Jan 29, 2026
@WentingWu666666 WentingWu666666 marked this pull request as ready for review January 29, 2026 17:21
Copilot AI review requested due to automatic review settings January 29, 2026 17:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces DocumentDB upgrade capabilities by adding support for configurable PostgreSQL and DocumentDB extension images, along with automated extension upgrade logic. The changes enable rolling upgrades of DocumentDB clusters with minimal downtime.

Changes:

  • Adds postgresImage field to DocumentDB CRD spec for configurable Postgres base image selection
  • Adds documentDBVersion field to DocumentDB status to track the currently installed extension version
  • Implements automated extension upgrade logic that updates the extension image and runs ALTER EXTENSION when needed
  • Upgrades CNPG dependency from 0.26.1 to 0.27.0 to support ImageVolume-based extension loading
  • Adds E2E upgrade tests that verify version transitions
  • Removes DocumentDB image build from CI pipeline (now using external registry)
  • Migrates certificate tests from testify to Ginkgo/Gomega for consistency

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
operator/src/api/preview/documentdb_types.go Adds PostgresImage spec field and DocumentDBVersion status field to DocumentDB CRD
operator/src/config/crd/bases/documentdb.io_dbs.yaml Updates CRD schema with new postgresImage and documentDBVersion fields
operator/documentdb-helm-chart/crds/documentdb.io_dbs.yaml Mirrors CRD schema changes in Helm chart
operator/src/internal/cnpg/cnpg_cluster.go Configures CNPG cluster with ImageVolume extensions and uses postgresImage from spec
operator/src/internal/controller/documentdb_controller.go Adds upgrade logic with upgradeDocumentDBExtensionIfNeeded and updateExtensionImageIfNeeded functions, plus version parsing
operator/src/internal/controller/documentdb_controller_test.go Comprehensive unit tests for upgrade logic using Ginkgo/Gomega
operator/src/internal/controller/certificate_controller_test.go Extracts certificate tests into separate file, maintaining testify framework
operator/documentdb-helm-chart/Chart.yaml Bumps CNPG dependency from 0.26.1 to 0.27.0
.github/workflows/test-E2E.yml Adds upgrade test scenario that patches image and verifies version transition
.github/workflows/test-integration.yml Updates to use hardcoded DocumentDB image instead of version parameter
.github/workflows/test-backup-and-restore.yml Removes unused documentdb_version input parameter
.github/workflows/test-build-and-package.yml Removes DocumentDB image build step (now using external images)
.github/actions/setup-test-environment/action.yml Enables ImageVolume feature gate in kind cluster config, upgrades to kind v0.31.0 with K8s v1.35.0

Comment on lines +261 to +265
// Check if documentdb extension needs to be upgraded (image update + ALTER EXTENSION)
if err := r.upgradeDocumentDBExtensionIfNeeded(ctx, currentCnpgCluster, desiredCnpgCluster, documentdb); err != nil {
logger.Error(err, "Failed to upgrade DocumentDB extension")
return ctrl.Result{RequeueAfter: RequeueAfterShort}, nil
}
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a potential race condition here. Between lines 234-259, the documentdb status is updated with cluster phase and connection string. Then at line 262, upgradeDocumentDBExtensionIfNeeded is called which also updates the documentdb status at lines 550-555 and 578-581. If the status was modified by another reconcile loop between these calls, the second update could overwrite changes. Consider refetching the documentdb object before the upgrade check, or better yet, consolidate all status updates to happen once at the end of the reconcile loop.

Copilot uses AI. Check for mistakes.
return ctrl.Result{RequeueAfter: RequeueAfterShort}, nil
}

// Don't reque again unless there is a change
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: "reque" should be "requeue".

Suggested change
// Don't reque again unless there is a change
// Don't requeue again unless there is a change

Copilot uses AI. Check for mistakes.
Comment on lines +577 to +581
// Step 6: Update DocumentDB version in status after upgrade
documentdb.Status.DocumentDBVersion = defaultVersion
if err := r.Status().Update(ctx, documentdb); err != nil {
logger.Error(err, "Failed to update DocumentDB status after extension upgrade")
}
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status update failure should cause the function to return an error rather than just logging it. If the status update fails after successfully upgrading the extension, the operator will lose track of the current version and may attempt to upgrade again unnecessarily, or the status will be inconsistent with reality.

Copilot uses AI. Check for mistakes.
spec := cnpgv1.ClusterSpec{
Instances: documentdb.Spec.InstancesPerNode,
ImageName: documentdb_image,
ImageName: documentdb.Spec.PostgresImage,
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a defensive default value check for PostgresImage to match the pattern used elsewhere in the codebase. While the kubebuilder default should handle this, adding a runtime check like cmp.Or(documentdb.Spec.PostgresImage, "ghcr.io/cloudnative-pg/postgresql:18-standard-bookworm") on line 59 would provide additional safety against potential edge cases where the default might not be applied (e.g., when migrating from older CRD versions).

Suggested change
ImageName: documentdb.Spec.PostgresImage,
ImageName: cmp.Or(documentdb.Spec.PostgresImage, "ghcr.io/cloudnative-pg/postgresql:18-standard-bookworm"),

Copilot uses AI. Check for mistakes.
Comment on lines +180 to +182
# Extract expected version from image tag (e.g., "0.110.0" -> "0.110-0")
NEW_VERSION_TAG="${NEW_IMAGE##*:}"
EXPECTED_VERSION=$(echo "$NEW_VERSION_TAG" | sed 's/\.\([^.]*\)$/-\1/')
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version parsing logic using sed 's/\.\([^.]*\)$/-\1/' assumes a specific format where the last dot-separated component should be converted to a hyphen-separated component (e.g., "0.110.0" -> "0.110-0"). This is fragile and may not work correctly for all version formats. For example, if the image tag includes a pre-release version like "0.110.0-beta", the parsing would produce "0.110.0-beta" (unchanged), not matching the expected format. Consider documenting the expected version format or making the parsing more robust.

Suggested change
# Extract expected version from image tag (e.g., "0.110.0" -> "0.110-0")
NEW_VERSION_TAG="${NEW_IMAGE##*:}"
EXPECTED_VERSION=$(echo "$NEW_VERSION_TAG" | sed 's/\.\([^.]*\)$/-\1/')
# Extract expected version from image tag:
# - For semver-like tags: "0.110.0" -> "0.110-0"
# - With pre-release: "0.110.0-beta" -> "0.110-0-beta"
NEW_VERSION_TAG="${NEW_IMAGE##*:}"
if [[ "$NEW_VERSION_TAG" =~ ^([0-9]+\.[0-9]+)\.([0-9]+)(-.+)?$ ]]; then
EXPECTED_VERSION="${BASH_REMATCH[1]}-${BASH_REMATCH[2]}${BASH_REMATCH[3]}"
else
echo "⚠️ Unable to parse version from image tag '$NEW_VERSION_TAG'; using tag as-is for comparison"
EXPECTED_VERSION="$NEW_VERSION_TAG"
fi

Copilot uses AI. Check for mistakes.
Comment on lines +552 to +553
if err := r.Status().Update(ctx, documentdb); err != nil {
logger.Error(err, "Failed to update DocumentDB status with extension version")
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status update at line 552 should not be silently ignored if it fails. When the status update fails, the operator should requeue to ensure the version is eventually recorded. Currently, the error is only logged but the function continues, which could lead to inconsistent state where the extension version in the database differs from what's recorded in the status.

Suggested change
if err := r.Status().Update(ctx, documentdb); err != nil {
logger.Error(err, "Failed to update DocumentDB status with extension version")
if err := r.Status().Update(ctx, documentdb); err != nil {
err = fmt.Errorf("failed to update DocumentDB status with extension version: %w", err)
logger.Error(err, "Failed to update DocumentDB status with extension version")
return err

Copilot uses AI. Check for mistakes.
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/upgrade-documentdb branch from 13fad47 to d6db610 Compare January 29, 2026 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants