Skip to content

pg_rewind during failover might have permission issues #223

@xgerman

Description

@xgerman

I ran AI to perform a test:

Summary

   I completed the multi-region Azure deployment workflow:

   ┌──────────────────────────────────────┬────────┬──────────────────────────────────────────────────────────────────────────────────┐
   │ Step                                 │ Status │ Notes                                                                            │
   ├──────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────┤
   │ 1. Deploy AKS Fleet                  │ ✅     │ Created 3 AKS clusters (eastus2, uksouth, westus3) with KubeFleet                │
   ├──────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────┤
   │ 2. Install cert-manager              │ ✅     │ Installed on all member clusters                                                 │
   ├──────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────┤
   │ 3. Install DocumentDB operator       │ ✅     │ Helm chart deployed to all 3 clusters                                            │
   ├──────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────┤
   │ 4. Deploy multi-region DocumentDB    │ ✅     │ 3-node cluster with HA deployed (experienced failover during demo)               │
   ├──────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────┤
   │ 5. Test connection                   │ ✅     │ Connected successfully, but writes timed out due to cluster failover             │
   ├──────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────┤
   │ 6. Clean up demo resources           │ ✅     │ Deleted namespaces on all clusters                                               │
   ├──────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────────────────┤
   │ 7. Clean up operators/infrastructure │ ✅     │ Uninstalled operators, cert-manager, CRDs, and initiated resource group deletion │
   └──────────────────────────────────────┴────────┴──────────────────────────────────────────────────────────────────────────────────┘

   Key observations:

     - The multi-region deployment worked, creating DocumentDB clusters across 3 Azure regions
     - A pod restart triggered a CNPG failover, causing temporary write unavailability (expected HA behavior)
     - The pg_rewind operation failed due to a permission issue with documentdb_core schema - this may be a known issue
     - All Azure resources are being deleted (resource group deletion is async)

When chatting with @alaye-ms about the result he suggested:
Sometimes it prints that error before the grant runs
It probably latched on to that since once the GRANT does run, it doesn't print anything
At least that would be my guess, could be a real issue

We need to:

  • make the output nicer (this error could confuse a real user, too, - not just AI)
  • validate it's not a real bug (there was another fialover which seemed to have worked so the misleading printout theory is strong)
  • if it's a real ssue, fix it

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions