Skip to content

Conversation

@tsivaprasad
Copy link
Contributor

@tsivaprasad tsivaprasad commented Dec 3, 2025

Summary

This PR implements automatic etcd mode reconfiguration of Control Plane host's etcd mode between server and client after initialization.

Changes

Automatic Reconfiguration

  • Mode Detection: Automatically detects when PGEDGE_ETCD_MODE environment variable changes
  • Server → Client: Safely demotes etcd server to client mode, removing from cluster membership
  • Client → Server: Promotes client to server mode, joining the etcd cluster with proper credentials

Testing

Manually validated:

  • Promoting a client host to server mode
1. Created a cluster with 6 nodes (3 etcd servers, 3 etcd clients) and verified that all nodes come up healthy. verified that the servers are listed correctly in the etcd cluster.
+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
|  7322dc2897edcd9 | started | host-3 | https://192.168.64.2:2580 | https://192.168.64.2:2579 |      false |
| 6c021e21fc66bdea | started | host-2 | https://192.168.64.2:2480 | https://192.168.64.2:2479 |      false |
| b71f75320dc06a6c | started | host-1 | https://192.168.64.2:2380 | https://192.168.64.2:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+
2. Modified the compose file with host-4 as server with different ports
3. using command WORKSPACE_DIR=/Users/sivat/projects/control-plane/control-plane docker compose -f ./docker/control-plane-dev/docker-compose.yaml up -d host-4
4. Verified that host-4 comes up healthy and joins the etcd cluster
+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
|  7322dc2897edcd9 | started | host-3 | https://192.168.64.2:2580 | https://192.168.64.2:2579 |      false |
| 3a97d6e09511ba61 | started | host-4 | https://192.168.64.2:2680 | https://192.168.64.2:2679 |      false |
| 6c021e21fc66bdea | started | host-2 | https://192.168.64.2:2480 | https://192.168.64.2:2479 |      false |
| b71f75320dc06a6c | started | host-1 | https://192.168.64.2:2380 | https://192.168.64.2:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

Respective Logs:

host-4-1 has been recreated
host-4-1  | 1:22PM INF got shutdown signal
host-4-1  | 1:22PM INF attempting to gracefully shut down
host-4-1  | 1:22PM INF shutting down scheduler service component=scheduler_service
host-4-1  | 1:22PM INF closing scheduled job watch component=scheduler_service
host-4-1  | 1:22PM INF shutting down instance monitors
host-4-1 exited with code 0
host-4-1  | 1:22PM INF checking etcd mode for reconfiguration modes_equal=false new_mode=server old_mode=client old_mode_empty=false
host-4-1  | 1:22PM INF etcd started as learner
host-4-1  | 1:22PM INF attempting to promote from learner to voting cluster member
host-4-1  | 1:22PM INF promotion successful
host-4-1  | 1:22PM INF waiting for cluster to be healthy
host-4-1  | 1:22PM INF starting scheduler service component=scheduler_service
host-4-1  | 1:22PM INF starting http server component=api_server host_port=0.0.0.0:3003
  • Demoting a server host to client mode
1. Rolled back the changes by bringing down host-4 and modifying the compose file to set host-4 as client
2. using command WORKSPACE_DIR=/Users/sivat/projects/control-plane/control-plane docker compose -f ./docker/control-plane-dev/docker-compose.yaml up -d host-4
3. Verified that host-4 comes up healthy and its in client mode
+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
|  7322dc2897edcd9 | started | host-3 | https://192.168.64.2:2580 | https://192.168.64.2:2579 |      false |
| 6c021e21fc66bdea | started | host-2 | https://192.168.64.2:2480 | https://192.168.64.2:2479 |      false |
| b71f75320dc06a6c | started | host-1 | https://192.168.64.2:2380 | https://192.168.64.2:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

Respective Logs:

host-4-1 has been recreated
host-4-1  | 1:24PM INF got shutdown signal
host-4-1  | 1:24PM INF attempting to gracefully shut down
host-4-1  | 1:24PM INF shutting down scheduler service component=scheduler_service
host-4-1  | 1:24PM INF closing scheduled job watch component=scheduler_service
host-4-1  | 1:24PM INF shutting down instance monitors
host-4-1 exited with code 0
host-4-1  | 1:24PM INF checking etcd mode for reconfiguration modes_equal=false new_mode=client old_mode=server old_mode_empty=false
host-4-1  | 1:24PM INF starting http server component=api_server host_port=0.0.0.0:3003

PLAT-315

Summary by CodeRabbit

  • New Features

    • Automatic etcd mode reconfiguration: seamless client↔server transitions with credential migration, cluster membership updates, configuration updates, data cleanup, and service handover to minimize disruption.
  • Improvements

    • More robust etcd lifecycle handling (safer start/shutdown and clearer error contexts).
    • Consolidated credential provisioning for cleaner, safer server credential management.
  • Documentation

    • Added unreleased changelog entry noting the new automatic reconfiguration feature.

✏️ Tip: You can customize this high-level summary in your review settings.

@tsivaprasad tsivaprasad requested a review from Copilot December 3, 2025 17:47
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds comprehensive documentation for reconfiguring etcd mode on Control Plane hosts, enabling operators to safely change hosts between server and client modes after initial deployment. The documentation includes detailed procedures, troubleshooting guidance, and best practices based on manual testing in a development environment.

Key Changes

  • Added complete step-by-step procedures for promoting clients to servers and demoting servers to clients
  • Documented environment variable configuration requirements and container restart workflows
  • Fixed a bug in rbac.go to ensure EtcdMode is persisted when writing host credentials

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
docs/using/etcd-reconfiguration.md New documentation file covering etcd mode reconfiguration procedures, prerequisites, troubleshooting, and best practices
server/internal/etcd/rbac.go Added missing EtcdMode assignment to ensure mode is preserved in generated config

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tsivaprasad tsivaprasad marked this pull request as draft December 4, 2025 14:44
@tsivaprasad tsivaprasad force-pushed the PLAT-315-ability-to-reconfigure-a-hosts-etcd-configuration branch 4 times, most recently from 6b6efb0 to 4024167 Compare January 1, 2026 17:25
@tsivaprasad tsivaprasad marked this pull request as ready for review January 1, 2026 17:54
@tsivaprasad tsivaprasad requested a review from Copilot January 5, 2026 11:44
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@coderabbitai
Copy link

coderabbitai bot commented Jan 9, 2026

📝 Walkthrough

Walkthrough

Adds automatic etcd mode reconfiguration: new ChangeMode on EmbeddedEtcd and RemoteEtcd, Etcd interface extension, provider logic to detect and perform server↔client transitions with config and credential updates, and refactors certificate handling to support mode switches. (≤50 words)

Changes

Cohort / File(s) Summary
Mode transition implementations
server/internal/etcd/embedded.go, server/internal/etcd/remote.go
Added ChangeMode(ctx, mode) to EmbeddedEtcd and RemoteEtcd to perform server→client and client→server transitions respectively: start/shutdown flows, cluster member operations, generated-config updates, credential handling, data directory removal/creation, and detailed error wrapping.
Interface & provider orchestration
server/internal/etcd/interface.go, server/internal/etcd/provide.go
Extended Etcd interface with ChangeMode. Added newEtcdForMode and enhanced provideEtcd to detect mode changes, invoke ChangeMode with timeouts for transitions, and persist generated configuration when appropriate.
Credentials & RBAC refactor
server/internal/etcd/rbac.go
Extracted server certificate creation into helper addEtcdServerCredentials; updated credential writing to propagate EtcdMode, EtcdClient, and EtcdServer into generated config.
Changelog
changes/unreleased/Added-20260109-223536.yaml
Added unreleased changelog entry documenting automatic etcd client↔server mode reconfiguration.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Prov as Provider
    participant Emb as EmbeddedEtcd
    participant Cluster as Etcd Cluster
    participant Rem as RemoteEtcd

    App->>Prov: provideEtcd(requested: Client)
    Prov->>Prov: detect Server→Client transition
    Prov->>Emb: ChangeMode(ctx, Client)
    Emb->>Cluster: ListMembers()
    Cluster-->>Emb: members
    Emb->>Emb: update generated config (client endpoints)
    Emb->>Emb: shutdown embedded instance
    Emb->>Cluster: RemoveMember(self)
    Cluster-->>Emb: confirm removal
    Emb->>Rem: NewRemoteEtcd(endpoints)
    Rem->>Rem: Start remote client
    Rem-->>Prov: return RemoteEtcd
    Prov-->>App: Etcd (client mode)
Loading
sequenceDiagram
    participant App as Application
    participant Prov as Provider
    participant Rem as RemoteEtcd
    participant Cluster as Etcd Cluster
    participant Emb as EmbeddedEtcd

    App->>Prov: provideEtcd(requested: Server)
    Prov->>Prov: detect Client→Server transition
    Prov->>Rem: ChangeMode(ctx, Server)
    Rem->>Rem: start remote client (if needed)
    Rem->>Cluster: GetClusterLeader()
    Cluster-->>Rem: leader endpoint
    Rem->>Rem: build HostCredentials & addEtcdServerCredentials
    Rem->>Rem: shutdown remote client
    Rem->>Emb: NewEmbeddedEtcd(join via leader, creds)
    Emb->>Cluster: Join cluster
    Cluster-->>Emb: joined
    Emb->>Emb: update generated config to Server mode
    Emb-->>Prov: return EmbeddedEtcd
    Prov-->>App: Etcd (server mode)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nudge the cluster, swap my gear with glee,

From burrowed server to roaming client free.
I tuck the keys, rewrite the map,
Join friends or leave the gap.
A rabbit hops — modes now agree. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main feature: enabling etcd mode reconfiguration between server and client modes.
Description check ✅ Passed The PR description follows the template with Summary, Changes, Testing sections completed. All required checklist items are addressed: testing is documented, issue PLAT-315 is linked, and a changelog entry was added.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (9)
docs/using/etcd-reconfiguration.md (6)

44-44: Clarify "no configuration needed" — updating the environment variable is still a configuration change.

Line 44 states "No manual API calls or configuration needed!" but this slightly overstates the automation. Users must still update PGEDGE_ETCD_MODE in their environment. A more accurate phrasing would emphasize that the cluster operations (join, membership updates, credentials) are automatic, not that configuration is unnecessary.

Suggested revision: "No manual etcd API calls or cluster member operations needed — the Control Plane handles everything automatically!"


106-107: Expand RBAC troubleshooting guidance.

The "Permission denied" remedy mentions checking RBAC is enabled but doesn't explain how to verify this or what corrective steps to take. Consider adding:

  • How to check if RBAC is enabled on the cluster
  • Whether RBAC misconfiguration is the primary cause or a secondary check
  • What additional logs to inspect if credential provisioning fails

113-114: Clarify the behavior when membership removal fails during demotion.

The statement "System continues transition even if removal fails" is ambiguous. This suggests the host may transition to client mode while still being a voting member of the cluster — a potentially problematic state. Clarify:

  • Does the host become a client-mode node without cluster membership (desired outcome)?
  • Does the host remain a voting member but run in client mode (problematic)?
  • What does "continues transition" mean operationally?
  • Should users manually remove the member via etcdctl if automatic removal fails?

132-132: Clarify expected etcdctl member list status values.

The statement "All members should show STATUS=started" may be imprecise. The etcdctl member list command output format and status values can vary. Provide:

  • The exact expected output format or example
  • A list of acceptable status values (e.g., started, healthy, etc.)
  • What to do if members show a different status (e.g., unstarted)

1-145: Reduce exclamation marks for a more professional tone.

The documentation uses 7 exclamation marks, which is flagged as excessive by style analysis. For technical documentation, a more neutral and measured tone enhances credibility. Consider removing or replacing exclamation marks, especially in:

  • Line 44: "No manual API calls or configuration needed!"
  • Line 145: Summary closing (if it contains an exclamation)

Replace with more neutral punctuation (periods, em-dashes) or rephrase to emphasize facts rather than enthusiasm.


135-145: Consider adding safety and edge-case guidance.

The documentation focuses on the happy path but omits important safety considerations:

  • Backup recommendations: Should users back up etcd data before reconfiguring? If so, how?
  • Network failures: What happens if a host loses connectivity during the reconfiguration process? Can it recover gracefully?
  • Data handling: When promoting a client to server, what happens to any existing etcd data? Does it start fresh or reuse previous data?
  • Rollback procedures: If reconfiguration fails, how should users recover or revert?

Consider adding a "Safety and Backups" section addressing these scenarios, especially for production environments where these edge cases are critical.

server/internal/etcd/provide.go (1)

68-70: Context created unconditionally but only used in transition cases.

The 60-second timeout context is created for all code paths, but it's only necessary for ChangeMode calls. While not incorrect (the defer ensures cleanup), consider creating the context only within the transition cases to make intent clearer.

server/internal/etcd/embedded.go (1)

521-521: Return statement may use stale error value.

The err variable here is from Line 501 (remote.GetClient()). If that succeeded and all subsequent operations also succeeded, err would still be nil, which is correct. However, for clarity and to avoid confusion with shadowed or stale error variables, consider returning nil explicitly.

♻️ Suggested fix
-	return remote, err
+	return remote, nil
server/internal/host/host.go (1)

45-45: Consider adding port range validation.

While the current implementation is functional, adding validation to ensure HTTPPort falls within the valid range (1-65535) would improve robustness. This could be implemented at the configuration loading stage or in the service layer before persistence.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f218ca and 76d0185.

📒 Files selected for processing (9)
  • docs/using/etcd-reconfiguration.md
  • server/internal/etcd/embedded.go
  • server/internal/etcd/interface.go
  • server/internal/etcd/provide.go
  • server/internal/etcd/rbac.go
  • server/internal/etcd/remote.go
  • server/internal/host/host.go
  • server/internal/host/host_store.go
  • server/internal/host/service.go
🧰 Additional context used
🧬 Code graph analysis (5)
server/internal/host/service.go (1)
server/internal/config/config.go (1)
  • HTTP (112-121)
server/internal/etcd/rbac.go (2)
server/internal/config/config.go (3)
  • EtcdMode (186-186)
  • EtcdClient (163-166)
  • EtcdServer (143-147)
server/internal/etcd/interface.go (1)
  • HostCredentials (30-38)
server/internal/etcd/provide.go (5)
server/internal/config/config.go (4)
  • EtcdMode (186-186)
  • EtcdModeServer (189-189)
  • EtcdModeClient (190-190)
  • Config (193-215)
server/internal/config/manager.go (1)
  • Manager (15-20)
server/internal/etcd/interface.go (1)
  • Etcd (40-55)
server/internal/etcd/embedded.go (1)
  • NewEmbeddedEtcd (44-50)
server/internal/etcd/remote.go (1)
  • NewRemoteEtcd (34-41)
server/internal/etcd/interface.go (1)
server/internal/config/config.go (1)
  • EtcdMode (186-186)
server/internal/etcd/embedded.go (4)
server/internal/config/config.go (6)
  • EtcdMode (186-186)
  • EtcdModeClient (190-190)
  • EtcdModeServer (189-189)
  • Config (193-215)
  • EtcdClient (163-166)
  • EtcdServer (143-147)
server/internal/etcd/interface.go (1)
  • Etcd (40-55)
server/internal/etcd/remote.go (1)
  • NewRemoteEtcd (34-41)
server/internal/etcd/membership.go (1)
  • RemoveMember (34-52)
🪛 LanguageTool
docs/using/etcd-reconfiguration.md

[style] ~73-~73: Using many exclamation marks might seem excessive (in this case: 7 exclamation marks for a text that’s 2568 characters long)
Context: ...a Server to Client (Example - host-4) !!! warning "Quorum Check" Ensure at lea...

(EN_EXCESSIVE_EXCLAMATION)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (22)
server/internal/etcd/rbac.go (3)

54-56: LGTM — Good refactor to centralize server credential creation.

Extracting the server certificate creation into addEtcdServerCredentials reduces duplication and makes the code more maintainable, especially given this logic is now reused in RemoteEtcd.ChangeMode.


347-350: LGTM — Config propagation aligns with mode reconfiguration flow.

Propagating EtcdMode, EtcdClient, and EtcdServer from app config to generated config ensures configuration consistency during host credential writes.


358-381: LGTM — Helper function is well-structured.

The addEtcdServerCredentials helper correctly creates server certificates and assigns them to the credentials struct. Error handling is appropriate with contextual wrapping.

server/internal/etcd/interface.go (1)

54-54: LGTM — Interface extension is well-designed.

Returning (Etcd, error) is the correct approach since mode transitions switch between EmbeddedEtcd and RemoteEtcd implementations.

server/internal/etcd/provide.go (3)

32-42: LGTM — Clean factory function.

The newEtcdForMode helper provides a clear mapping from mode to implementation with appropriate error handling for invalid modes.


55-66: LGTM — Good diagnostic logging for mode transitions.

Logging both old and new modes with boolean flags helps with debugging reconfiguration issues.


72-100: LGTM — Mode transition logic is comprehensive.

The switch statement correctly handles:

  • Initial setup or no-change case (persisting config if initialized)
  • Server → Client transition via EmbeddedEtcd.ChangeMode
  • Client → Server transition via RemoteEtcd.ChangeMode
  • Explicit error for unsupported transitions
server/internal/etcd/remote.go (5)

254-257: LGTM — Mode validation is correct.

Correctly rejects transitions to any mode other than EtcdModeServer when called on a RemoteEtcd instance.


259-261: Starting remote etcd ensures connectivity before credential operations.

Good defensive approach to ensure the remote client is connected and cert service is initialized before attempting the transition.


270-280: Credentials construction reuses existing client certs.

The credentials are correctly assembled from existing client credentials plus newly-created server credentials via addEtcdServerCredentials.


292-303: Verify cluster rejoin behavior if embedded join fails after shutdown.

After r.Shutdown() closes the remote client, the subsequent embedded.Join() must succeed for the host to remain functional. If join fails, the host would be left in an inconsistent state (remote shut down, embedded not started).

Consider whether a rollback mechanism or additional error handling is needed for production resilience.


305-311: Config update after successful join is appropriate.

Updating the generated config only after successful join ensures configuration consistency.

server/internal/etcd/embedded.go (5)

56-58: LGTM — Idempotent start guard.

Returning early if already started prevents double-initialization issues and supports the mode change flow where Start may be called on an already-running instance.


299-299: LGTM — Nil assignment enables restart capability.

Setting e.etcd = nil after close allows the instance to be started again if needed, which aligns with the idempotent start guard.


451-454: LGTM — Mode validation is correct.

Correctly rejects transitions to any mode other than EtcdModeClient when called on an EmbeddedEtcd instance.


473-484: Good validation — prevents demotion when no other members exist.

Checking for remaining endpoints before proceeding prevents leaving the host in an unrecoverable state.


492-508: Transition order is correct for safety.

The sequence (shutdown embedded → start remote → remove member) ensures the remote client can communicate with remaining cluster members to perform the removal.

server/internal/host/service.go (1)

47-47: LGTM! HTTPPort initialization is consistent with the pattern.

The addition of HTTPPort from the configuration aligns with the other host fields being initialized in this struct literal.

server/internal/host/host.go (3)

45-45: HTTPPort field addition is correctly positioned.

The new field integrates cleanly into the Host struct and maintains consistency with the storage layer.


97-97: Round-trip conversion of HTTPPort is correct.

The fromStorage mapping properly extracts HTTPPort from the stored representation.


146-146: toStorage conversion correctly includes HTTPPort.

The symmetric mapping ensures HTTPPort persists correctly. The conversions in lines 97 and 146 form a proper round-trip.

server/internal/host/host_store.go (1)

24-24: HTTPPort field implementation is correct and properly initialized.

The field follows the established pattern and naming conventions. HTTPPort is always initialized from configuration (s.cfg.HTTP.Port) in service.go and is not exposed in API responses, so downstream API consumers will not encounter zero or missing values. No verification needed.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @docs/using/etcd-reconfiguration.md:
- Around line 30-42: Update the "How It Works" / Etcd mode reconfiguration text
to include expected downtime and user-facing disruption details: add a short
paragraph after the numbered steps that explains whether service connections are
dropped during Stop/Restart, approximate duration of typical transitions (e.g.,
seconds to a few minutes), impact on cluster quorum and write availability while
a node transitions modes, and any recommended operator actions (e.g., schedule
maintenance window, drain connections before changing PGEDGE_ETCD_MODE, monitor
logs/health). Reference PGEDGE_ETCD_MODE and the existing "Client→Server" /
"Server→Client" bullets so readers know the guidance applies to that
reconfiguration flow.
- Around line 54-59: Replace the incorrect service name literal "host-4" with
the actual service name "control-plane-host-4" in the docker-compose/docker
commands shown (e.g., change "docker-compose up -d host-4" and any other
commands like "docker stop host-4" / "docker logs host-4" referenced in the
demotion procedure) so all occurrences (notably the two instances in the
reconfiguration and demotion steps) consistently use "control-plane-host-4".
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76d0185 and b4c2bb7.

📒 Files selected for processing (1)
  • docs/using/etcd-reconfiguration.md
🧰 Additional context used
🪛 LanguageTool
docs/using/etcd-reconfiguration.md

[style] ~69-~69: Using many exclamation marks might seem excessive (in this case: 6 exclamation marks for a text that’s 1836 characters long)
Context: ...a Server to Client (Example - host-4) !!! warning "Quorum Check" Ensure at lea...

(EN_EXCESSIVE_EXCLAMATION)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis

@tsivaprasad tsivaprasad force-pushed the PLAT-315-ability-to-reconfigure-a-hosts-etcd-configuration branch from b4c2bb7 to 0e26eed Compare January 9, 2026 13:13
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @docs/using/etcd-reconfiguration.md:
- Around line 97-102: The docs incorrectly state that `etcdctl member list`
shows `STATUS=started`; change the example to use `etcdctl endpoint health
--cluster` (replace the `docker exec control-plane-host-1 etcdctl member list`
line with `docker exec control-plane-host-1 etcdctl endpoint health --cluster`)
and update the expected text from "All members should show `STATUS=started`" to
"All endpoints should show `is healthy: successfully committed proposal`" so the
documented command and expected output match etcd's actual output.

In @server/internal/etcd/remote.go:
- Around line 292-303: The code calls r.Shutdown() then attempts to start an
embedded server with NewEmbeddedEtcd(...).Join(...); if Join fails the process
loses its remote client and may be left without any etcd connection. Modify the
flow around r.Shutdown, NewEmbeddedEtcd, and embedded.Join so you either (a)
attempt to join the embedded server before permanently shutting down the remote
client or (b) if Join fails, automatically attempt to recover by restarting the
remote client (or rolling back by re-initializing r) and return a richer error;
include clear context in the returned error (mention r.Shutdown, embedded.Join
and JoinOptions) and log recovery attempts so operators can diagnose and recover
from failures.
🧹 Nitpick comments (1)
server/internal/etcd/provide.go (1)

72-91: Consider limiting config persistence to first initialization only.

When oldMode == newMode, the config update on lines 82-89 writes the same values that are already persisted. This could be optimized to only persist when oldMode == "" (first initialization).

♻️ Suggested optimization
 	switch {
 	case oldMode == "" || oldMode == newMode:
 		etcd, err := newEtcdForMode(newMode, cfg, logger)
 		if err != nil {
 			return nil, err
 		}
 		initialized, err := etcd.IsInitialized()
 		if err != nil {
 			return nil, err
 		}
-		if initialized {
+		if initialized && oldMode == "" {
 			generated.EtcdMode = appCfg.EtcdMode
 			generated.EtcdServer = appCfg.EtcdServer
 			generated.EtcdClient = appCfg.EtcdClient
 			if err := cfg.UpdateGeneratedConfig(generated); err != nil {
 				return nil, fmt.Errorf("failed to persist etcd configuration: %w", err)
 			}
 		}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4c2bb7 and 0e26eed.

📒 Files selected for processing (6)
  • docs/using/etcd-reconfiguration.md
  • server/internal/etcd/embedded.go
  • server/internal/etcd/interface.go
  • server/internal/etcd/provide.go
  • server/internal/etcd/rbac.go
  • server/internal/etcd/remote.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • server/internal/etcd/interface.go
  • server/internal/etcd/embedded.go
🧰 Additional context used
🧬 Code graph analysis (1)
server/internal/etcd/rbac.go (2)
server/internal/config/config.go (3)
  • EtcdMode (186-186)
  • EtcdClient (163-166)
  • EtcdServer (143-147)
server/internal/etcd/interface.go (1)
  • HostCredentials (30-38)
🪛 LanguageTool
docs/using/etcd-reconfiguration.md

[style] ~69-~69: Using many exclamation marks might seem excessive (in this case: 6 exclamation marks for a text that’s 1836 characters long)
Context: ...a Server to Client (Example - host-4) !!! warning "Quorum Check" Ensure at lea...

(EN_EXCESSIVE_EXCLAMATION)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (9)
docs/using/etcd-reconfiguration.md (1)

1-115: Documentation is clear and well-structured.

The guide effectively explains the automatic reconfiguration workflow, provides clear procedures with verification steps, and includes important best practices for maintaining cluster quorum. The topology recommendations and warnings are appropriate.

server/internal/etcd/rbac.go (3)

52-57: Clean extraction of server credential creation.

The delegation to addEtcdServerCredentials improves code reuse and aligns with the new ChangeMode flows that also need to create server credentials.


347-350: Propagating etcd configuration to generated config.

This ensures the generated config reflects the current etcd mode and settings, which is essential for detecting mode changes on restart.


358-381: Well-structured credential helper.

The addEtcdServerCredentials function correctly creates the server certificate with appropriate DNS names and IP addresses, and follows the existing error handling patterns.

server/internal/etcd/provide.go (2)

32-42: Clean factory pattern for etcd mode selection.

The newEtcdForMode function provides a clear mapping between configuration modes and their implementations.


92-100: Mode transition logic is correct.

The switch cases correctly instantiate the appropriate etcd implementation based on the old mode (since that's what's currently running) and then invoke ChangeMode to transition to the new mode. The 60-second timeout provides adequate time for cluster operations.

server/internal/etcd/remote.go (3)

254-261: Mode validation and initialization are correct.

The method correctly validates that the only supported transition is from client to server mode, and ensures the remote client is started before proceeding with the mode change.


305-314: Config update correctly clears client settings.

The generated config is properly updated to reflect the new server mode, clearing client endpoints and preserving server settings.


270-276: No action needed—credentials are correctly sourced.

cfg.EtcdUsername and cfg.EtcdPassword are correctly read from cfg.Config(). The Config() method returns a combined configuration that merges user config, generated config, and defaults. When credentials are stored in the generated config (as in rbac.go lines 345–346), UpdateGeneratedConfig() triggers a reload that merges those credentials into the main config. Therefore, credentials set during the initial etcd join are available when ChangeMode calls cfg.Config().

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @clustertest/host_test.go:
- Around line 238-250: The recreated container assigned to h.container isn't
registered with the test cleanup handler so the new container won't be
terminated or have logs captured; extract the existing cleanup logic from
NewHost into a helper method (e.g., host.registerCleanup(t) or
host.registerCleanupForContainer) that closes/terminates the container and
captures logs, then call that helper immediately after setting h.container =
newContainer in the recreate flow (replace the old t.Cleanup reference to the
terminated container with the new registration) and ensure the helper uses
t.Cleanup to register the cleanup for the currently assigned container.
🧹 Nitpick comments (1)
clustertest/host_test.go (1)

261-265: Consider using strings.Cut for cleaner env parsing.

The magic number 17 and manual slicing is fragile. Using strings.Cut improves readability and avoids index errors.

Suggested refactor
+import "strings"
+
 func (h *Host) GetEtcdMode(t testing.TB) string {
 	t.Helper()
 
 	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
 	defer cancel()
 
 	inspect, err := h.container.Inspect(ctx)
 	require.NoError(t, err, "should be able to inspect container")
 
 	for _, env := range inspect.Config.Env {
-		if len(env) > 17 && env[:17] == "PGEDGE_ETCD_MODE=" {
-			return env[17:]
+		if key, value, ok := strings.Cut(env, "="); ok && key == "PGEDGE_ETCD_MODE" {
+			return value
 		}
 	}
 
 	return "server"
 }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ba07dd and d838618.

📒 Files selected for processing (3)
  • clustertest/cluster_test.go
  • clustertest/etcd_mode_change_test.go
  • clustertest/host_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
clustertest/cluster_test.go (1)
api/apiv1/gen/control_plane/service.go (1)
  • Cluster (205-214)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (6)
clustertest/cluster_test.go (1)

97-101: LGTM!

The RefreshClient method follows the same pattern used in Add and Remove, providing a clean way to reinitialize the cluster client after mode mutations without adding/removing hosts.

clustertest/host_test.go (1)

189-209: ExtraEnv from original HostConfig is not preserved.

The original NewHost applies config.ExtraEnv overrides (line 85), but RecreateWithMode doesn't have access to these values. If a host was created with ExtraEnv and then recreated, those environment variables will be lost.

If this is intentional, consider documenting it. Otherwise, store ExtraEnv in the Host struct to preserve it across recreations.

clustertest/etcd_mode_change_test.go (4)

13-55: LGTM!

Well-structured end-to-end test with clear phases: setup, initial verification, action, and final verification. Appropriate use of require for preconditions and assert for post-action checks.


57-98: LGTM!

Consistent with TestPromoteClientToServer, testing the reverse mode transition. Good coverage of both directions.


100-116: LGTM!

Clean helper function. Minor note: the healthy-counting logic is duplicated in waitForHostsHealthy, but this is acceptable in test code for clarity.


118-150: LGTM!

Good polling implementation with appropriate error handling that allows for transient failures during mode transitions. The 5-second interval and configurable timeout are reasonable for integration tests.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @clustertest/host_test.go:
- Around line 199-208: The cleanup method on Host is orphaned and needs to be
invoked during mode transitions; update RecreateWithMode to call h.cleanup(t)
(or h.cleanup if signature adjustment needed) before recreating or restarting
etcd when the mode changes so the etcd data directory is removed; locate the
RecreateWithMode method and insert a call to the existing cleanup function at
the start of the mode-switch code path (or right before any code that assumes a
clean etcd data directory) to ensure old data is cleared.
- Around line 91-99: RecreateWithMode currently removes and recreates the
container but never clears the etcd data directory; call h.cleanup() between
h.removeContainer(t) and h.createContainer(t, newMode, nil) so the etcd data is
wiped before the new instance is created (i.e., change RecreateWithMode to
invoke h.cleanup() after removeContainer and before createContainer to ensure
proper re-initialization during mode transitions).
🧹 Nitpick comments (2)
clustertest/etcd_mode_change_test.go (1)

43-46: Consider consolidating redundant health-wait calls.

Both tests call waitForHostsHealthy(t, cluster, 4, ...) immediately followed by waitForClusterHealthy(t, cluster, ...). When the expected count equals the total number of hosts, these checks are semantically equivalent—both wait until all 4 hosts are healthy.

Consider either:

  1. Removing one of the calls, or
  2. Consolidating into a single helper that provides both the count check and per-host status logging

Also applies to: 89-92

clustertest/host_test.go (1)

235-251: Consider using strings package for clearer intent.

The string slicing logic is correct, but using strings.HasPrefix and strings.TrimPrefix would be more idiomatic and self-documenting:

Suggested refactor
+import "strings"
+
+const etcdModeEnvPrefix = "PGEDGE_ETCD_MODE="
+
 func (h *Host) GetEtcdMode(t testing.TB) string {
 	t.Helper()
 
 	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
 	defer cancel()
 
 	inspect, err := h.container.Inspect(ctx)
 	require.NoError(t, err, "should be able to inspect container")
 
 	for _, env := range inspect.Config.Env {
-		if len(env) > 17 && env[:17] == "PGEDGE_ETCD_MODE=" {
-			return env[17:]
+		if strings.HasPrefix(env, etcdModeEnvPrefix) {
+			return strings.TrimPrefix(env, etcdModeEnvPrefix)
 		}
 	}
 
 	return "server"
 }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d838618 and 320a22c.

📒 Files selected for processing (2)
  • clustertest/etcd_mode_change_test.go
  • clustertest/host_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (9)
clustertest/etcd_mode_change_test.go (4)

13-58: LGTM - Well-structured promotion test.

The test correctly:

  • Sets up initial state with 3 servers and 1 client
  • Uses require for preconditions (initial state) and assert for post-conditions (final state)
  • Verifies both mode change and cluster health after the operation

60-104: LGTM - Demotion test mirrors the promotion test structure.

Good symmetry with the promotion test, covering the reverse operation.


124-156: LGTM - Polling helper is well-implemented.

Correct timeout handling and informative logging during the wait loop.


158-194: LGTM - Cluster health wait helper.

Good addition of per-host status logging on line 174 which aids debugging when hosts fail to become healthy.

clustertest/host_test.go (5)

64-73: LGTM - Host initialization refactored cleanly.

Good separation of concerns with the container creation extracted to a dedicated method.


75-89: LGTM - Simple lifecycle wrappers.

Clean delegation to the underlying container methods.


101-186: LGTM - Port allocation and container creation logic is sound.

Good design choices:

  • Preserving h.port (HTTP port) across mode changes maintains client connectivity
  • Allocating fresh peer/client ports for server mode avoids port conflicts
  • Logging container output on creation failure aids debugging

188-197: LGTM - Cleanup-friendly termination.

Using a dedicated context rather than t.Context() is appropriate here since the test context may already be cancelled during cleanup scenarios.


210-233: LGTM - Cleanup registration handles failure logging appropriately.

Good diagnostic support by logging container output when tests fail.

Comment on lines 91 to 99
func (h *Host) RecreateWithMode(t testing.TB, newMode EtcdMode) {
t.Helper()

tLogf(t, "recreating host %s with etcd mode %s", h.id, newMode)

h.removeContainer(t)
h.createContainer(t, newMode, nil)
h.registerCleanup(t)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing cleanup call before mode change - etcd data directory not cleared.

RecreateWithMode removes and recreates the container but does not invoke h.cleanup() to clear the etcd data directory. The cleanup() method exists at lines 199-208 but is never called anywhere.

For mode transitions to work correctly (server→client demotion or client→server promotion), the etcd data should be cleared so the node re-initializes in its new role. Stale etcd data may cause the new instance to fail or behave unexpectedly.

Proposed fix
 func (h *Host) RecreateWithMode(t testing.TB, newMode EtcdMode) {
 	t.Helper()
 
 	tLogf(t, "recreating host %s with etcd mode %s", h.id, newMode)
 
 	h.removeContainer(t)
+	h.cleanup(t)
 	h.createContainer(t, newMode, nil)
 	h.registerCleanup(t)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (h *Host) RecreateWithMode(t testing.TB, newMode EtcdMode) {
t.Helper()
tLogf(t, "recreating host %s with etcd mode %s", h.id, newMode)
h.removeContainer(t)
h.createContainer(t, newMode, nil)
h.registerCleanup(t)
}
func (h *Host) RecreateWithMode(t testing.TB, newMode EtcdMode) {
t.Helper()
tLogf(t, "recreating host %s with etcd mode %s", h.id, newMode)
h.removeContainer(t)
h.cleanup(t)
h.createContainer(t, newMode, nil)
h.registerCleanup(t)
}
🤖 Prompt for AI Agents
In @clustertest/host_test.go around lines 91 - 99, RecreateWithMode currently
removes and recreates the container but never clears the etcd data directory;
call h.cleanup() between h.removeContainer(t) and h.createContainer(t, newMode,
nil) so the etcd data is wiped before the new instance is created (i.e., change
RecreateWithMode to invoke h.cleanup() after removeContainer and before
createContainer to ensure proper re-initialization during mode transitions).

Comment on lines 199 to 208
func (h *Host) cleanup(t testing.TB) {
t.Helper()

etcdDir := filepath.Join(h.dataDir, "etcd")
if err := os.RemoveAll(etcdDir); err != nil && !os.IsNotExist(err) {
tLogf(t, "warning: failed to remove etcd directory: %s", err)
} else {
tLogf(t, "cleared etcd data directory for mode change")
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Orphaned method - cleanup() is defined but never called.

This method correctly removes the etcd data directory, but it's not invoked anywhere. See the comment on RecreateWithMode above—this should likely be called during mode transitions.

🤖 Prompt for AI Agents
In @clustertest/host_test.go around lines 199 - 208, The cleanup method on Host
is orphaned and needs to be invoked during mode transitions; update
RecreateWithMode to call h.cleanup(t) (or h.cleanup if signature adjustment
needed) before recreating or restarting etcd when the mode changes so the etcd
data directory is removed; locate the RecreateWithMode method and insert a call
to the existing cleanup function at the start of the mode-switch code path (or
right before any code that assumes a clean etcd data directory) to ensure old
data is cleared.

@jason-lynch
Copy link
Member

Could you please remove the .DS_Store file from this PR and add a .gitignore entry for .DS_Store? Then I think this is good to go.

@tsivaprasad tsivaprasad force-pushed the PLAT-315-ability-to-reconfigure-a-hosts-etcd-configuration branch from 5869f51 to 8ce5f10 Compare January 13, 2026 17:17
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @server/internal/etcd/embedded.go:
- Around line 492-512: The demotion path can leave the system inconsistent if
steps after e.Shutdown() fail; add recovery/rollback handling around
NewRemoteEtcd/remote.Start(), remote.GetClient(), RemoveMember(), and
os.RemoveAll() so partial failures restore the previous server state or perform
safe cleanup: specifically, if remote.Start() or remote.GetClient() fail,
attempt to restart the embedded node (call the inverse of e.Shutdown() or
e.Start equivalent) or ensure config reflects server mode; if RemoveMember()
fails, try to stop the remote client and restart embedded etcd (or persist a
clear failure state) so the node remains joinable; always ensure
os.RemoveAll(e.etcdDir()) only runs after successful cluster removal or, on its
failure, leave data dir intact and surface an actionable error; reference
e.Shutdown(), NewRemoteEtcd(), remote.Start(), remote.GetClient(),
RemoveMember(), os.RemoveAll(), e.etcdDir(), and cfg.HostID when implementing
these rollback and state-consistency steps.
- Line 521: The return uses the variable err instead of an explicit nil which is
a code smell; in the function that returns remote and err (the return statement
returning "remote, err") replace the second return value with nil (i.e., return
"remote, nil") after the config update check succeeds so the success path
explicitly returns nil rather than relying on err's current value.

In @server/internal/etcd/remote.go:
- Around line 292-303: The code shuts down the remote etcd (r.Shutdown()) before
attempting embedded.Join, so if Join fails you must recover by restarting or
re-initializing the remote to avoid leaving the system without any etcd; update
the failure path after embedded.Join to try to restart the remote (e.g., call
r.Start() or recreate the RemoteEtcd via the constructor used elsewhere and
start it), capture and combine any restart error with the original join error,
and ensure the function returns a clear error only if both join and restart
recovery fail so the system is not left without a running etcd instance.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 320a22c and 8ce5f10.

📒 Files selected for processing (6)
  • changes/unreleased/Added-20260109-223536.yaml
  • server/internal/etcd/embedded.go
  • server/internal/etcd/interface.go
  • server/internal/etcd/provide.go
  • server/internal/etcd/rbac.go
  • server/internal/etcd/remote.go
🧰 Additional context used
🧬 Code graph analysis (4)
server/internal/etcd/interface.go (1)
server/internal/config/config.go (1)
  • EtcdMode (186-186)
server/internal/etcd/remote.go (4)
server/internal/config/config.go (6)
  • EtcdMode (186-186)
  • EtcdModeServer (189-189)
  • EtcdModeClient (190-190)
  • Config (193-215)
  • EtcdClient (163-166)
  • EtcdServer (143-147)
server/internal/etcd/interface.go (3)
  • Etcd (40-55)
  • HostCredentials (30-38)
  • JoinOptions (18-21)
server/internal/etcd/membership.go (1)
  • GetClusterLeader (62-73)
server/internal/etcd/embedded.go (1)
  • NewEmbeddedEtcd (44-50)
server/internal/etcd/rbac.go (2)
server/internal/config/config.go (3)
  • EtcdMode (186-186)
  • EtcdClient (163-166)
  • EtcdServer (143-147)
server/internal/etcd/interface.go (1)
  • HostCredentials (30-38)
server/internal/etcd/provide.go (5)
server/internal/config/config.go (4)
  • EtcdMode (186-186)
  • EtcdModeServer (189-189)
  • EtcdModeClient (190-190)
  • Config (193-215)
server/internal/config/manager.go (1)
  • Manager (15-20)
server/internal/etcd/interface.go (1)
  • Etcd (40-55)
server/internal/etcd/embedded.go (1)
  • NewEmbeddedEtcd (44-50)
server/internal/etcd/remote.go (1)
  • NewRemoteEtcd (34-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (11)
changes/unreleased/Added-20260109-223536.yaml (1)

1-3: LGTM!

The changelog entry appropriately documents the new automatic etcd mode reconfiguration feature.

server/internal/etcd/rbac.go (3)

52-57: LGTM!

Good refactoring to extract server credential creation into a reusable helper. The error propagation is correctly maintained.


344-356: LGTM!

Properly propagates etcd configuration (mode, client, server settings) from application config to generated config, which is essential for the reconfiguration feature to work correctly.


358-381: LGTM!

The extracted helper function is well-structured with appropriate parameters. The certificate SANs correctly include both local and external hostnames/IPs for proper TLS validation.

server/internal/etcd/interface.go (1)

54-54: LGTM!

The interface extension is well-designed. Returning (Etcd, error) rather than just error correctly accounts for the fact that mode transitions result in a different implementation type (EmbeddedEtcd ↔ RemoteEtcd).

server/internal/etcd/remote.go (1)

305-314: LGTM!

The generated config is correctly updated to reflect server mode, clearing client-specific settings and preserving server configuration.

server/internal/etcd/provide.go (2)

32-42: LGTM!

Clean factory function that centralizes the etcd instance creation based on mode.


68-100: LGTM!

The reconfiguration logic correctly handles:

  1. Fresh installations and restarts (oldMode empty or unchanged)
  2. Server → Client transitions via EmbeddedEtcd.ChangeMode
  3. Client → Server transitions via RemoteEtcd.ChangeMode

The 60-second timeout is appropriate for cluster operations.

server/internal/etcd/embedded.go (3)

56-58: LGTM!

Good idempotency guard to prevent double-starting the embedded etcd server.


297-301: LGTM!

Resetting e.etcd = nil after close correctly reflects the stopped state and enables the idempotency guard in Start() to function properly.


473-484: LGTM!

Correctly collects client endpoints from remaining cluster members, excluding the current host. The validation prevents demotion when this would leave no accessible cluster endpoints.

Comment on lines +492 to +512
if err := e.Shutdown(); err != nil {
return nil, err
}

remote := NewRemoteEtcd(e.cfg, e.logger)
if err := remote.Start(ctx); err != nil {
return nil, fmt.Errorf("failed to start remote client: %w", err)
}

remoteClient, err := remote.GetClient()
if err != nil {
return nil, fmt.Errorf("failed to get remote client: %w", err)
}

if err := RemoveMember(ctx, remoteClient, cfg.HostID); err != nil {
return nil, fmt.Errorf("failed to remove embedded etcd from cluster: %w", err)
}

if err := os.RemoveAll(e.etcdDir()); err != nil {
return nil, fmt.Errorf("failed to remove embedded etcd data dir: %w", err)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Potential inconsistent state on partial failure during demotion.

Similar to the client→server transition, if operations fail after e.Shutdown() (e.g., remote.Start(), RemoveMember(), or os.RemoveAll()), the system may be left in an inconsistent state. Consider adding recovery logic or documenting the manual recovery steps.

For example, if RemoveMember fails, the local embedded etcd is shut down and data directory still exists, but the remote client is running. On restart, the generated config will still show server mode (since config update happens later), potentially causing issues.

🤖 Prompt for AI Agents
In @server/internal/etcd/embedded.go around lines 492 - 512, The demotion path
can leave the system inconsistent if steps after e.Shutdown() fail; add
recovery/rollback handling around NewRemoteEtcd/remote.Start(),
remote.GetClient(), RemoveMember(), and os.RemoveAll() so partial failures
restore the previous server state or perform safe cleanup: specifically, if
remote.Start() or remote.GetClient() fail, attempt to restart the embedded node
(call the inverse of e.Shutdown() or e.Start equivalent) or ensure config
reflects server mode; if RemoveMember() fails, try to stop the remote client and
restart embedded etcd (or persist a clear failure state) so the node remains
joinable; always ensure os.RemoveAll(e.etcdDir()) only runs after successful
cluster removal or, on its failure, leave data dir intact and surface an
actionable error; reference e.Shutdown(), NewRemoteEtcd(), remote.Start(),
remote.GetClient(), RemoveMember(), os.RemoveAll(), e.etcdDir(), and cfg.HostID
when implementing these rollback and state-consistency steps.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
server/internal/etcd/embedded.go (2)

52-69: Start() idempotency guard can hide partial/failed start state

If start(ctx) sets e.etcd and then fails later (e.g., GetClient() / certificateService()), subsequent Start() calls will now return nil at Line 56 even though the instance isn’t actually usable.

Suggested fix (make the guard reflect “fully started”)
 func (e *EmbeddedEtcd) Start(ctx context.Context) error {
   e.mu.Lock()
   defer e.mu.Unlock()

-  if e.etcd != nil {
+  // Only treat as "already started" when the instance is fully usable.
+  if e.etcd != nil && e.client != nil && e.certSvc != nil {
     return nil // already started
   }

289-302: Shutdown() now nils e.etcd; Error() can panic after shutdown

With e.etcd = nil (Line 299), callers that still read e.Error() will hit a nil deref (return e.etcd.Err()).

Suggested fix (nil-safe Error())
 func (e *EmbeddedEtcd) Error() <-chan error {
-  return e.etcd.Err()
+  if e.etcd == nil {
+    ch := make(chan error)
+    close(ch)
+    return ch
+  }
+  return e.etcd.Err()
 }
🤖 Fix all issues with AI agents
In @server/internal/etcd/embedded.go:
- Around line 451-522: In ChangeMode the earlier assignment
generated.EtcdClient.Endpoints = endpoints is clobbered later by
generated.EtcdClient = cfg.EtcdClient (where cfg is the stale snapshot), so
ensure the final generated.EtcdClient contains the computed endpoints; e.g.,
after copying cfg into generated.EtcdClient (or instead of overwriting),
explicitly set generated.EtcdClient.Endpoints = endpoints so the computed
endpoints persist before calling e.cfg.UpdateGeneratedConfig; update code around
the generated.EtcdClient assignments in ChangeMode to preserve endpoints.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ce5f10 and 6d3108b.

📒 Files selected for processing (1)
  • server/internal/etcd/embedded.go
🧰 Additional context used
🧬 Code graph analysis (1)
server/internal/etcd/embedded.go (4)
server/internal/config/config.go (3)
  • EtcdMode (186-186)
  • EtcdModeClient (190-190)
  • EtcdModeServer (189-189)
server/internal/etcd/interface.go (1)
  • Etcd (40-55)
server/internal/etcd/remote.go (1)
  • NewRemoteEtcd (34-41)
server/internal/etcd/membership.go (1)
  • RemoveMember (34-52)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis

Comment on lines 451 to 522
func (e *EmbeddedEtcd) ChangeMode(ctx context.Context, mode config.EtcdMode) (Etcd, error) {
if mode != config.EtcdModeClient {
return nil, fmt.Errorf("invalid mode transition from %s to %s", config.EtcdModeServer, mode)
}

if err := e.Start(ctx); err != nil {
return nil, err
}

cfg := e.cfg.Config()

embeddedClient, err := e.GetClient()
if err != nil {
return nil, err
}

// Get the full member list before removing this host
resp, err := embeddedClient.MemberList(ctx)
if err != nil {
return nil, fmt.Errorf("failed to list etcd members for server->client transition: %w", err)
}

var endpoints []string
for _, m := range resp.Members {
// Skip this host's member; we are about to remove it.
if m.Name == cfg.HostID {
continue
}
endpoints = append(endpoints, m.ClientURLs...)
}

if len(endpoints) == 0 {
return nil, fmt.Errorf("cannot demote etcd server on host %s: no remaining cluster members with client URLs", cfg.HostID)
}

generated := e.cfg.GeneratedConfig()
generated.EtcdClient.Endpoints = endpoints
if err := e.cfg.UpdateGeneratedConfig(generated); err != nil {
return nil, fmt.Errorf("failed to update generated config with client endpoints: %w", err)
}

if err := e.Shutdown(); err != nil {
return nil, err
}

remote := NewRemoteEtcd(e.cfg, e.logger)
if err := remote.Start(ctx); err != nil {
return nil, fmt.Errorf("failed to start remote client: %w", err)
}

remoteClient, err := remote.GetClient()
if err != nil {
return nil, fmt.Errorf("failed to get remote client: %w", err)
}

if err := RemoveMember(ctx, remoteClient, cfg.HostID); err != nil {
return nil, fmt.Errorf("failed to remove embedded etcd from cluster: %w", err)
}

if err := os.RemoveAll(e.etcdDir()); err != nil {
return nil, fmt.Errorf("failed to remove embedded etcd data dir: %w", err)
}

generated.EtcdMode = config.EtcdModeClient
generated.EtcdServer = config.EtcdServer{}
generated.EtcdClient = cfg.EtcdClient
if err := e.cfg.UpdateGeneratedConfig(generated); err != nil {
return nil, fmt.Errorf("failed to clear out etcd server settings in generated config: %w", err)
}

return remote, nil
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

ChangeMode() overwrites the computed client endpoints near the end

You set generated.EtcdClient.Endpoints = endpoints (Line 487-490), but later do generated.EtcdClient = cfg.EtcdClient (Line 516), where cfg is a stale snapshot from before the endpoints update—this likely clobbers endpoints and persists the wrong client config.

Minimal fix
 generated.EtcdMode = config.EtcdModeClient
 generated.EtcdServer = config.EtcdServer{}
-generated.EtcdClient = cfg.EtcdClient
+// Keep the computed client endpoints written earlier in this transition.
 if err := e.cfg.UpdateGeneratedConfig(generated); err != nil {
   return nil, fmt.Errorf("failed to clear out etcd server settings in generated config: %w", err)
 }
🤖 Prompt for AI Agents
In @server/internal/etcd/embedded.go around lines 451 - 522, In ChangeMode the
earlier assignment generated.EtcdClient.Endpoints = endpoints is clobbered later
by generated.EtcdClient = cfg.EtcdClient (where cfg is the stale snapshot), so
ensure the final generated.EtcdClient contains the computed endpoints; e.g.,
after copying cfg into generated.EtcdClient (or instead of overwriting),
explicitly set generated.EtcdClient.Endpoints = endpoints so the computed
endpoints persist before calling e.cfg.UpdateGeneratedConfig; update code around
the generated.EtcdClient assignments in ChangeMode to preserve endpoints.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the EmbeddedEtcd struct definition and understand its fields
cd server/internal/etcd && head -100 embedded.go | grep -A 30 "type EmbeddedEtcd"

Repository: pgEdge/control-plane

Length of output: 720


🏁 Script executed:

# Search for the EmbeddedEtcd struct definition more comprehensively
rg "type EmbeddedEtcd struct" -A 15

Repository: pgEdge/control-plane

Length of output: 1005


🏁 Script executed:

# Check for mutex usage in the EmbeddedEtcd struct and methods
rg "e\.mu\b|RLock\(\)|Lock\(\)|Unlock\(\)" server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 131


🏁 Script executed:

# Look at the RemoveMember function to understand constraints and failure modes
rg "func RemoveMember" -A 30

Repository: pgEdge/control-plane

Length of output: 1996


🏁 Script executed:

# Check how other lifecycle methods (like Start, Shutdown) handle the mutex
rg "func \(e \*EmbeddedEtcd\) (Start|Shutdown)" -A 10 server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 560


🏁 Script executed:

# Look for cleanup patterns with remote client in similar code
rg "remote\." server/internal/etcd/embedded.go | head -20

Repository: pgEdge/control-plane

Length of output: 147


🏁 Script executed:

# Let's see the complete ChangeMode method to check if it acquires the mutex
rg "func \(e \*EmbeddedEtcd\) ChangeMode" -A 80 server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 2439


🏁 Script executed:

# Look at the complete Shutdown method with mutex context
rg "func \(e \*EmbeddedEtcd\) Shutdown" -B 2 -A 20 server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 525


🏁 Script executed:

# Check if there are any tests for ChangeMode that might reveal the intended behavior
fd -e "_test.go" -e ".test.go" | xargs grep -l "ChangeMode" 2>/dev/null

Repository: pgEdge/control-plane

Length of output: 48


🏁 Script executed:

# Look for usage of ChangeMode in the codebase to understand the calling context
rg "\.ChangeMode\(" --type go

Repository: pgEdge/control-plane

Length of output: 213


🏁 Script executed:

# Check GetClient implementation and its mutex usage
rg "func \(e \*EmbeddedEtcd\) GetClient" -A 15 server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 491


🏁 Script executed:

# Look for error handling patterns and cleanup in the codebase
rg "defer.*Start\(\)|defer.*Shutdown\(\)" server/internal/etcd/

Repository: pgEdge/control-plane

Length of output: 46


🏁 Script executed:

# Check RemoteEtcd to see how it implements Start/Shutdown
rg "type RemoteEtcd struct" -A 10 server/internal/etcd/

Repository: pgEdge/control-plane

Length of output: 706


ChangeMode() has concurrency races and lacks cleanup after intermediate failures

The method has multiple issues:

  1. Race condition with concurrent access: GetClient() reads e.client and e.etcd without holding the mutex, while Shutdown() modifies these same fields without the lock. This creates a race window where GetClient() could read partially updated state during Shutdown().

  2. Orphaned remote client on failure: After remote.Start() succeeds (line 499), if RemoveMember() (line 508) or subsequent steps fail, the remote client remains running with no cleanup, leaving the instance in an inconsistent state with a stale config.

  3. Cluster size check happens too late: RemoveMember() validates cluster size (< 3 members) only after the remote client has started. This should be validated earlier to fail fast before mutation.

Fix by:

  • Holding e.mu for the entire transition (or explicitly document that callers must serialize ChangeMode calls).
  • Adding a defer remote.Shutdown() after remote.Start() to ensure cleanup on any subsequent error.
  • Moving the cluster viability check (validate size before starting remote) to prevent cascading failures.

@tsivaprasad tsivaprasad changed the title docs: add procedure for reconfiguring etcd mode (server ↔ client) feat: add procedure for reconfiguring etcd mode (server ↔ client) Jan 13, 2026
@tsivaprasad tsivaprasad changed the title feat: add procedure for reconfiguring etcd mode (server ↔ client) feat: enabled etcd mode (server ↔ client) reconfiguration Jan 13, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @server/internal/etcd/embedded.go:
- Around line 451-458: ChangeMode on EmbeddedEtcd has a race: it calls Start()
(which locks) then invokes GetClient() and Shutdown() without holding the mutex
while Shutdown and Error access/modify shared fields e.etcd and e.client
unsafely; to fix, add mutex protection around the entire mode transition in
ChangeMode (acquire the same mutex used in Start/Shutdown), and audit
unprotected accessors GetClient, Error, Shutdown (and any other methods touching
e.etcd or e.client) to either take the mutex for reads/writes or provide
copy-on-read semantics; reference methods: EmbeddedEtcd.ChangeMode,
EmbeddedEtcd.Start, EmbeddedEtcd.GetClient, EmbeddedEtcd.Shutdown,
EmbeddedEtcd.Error and shared fields e.etcd/e.client when making the change.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d3108b and d591f52.

📒 Files selected for processing (1)
  • server/internal/etcd/embedded.go
🧰 Additional context used
🧬 Code graph analysis (1)
server/internal/etcd/embedded.go (5)
server/internal/config/config.go (4)
  • EtcdMode (186-186)
  • EtcdModeClient (190-190)
  • EtcdModeServer (189-189)
  • Config (193-215)
server/internal/etcd/interface.go (1)
  • Etcd (40-55)
server/internal/patroni/config.go (2)
  • Etcd (144-158)
  • Config (298-310)
server/internal/etcd/remote.go (1)
  • NewRemoteEtcd (34-41)
server/internal/etcd/membership.go (1)
  • RemoveMember (34-52)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (3)
server/internal/etcd/embedded.go (3)

56-59: LGTM!

The idempotent guard is correctly placed within the mutex lock and aligns with the Shutdown method setting e.etcd = nil.


297-300: LGTM!

Setting e.etcd = nil after close enables proper idempotent behavior with the Start() guard.


473-484: LGTM!

Good defensive check ensuring at least one remaining cluster member with client URLs exists before proceeding with demotion.

Comment on lines +451 to +458
func (e *EmbeddedEtcd) ChangeMode(ctx context.Context, mode config.EtcdMode) (Etcd, error) {
if mode != config.EtcdModeClient {
return nil, fmt.Errorf("invalid mode transition from %s to %s", config.EtcdModeServer, mode)
}

if err := e.Start(ctx); err != nil {
return nil, err
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the full embedded.go file to understand the structure and mutex usage
wc -l server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 100


🏁 Script executed:

# Get the EmbeddedEtcd struct definition and relevant methods
sed -n '1,100p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 2769


🏁 Script executed:

# Look at the mutex field and how it's used
rg -n "mu.*sync" server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 90


🏁 Script executed:

# Search for the ChangeMode method and surrounding context
sed -n '440,480p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 1088


🏁 Script executed:

# Check the Start() and Shutdown() methods to see mutex usage
ast-grep --pattern $'func ($_ *EmbeddedEtcd) Start($_) {
  $$$
}'

Repository: pgEdge/control-plane

Length of output: 46


🏁 Script executed:

# Check the Error() method implementation
sed -n '300,310p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 264


🏁 Script executed:

# Search for all ChangeMode calls in the codebase
rg -n "ChangeMode" --type go -A 2 -B 2

Repository: pgEdge/control-plane

Length of output: 1871


🏁 Script executed:

# Get the full ChangeMode method (it continues beyond line 458)
sed -n '451,520p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 2180


🏁 Script executed:

# Find the Shutdown method
rg -n "func.*Shutdown" server/internal/etcd/embedded.go -A 10

Repository: pgEdge/control-plane

Length of output: 383


🏁 Script executed:

# Check GetClient implementation
rg -n "func.*GetClient" server/internal/etcd/embedded.go -A 5

Repository: pgEdge/control-plane

Length of output: 222


🏁 Script executed:

# Look at the provide.go file context around ChangeMode call
sed -n '80,110p' server/internal/etcd/provide.go

Repository: pgEdge/control-plane

Length of output: 1077


Address missing mutex protection in ChangeMode and related methods.

ChangeMode creates a race condition: it calls e.Start() (which acquires and releases the mutex), then calls e.GetClient() and e.Shutdown() without holding the mutex. Since Shutdown() sets e.etcd = nil without protection, any concurrent call to Error() (which accesses e.etcd unsafely at line 304-306) would panic.

The broader issue is inconsistent synchronization in EmbeddedEtcd: Start() properly uses the mutex, but GetClient(), Error(), Shutdown(), and other methods access shared fields (e.etcd, e.client) without any lock, creating potential data races.

Either ensure at the caller level (in provide.go) that this instance receives no concurrent access during the transition, or add mutex protection to ChangeMode() and unprotected accessor methods throughout the class.

🤖 Prompt for AI Agents
In @server/internal/etcd/embedded.go around lines 451 - 458, ChangeMode on
EmbeddedEtcd has a race: it calls Start() (which locks) then invokes GetClient()
and Shutdown() without holding the mutex while Shutdown and Error access/modify
shared fields e.etcd and e.client unsafely; to fix, add mutex protection around
the entire mode transition in ChangeMode (acquire the same mutex used in
Start/Shutdown), and audit unprotected accessors GetClient, Error, Shutdown (and
any other methods touching e.etcd or e.client) to either take the mutex for
reads/writes or provide copy-on-read semantics; reference methods:
EmbeddedEtcd.ChangeMode, EmbeddedEtcd.Start, EmbeddedEtcd.GetClient,
EmbeddedEtcd.Shutdown, EmbeddedEtcd.Error and shared fields e.etcd/e.client when
making the change.

Comment on lines +492 to +518
if err := e.Shutdown(); err != nil {
return nil, err
}

remote := NewRemoteEtcd(e.cfg, e.logger)
if err := remote.Start(ctx); err != nil {
return nil, fmt.Errorf("failed to start remote client: %w", err)
}

remoteClient, err := remote.GetClient()
if err != nil {
return nil, fmt.Errorf("failed to get remote client: %w", err)
}

if err := RemoveMember(ctx, remoteClient, cfg.HostID); err != nil {
return nil, fmt.Errorf("failed to remove embedded etcd from cluster: %w", err)
}

if err := os.RemoveAll(e.etcdDir()); err != nil {
return nil, fmt.Errorf("failed to remove embedded etcd data dir: %w", err)
}

generated.EtcdMode = config.EtcdModeClient
generated.EtcdServer = config.EtcdServer{}
if err := e.cfg.UpdateGeneratedConfig(generated); err != nil {
return nil, fmt.Errorf("failed to clear out etcd server settings in generated config: %w", err)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Partial failure leaves system in inconsistent state.

After Shutdown() succeeds, failures in subsequent steps leave the system in an unrecoverable state:

  • If remote.Start() fails: embedded is shut down, no client available.
  • If RemoveMember() fails: running as client but still a cluster member.
  • If os.RemoveAll() fails: orphaned data, but still demoted.
  • If the final UpdateGeneratedConfig() fails: config doesn't reflect actual state.

Consider either:

  1. Reordering operations to fail fast before irreversible changes (e.g., validate remote connectivity first).
  2. Adding best-effort rollback or recovery logging.
  3. Documenting that manual intervention may be required on partial failure.

@tsivaprasad tsivaprasad merged commit 8d87ef9 into main Jan 13, 2026
3 checks passed
@tsivaprasad tsivaprasad deleted the PLAT-315-ability-to-reconfigure-a-hosts-etcd-configuration branch January 14, 2026 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants