Skip to content

Cases Where Removing a Member from the Cluster Fails #387

@Besroy

Description

@Besroy

Documenting scenarios where a member fails to remove itself from the cluster.
While the replace member operation completes successfully, the removed member may still believe it is part of the group until it eventually detects that it is no longer in the cluster. This behavior is observed in NuRaft.

Scenarios:

  1. Network Issue: The removed member does not receive the remove request.
  2. Member Down: The removed member is offline during the removal process, and the remove request is missed after it restarts.
  3. Baseline Resync (BR): A concurrency issue prevents the member from executing the removal process during BR. The root cause has not been further investigated, but related logs are provided below for reference.

leader:

[01/26/26 22:17:48.068722] [I] [60] [handle_join_leave.cxx:536:rm_srv_from_cluster] removed server 2139664025 from configuration and save the configuration to log store at 1041772 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:48.069032] [I] [60] [handle_join_leave.cxx:555:rm_srv_from_cluster] set srv_to_leave_, server 2139664025 will be removed from cluster, config 1041772 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:48.070181] [I] [76] [handle_commit.cxx:440:commit_conf] config at index 1041772 is committed, prev config log idx 512091 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:48.070406] [I] [76] [handle_commit.cxx:769:reconfigure] new config log idx 1041772, prev log idx 512091, cur config log idx 512091, prev log idx 510594 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:48.070411] [I] [76] [handle_commit.cxx:923:reconfigure] srv_to_leave_: 2139664025 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:48.070415] [I] [76] [handle_commit.cxx:927:reconfigure] srv_to_leave_ has snapshot context 0x58e053b8d790 and user context 0x58e052a7bbb0, destroy them [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:48.171668] [I] [76] [handle_commit.cxx:961:reconfigure] clearing uncommitted config at log 1041772, prev 512091 [group=4bd41042-ee43-48af-b046-57f050f0729c]
my id: 1989274623, leader: 1989274623, term: 1 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:48.226051] [I] [62] [handle_timeout.cxx:117:handle_hb_timeout] peer 2139664025 is not responding for 1 HBs since leave request [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:48.908713] [I] [60] [handle_snapshot_sync.cxx:342:handle_install_snapshot_resp] no snapshot sync context for this peer, drop the response [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:49.435731] [I] [63] [handle_timeout.cxx:117:handle_hb_timeout] peer 2139664025 is not responding for 2 HBs since leave request [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:49.435748] [I] [63] [handle_snapshot_sync.cxx:146:create_sync_snapshot_req] trying to sync snapshot with last index 1040000 to peer 2139664025, its last log idx 950000, my start index 1020001, my last log idx 1041772 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:50.636040] [I] [62] [handle_timeout.cxx:117:handle_hb_timeout] peer 2139664025 is not responding for 3 HBs since leave request [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:51.836236] [I] [63] [handle_timeout.cxx:117:handle_hb_timeout] peer 2139664025 is not responding for 4 HBs since leave request [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:53.036507] [I] [62] [handle_timeout.cxx:117:handle_hb_timeout] peer 2139664025 is not responding for 5 HBs since leave request [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:53.036514] [E] [62] [handle_timeout.cxx:122:handle_hb_timeout] force remove peer 2139664025 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:53.036517] [I] [62] [handle_join_leave.cxx:565:handle_join_leave_rpc_err] rpc failed for removing server (2139664025), will remove this server directly [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:53.036521] [I] [62] [handle_commit.cxx:1005:remove_peer_from_peers] server 2139664025 is removed from cluster [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:53.783182] [I] [62] [handle_join_leave.cxx:636:reset_srv_to_leave] clearing srv_to_leave_ [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:53.783326] [I] [60] [peer.cxx:102:handle_rpc_result] peer 2139664025 has been shut down, ignore response. [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:18:47.330364] [I] [19] [raft_server.cxx:393:apply_and_log_current_params] parameters: timeout 3000 - 6000, heartbeat 1200, leadership expiry 24000, max batch 64, backoff 250, snapshot distance 10000, enable randomized snapshot creation NO, log sync stop gap 0, use new joiner type NO, reserved logs 10000, client timeout 3000, auto forwarding OFF, API call type ASYNC, custom commit quorum size 0, custom election quorum size 0, snapshot receiver INCLUDED, leadership transfer wait time 0, grace period of lagging state machine 0, snapshot IO: ASYNC, parallel log appending: OFF, streaming mode max log gap 0, max bytes 0, full consensus mode: OFF [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:18:47.330368] [I] [19] [raft_server.cxx:351:update_rand_timeout] new timeout range: 3000 -- 6000 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/27/26 09:18:09.648526] [I] [74] [handle_user_cmd.cxx:145:flip_learner_flag] set learner flag to true for server 1002845171 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/27/26 09:18:09.648531] [I] [74] [handle_user_cmd.cxx:155:flip_learner_flag] copy new conf buf [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/27/26 09:18:09.649501] [I] [74] [handle_user_cmd.cxx:164:flip_learner_flag] request append entries [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/27/26 09:18:09.651030] [I] [76] [handle_commit.cxx:440:commit_conf] config at index 1041774 is committed, prev config log idx 1041772 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/27/26 09:18:09.651336] [I] [76] [handle_commit.cxx:769:reconfigure] new config log idx 1041774, prev log idx 1041772, cur config log idx 1041772, prev log idx 512091 [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/27/26 09:18:09.651351] [I] [76] [handle_commit.cxx:961:reconfigure] clearing uncommitted config at log 1041774, prev 1041772 [group=4bd41042-ee43-48af-b046-57f050f0729c]
my id: 1989274623, leader: 1989274623, term: 1 [group=4bd41042-ee43-48af-b046-57f050f0729c]

follower:

[01/26/26 22:17:49.369374] [I] [64] [handle_snapshot_sync.cxx:471:handle_snapshot_sync_req] save snapshot (idx 1040000, term 1) offset 0x0, first obj  [group=4bd41042-ee43-48af-b046-5
7f050f0729c]
[01/26/26 22:17:49.402110] [I] [64] [handle_commit.cxx:1012:pause_state_machine_execution] pause state machine execution, previously PAUSED, state machine SLEEPING, timeout 1000 ms [g
roup=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:49.402113] [I] [64] [handle_commit.cxx:1025:pause_state_machine_execution] waited 0 ms, state machine SLEEPING [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:17:58.696600] [I] [62] [handle_timeout.cxx:239:handle_election_timeout] stepping down (cycles left: 1), skip this election timeout event [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:18:03.810713] [I] [62] [handle_timeout.cxx:212:handle_election_timeout] no hearing further news from leader, remove this server from cluster and step down [group=4bd41042-ee43-48af-b046-57f050f0729c]
[01/26/26 22:18:03.828] [storage_mgr] [info] [62] [raft_repl_dev.cpp:2104:save_state] [traceID=n/a] [rdev0:4bd41042-ee43-48af-b046-57f050f0729c] Saved state in binary format (size=16 bytes): term=1, voted_for=-1, election_timer_allowed=false, catching_up=false, receiving_snapshot=true

expected behavior (copied from a successful env)

[01/28/26 01:00:28.608318] [I] [74] [handle_commit.cxx:948:reconfigure] peer 1256312583 cannot be found, no action for removing [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:00:28.608330] [I] [74] [handle_commit.cxx:996:reconfigure] new configuration: log idx 980495, prev log idx 980494
peer 396558301, DC ID 0, 88054bf5-be37-4bfc-8b76-2bb2c9b2ca74, voting member, regular member, 100
peer 1967661896, DC ID 0, 4d27e9ad-cc3b-4cbc-9cdd-a61720d9b5f5, voting member, regular member, 66
peer 1323286963, DC ID 0, 1d0ad783-8fb9-4908-af10-72cb04cf1af1, voting member, regular member, 66
my id: 1256312583, leader: 1967661896, term: 4 [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:00:32.649982] [I] [29] [handle_timeout.cxx:239:handle_election_timeout] stepping down (cycles left: 1), skip this election timeout event [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:00:36.667035] [I] [30] [handle_timeout.cxx:212:handle_election_timeout] no hearing further news from leader, remove this server from cluster and step down [group=9065d36e-3078-4d21-8902-c968c2dd3993]

[01/28/26 01:01:31.618439] [I] [19] [raft_server.cxx:466:shutdown] shutting down raft core [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618576] [I] [19] [raft_server.cxx:485:shutdown] sent stop signal to the commit thread. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618580] [I] [19] [raft_server.cxx:491:shutdown] cancelled all schedulers. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618593] [I] [19] [raft_server.cxx:501:shutdown] commit thread stopped. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618596] [I] [19] [raft_server.cxx:505:shutdown] all pending commit elements dropped. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618598] [I] [19] [raft_server.cxx:509:shutdown] all state machine watchers dropped. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618612] [I] [19] [raft_server.cxx:519:shutdown] reset all pointers. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618864] [I] [19] [raft_server.cxx:534:shutdown] joined terminated commit thread. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618888] [I] [19] [raft_server.cxx:541:shutdown] sent stop signal to background append thread. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618899] [I] [75] [handle_append_entries.cxx:147:append_entries_in_bg] bg append_entries thread terminated [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618924] [I] [19] [raft_server.cxx:549:shutdown] clean up auto-forwarding queue: 0 elems [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618928] [I] [19] [raft_server.cxx:553:shutdown] clean up auto-forwarding clients [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618930] [I] [19] [raft_server.cxx:556:shutdown] raft_server shutdown completed. [group=9065d36e-3078-4d21-8902-c968c2dd3993]

[01/28/26 01:00:36.667] [storage_mgr] [info] [30] [raft_repl_dev.cpp:2104:save_state] [traceID=n/a] [rdev0:9065d36e-3078-4d21-8902-c968c2dd3993] Saved state in binary format (size=16 bytes): term=4, voted_for=1967661896, election_timer_allowed=false, catching_up=false, receiving_snapshot=false
[01/28/26 01:00:55.918] [storage_mgr] [debug] [20] [raft_repl_dev.cpp:2465:get_cp_ctx] [traceID=n/a] [rdev0:9065d36e-3078-4d21-8902-c968c2dd3993] getting cp_ctx for raft repl dev 0x580c2a0cfa10, cp_lsn=980495, clsn=0, next_dsn=980486, cp string:CP=210: status=cp_trigger, enter_count=1
[01/28/26 01:00:55.918] [storage_mgr] [debug] [20] [raft_repl_dev.cpp:2456:cp_flush] [traceID=n/a] [rdev0:9065d36e-3078-4d21-8902-c968c2dd3993] cp flush in raft repl dev, lsn=980495, clsn=0, next_dsn=980486, cp string:CP=210: status=cp_flushing, enter_count=0
[01/28/26 01:01:31.618] [storage_mgr] [info] [19] [raft_repl_service.cpp:715:gc_repl_devs] ReplDev group_id=9065d36e-3078-4d21-8902-c968c2dd3993 was destroyed, shutting down the raft group in delayed fashion now
[01/28/26 01:01:31.618] [storage_mgr] [info] [19] [manager_impl.cpp:372:leave_group] Leaving group [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618439] [I] [19] [raft_server.cxx:466:shutdown] shutting down raft core [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618576] [I] [19] [raft_server.cxx:485:shutdown] sent stop signal to the commit thread. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618580] [I] [19] [raft_server.cxx:491:shutdown] cancelled all schedulers. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618593] [I] [19] [raft_server.cxx:501:shutdown] commit thread stopped. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618596] [I] [19] [raft_server.cxx:505:shutdown] all pending commit elements dropped. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618598] [I] [19] [raft_server.cxx:509:shutdown] all state machine watchers dropped. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618612] [I] [19] [raft_server.cxx:519:shutdown] reset all pointers. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618864] [I] [19] [raft_server.cxx:534:shutdown] joined terminated commit thread. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618888] [I] [19] [raft_server.cxx:541:shutdown] sent stop signal to background append thread. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618899] [I] [75] [handle_append_entries.cxx:147:append_entries_in_bg] bg append_entries thread terminated [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618924] [I] [19] [raft_server.cxx:549:shutdown] clean up auto-forwarding queue: 0 elems [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618928] [I] [19] [raft_server.cxx:553:shutdown] clean up auto-forwarding clients [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618930] [I] [19] [raft_server.cxx:556:shutdown] raft_server shutdown completed. [group=9065d36e-3078-4d21-8902-c968c2dd3993]
[01/28/26 01:01:31.618] [storage_mgr] [info] [19] [raft_repl_dev.cpp:2181:permanent_destroy] [traceID=n/a] [rdev0:9065d36e-3078-4d21-8902-c968c2dd3993] Permanent destroy for raft repl dev group_id=9065d36e-3078-4d21-8902-c968c2dd3993
[01/28/26 01:01:31.656] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:2465:get_cp_ctx] [traceID=n/a] [rdev0:9065d36e-3078-4d21-8902-c968c2dd3993] getting cp_ctx for raft repl dev 0x580c2a0cfa10, cp_lsn=980495, clsn=0, next_dsn=980486, cp string:CP=211: status=cp_trigger, enter_count=1
[01/28/26 01:01:31.658] [storage_mgr] [info] [19] [replication_state_machine.cpp:843:destroy_snapshot_sb] Snapshot context superblk destroyed for group_id=9065d36e-3078-4d21-8902-c968c2dd3993
[01/28/26 01:01:31.658] [storage_mgr] [info] [19] [replication_state_machine.cpp:284:on_destroy] replica destroyed, cleared pg=0 resources with group_id=9065d36e-3078-4d21-8902-c968c2dd3993
[01/28/26 01:01:31.665] [storage_mgr] [info] [19] [manager_impl.cpp:390:leave_group] Finished leaving: [group=9065d36e-3078-4d21-8902-c968c2dd3993]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions