fix(distributed): implement cluster manager startup loops and state telemetry#798

Open
adolago wants to merge 1 commit into main from issue-788-cluster-manager

Conversation

adolago commented Feb 23, 2026

Summary

This change replaces startup and state-reporting stubs in the distributed cluster manager with concrete, testable behavior.

What changed

  • Implemented ClusterManager::start() startup initialization to:
    • initialize peer connectivity state and baseline peer health cache,
    • provision a fresh outbound message channel per lifecycle start,
    • spawn background loops for outbound message draining and health/heartbeat telemetry reconciliation.
  • Implemented ClusterManager::stop() lifecycle cleanup to:
    • flip running state,
    • abort spawned background tasks,
    • reset peer connection states and pending message counters,
    • clear known leader state.
  • Added deterministic background processing:
    • message drain loop updates peer liveness metadata, clears pending queue counters, and updates leader state from heartbeat traffic,
    • health loop reconciles peer health status from connection state/heartbeat freshness.
  • Reworked cluster_state() to report meaningful data from available runtime state:
    • leader now uses observed leader (heartbeat-derived) with local-controller fallback while running,
    • total_load now aggregates pending outbound peer queue load,
    • capacity/health aggregation remains computed from known peer telemetry.
  • Fixed broadcast deadlock risk by cloning peer IDs before awaiting send_message().
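The start/stop lifecycle above can be sketched in miniature. This is a simplified, synchronous model under assumed names (`Peer`, `pending_messages`, `known_leader` are illustrative, not necessarily the crate's actual fields); the real manager additionally provisions a fresh outbound channel and spawns/aborts async background tasks.

```rust
use std::collections::HashMap;

// Hypothetical, simplified model of the lifecycle described above; the async
// drain/health loops and the outbound channel are elided.
#[derive(Debug, Clone, PartialEq)]
enum PeerConnectionState {
    Connected,
    Disconnected,
    Failed { reason: String },
}

struct Peer {
    state: PeerConnectionState,
    pending_messages: u64,
}

struct ClusterManager {
    running: bool,
    peers: HashMap<String, Peer>,
    known_leader: Option<String>,
}

impl ClusterManager {
    fn start(&mut self) {
        // Initialize connectivity state and baseline health; the real code
        // also provisions a fresh message channel and spawns background loops.
        self.running = true;
        for peer in self.peers.values_mut() {
            peer.state = PeerConnectionState::Disconnected;
            peer.pending_messages = 0;
        }
    }

    fn stop(&mut self) {
        // Flip running state, reset peer state, clear leader; the real code
        // also aborts the spawned background tasks.
        self.running = false;
        for peer in self.peers.values_mut() {
            peer.state = PeerConnectionState::Disconnected;
            peer.pending_messages = 0;
        }
        self.known_leader = None;
    }
}
```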

Root cause

ClusterManager::start() and cluster_state() still contained TODO stubs, so no background processing existed for message queue draining or health reconciliation, and cluster telemetry always returned placeholder leader/load values.
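The reworked `cluster_state()` behavior (observed leader with local-controller fallback while running, and `total_load` summed from pending outbound queues) can be sketched as standalone functions; these names and signatures are illustrative, not the crate's actual API.

```rust
// Hedged sketch of the telemetry logic: while running, prefer the
// heartbeat-observed leader and fall back to the local controller.
fn leader(running: bool, observed: Option<&str>, local_id: &str) -> Option<String> {
    if !running {
        return None;
    }
    Some(observed.unwrap_or(local_id).to_string())
}

// total_load aggregates pending outbound queue depth across peers.
fn total_load(pending_per_peer: &[u64]) -> u64 {
    pending_per_peer.iter().sum()
}
```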

Validation

Targeted tests for the distributed cluster module pass:

cargo test --features distributed distributed::cluster::tests:: --lib -- --test-threads=1

Result:

  • 6 passed, 0 failed

Issue

Closes #788

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce36221fbb


Comment on lines +295 to +299

```rust
if matches!(peer.state, PeerConnectionState::Connected)
    && now.saturating_duration_since(peer.last_seen) > stale_after
{
    peer.state = PeerConnectionState::Failed {
        reason: "health timeout".to_string(),
```
P1: Preserve sendability after idle health timeouts

After about 3 * heartbeat_interval, this loop marks any idle connected peer as Failed, but send_message later only permits Connected peers (src/distributed/cluster.rs:442-446). Since no background path in this lifecycle promotes timed-out peers back to Connected, a quiet cluster can fall into a persistent PeerNotConnected state where later heartbeats or work dispatches cannot be queued.
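One possible remedy for this, sketched under assumed names (`on_heartbeat` and the variant set are illustrations, not the repository's actual API): when a heartbeat arrives from a peer that the health loop has failed, promote it back to `Connected` so `send_message` can accept traffic for it again.

```rust
#[derive(Debug, Clone, PartialEq)]
enum PeerConnectionState {
    Connected,
    Disconnected,
    Failed { reason: String },
}

// Hypothetical hook, called from the drain loop when a heartbeat arrives
// from a peer.
fn on_heartbeat(state: &mut PeerConnectionState) {
    match state {
        // Recover peers that timed out (or were never connected); without a
        // path like this, a quiet cluster can stay in PeerNotConnected
        // indefinitely.
        PeerConnectionState::Failed { .. } | PeerConnectionState::Disconnected => {
            *state = PeerConnectionState::Connected;
        }
        PeerConnectionState::Connected => {}
    }
}
```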

Comment on lines 245 to 248

```rust
for peer in peers.values_mut() {
    peer.state = PeerConnectionState::Disconnected;
    peer.pending_messages = 0;
}
```

P2: Reset peer health cache during shutdown

stop() resets connection state in self.peers but does not reconcile or clear self.peer_info, while cluster telemetry is computed from peer_info in cluster_state(). If callers inspect state after shutdown, they can still see previously healthy peers and capacity values from stale cache entries, which makes stopped-node telemetry inaccurate.
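The suggested fix amounts to clearing the cached health telemetry alongside the connection-state reset, so stopped-node telemetry is empty rather than stale. A minimal sketch, where `PeerInfo` and its fields are assumptions standing in for the repo's actual types:

```rust
use std::collections::HashMap;

// Hypothetical cached telemetry entry; field names are illustrative.
struct PeerInfo {
    healthy: bool,
    capacity: u64,
}

struct ClusterManager {
    running: bool,
    peer_info: HashMap<String, PeerInfo>,
}

impl ClusterManager {
    fn stop(&mut self) {
        self.running = false;
        // Drop cached entries so cluster_state() reports an empty view after
        // shutdown instead of previously healthy peers and stale capacity.
        self.peer_info.clear();
    }
}
```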

Development

Successfully merging this pull request may close these issues.

Cluster manager startup/state are stubbed (leader/load telemetry missing)
