fix(distributed): implement cluster manager startup loops and state telemetry#798

Open
adolago wants to merge 1 commit into main from issue-788-cluster-manager

Conversation

adolago commented Feb 23, 2026

Summary

This change replaces startup and state-reporting stubs in the distributed cluster manager with concrete, testable behavior.

What changed

  • Implemented ClusterManager::start() startup initialization to:
    • initialize peer connectivity state and baseline peer health cache,
    • provision a fresh outbound message channel per lifecycle start,
    • spawn background loops for outbound message draining and health/heartbeat telemetry reconciliation.
  • Implemented ClusterManager::stop() lifecycle cleanup to:
    • flip running state,
    • abort spawned background tasks,
    • reset peer connection states and pending message counters,
    • clear known leader state.
  • Added deterministic background processing:
    • message drain loop updates peer liveness metadata, clears pending queue counters, and updates leader state from heartbeat traffic,
    • health loop reconciles peer health status from connection state/heartbeat freshness.
  • Reworked cluster_state() to report meaningful data from available runtime state:
    • leader now uses observed leader (heartbeat-derived) with local-controller fallback while running,
    • total_load now aggregates pending outbound peer queue load,
    • capacity/health aggregation remains computed from known peer telemetry.
  • Fixed broadcast deadlock risk by cloning peer IDs before awaiting send_message().
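The start/stop lifecycle above can be sketched in miniature. This is a simplified, synchronous model under assumed names (`Peer`, `pending_messages`, `known_leader` are illustrative, not necessarily the crate's actual fields); the real manager additionally provisions a fresh outbound channel and spawns/aborts async background tasks.

```rust
use std::collections::HashMap;

// Hypothetical, simplified model of the lifecycle described above; the async
// drain/health loops and the outbound channel are elided.
#[derive(Debug, Clone, PartialEq)]
enum PeerConnectionState {
    Connected,
    Disconnected,
    Failed { reason: String },
}

struct Peer {
    state: PeerConnectionState,
    pending_messages: u64,
}

struct ClusterManager {
    running: bool,
    peers: HashMap<String, Peer>,
    known_leader: Option<String>,
}

impl ClusterManager {
    fn start(&mut self) {
        // Initialize connectivity state and baseline health; the real code
        // also provisions a fresh message channel and spawns background loops.
        self.running = true;
        for peer in self.peers.values_mut() {
            peer.state = PeerConnectionState::Disconnected;
            peer.pending_messages = 0;
        }
    }

    fn stop(&mut self) {
        // Flip running state, reset peer state, clear leader; the real code
        // also aborts the spawned background tasks.
        self.running = false;
        for peer in self.peers.values_mut() {
            peer.state = PeerConnectionState::Disconnected;
            peer.pending_messages = 0;
        }
        self.known_leader = None;
    }
}
```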

Root cause

ClusterManager::start() and cluster_state() still contained TODO stubs, so no background processing existed for message queue draining or health reconciliation, and cluster telemetry always returned placeholder leader/load values.
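The reworked `cluster_state()` behavior (observed leader with local-controller fallback while running, and `total_load` summed from pending outbound queues) can be sketched as standalone functions; these names and signatures are illustrative, not the crate's actual API.

```rust
// Hedged sketch of the telemetry logic: while running, prefer the
// heartbeat-observed leader and fall back to the local controller.
fn leader(running: bool, observed: Option<&str>, local_id: &str) -> Option<String> {
    if !running {
        return None;
    }
    Some(observed.unwrap_or(local_id).to_string())
}

// total_load aggregates pending outbound queue depth across peers.
fn total_load(pending_per_peer: &[u64]) -> u64 {
    pending_per_peer.iter().sum()
}
```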

Validation

Targeted tests for the distributed cluster module pass:

cargo test --features distributed distributed::cluster::tests:: --lib -- --test-threads=1

Result:

  • 6 passed, 0 failed

Issue

Closes #788

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce36221fbb


Comment on lines +295 to +299

```rust
if matches!(peer.state, PeerConnectionState::Connected)
    && now.saturating_duration_since(peer.last_seen) > stale_after
{
    peer.state = PeerConnectionState::Failed {
        reason: "health timeout".to_string(),
```
P1: Preserve sendability after idle health timeouts

After about 3 * heartbeat_interval, this loop marks any idle connected peer as Failed, but send_message later only permits Connected peers (src/distributed/cluster.rs:442-446). Since no background path in this lifecycle promotes timed-out peers back to Connected, a quiet cluster can fall into a persistent PeerNotConnected state where later heartbeats or work dispatches cannot be queued.
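One possible remedy for this, sketched under assumed names (`on_heartbeat` and the variant set are illustrations, not the repository's actual API): when a heartbeat arrives from a peer that the health loop has failed, promote it back to `Connected` so `send_message` can accept traffic for it again.

```rust
#[derive(Debug, Clone, PartialEq)]
enum PeerConnectionState {
    Connected,
    Disconnected,
    Failed { reason: String },
}

// Hypothetical hook, called from the drain loop when a heartbeat arrives
// from a peer.
fn on_heartbeat(state: &mut PeerConnectionState) {
    match state {
        // Recover peers that timed out (or were never connected); without a
        // path like this, a quiet cluster can stay in PeerNotConnected
        // indefinitely.
        PeerConnectionState::Failed { .. } | PeerConnectionState::Disconnected => {
            *state = PeerConnectionState::Connected;
        }
        PeerConnectionState::Connected => {}
    }
}
```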

Comment on lines 245 to 248

```rust
for peer in peers.values_mut() {
    peer.state = PeerConnectionState::Disconnected;
    peer.pending_messages = 0;
}
```

P2: Reset peer health cache during shutdown

stop() resets connection state in self.peers but does not reconcile or clear self.peer_info, while cluster telemetry is computed from peer_info in cluster_state(). If callers inspect state after shutdown, they can still see previously healthy peers and capacity values from stale cache entries, which makes stopped-node telemetry inaccurate.
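The suggested fix amounts to clearing the cached health telemetry alongside the connection-state reset, so stopped-node telemetry is empty rather than stale. A minimal sketch, where `PeerInfo` and its fields are assumptions standing in for the repo's actual types:

```rust
use std::collections::HashMap;

// Hypothetical cached telemetry entry; field names are illustrative.
struct PeerInfo {
    healthy: bool,
    capacity: u64,
}

struct ClusterManager {
    running: bool,
    peer_info: HashMap<String, PeerInfo>,
}

impl ClusterManager {
    fn stop(&mut self) {
        self.running = false;
        // Drop cached entries so cluster_state() reports an empty view after
        // shutdown instead of previously healthy peers and stale capacity.
        self.peer_info.clear();
    }
}
```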

Development

Successfully merging this pull request may close these issues.

Cluster manager startup/state are stubbed (leader/load telemetry missing)
