A lightweight, extensible group membership and failure detection system written in Go.
Implements a SWIM-style protocol with a pluggable detection mode (currently a Ping/Ack variant; Gossip stubbed for extension), suspicion propagation, piggybacked membership updates, and operational tooling for live cluster introspection.
- Membership tracking with incarnation numbers
- Failure detection via randomized periodic Ping/Ack probing (PingAckMode)
- Suspicion phase before declaring failure using a tunable multi-report policy (utils.SuspicionManager)
- Piggyback dissemination of recent membership deltas (utils.MembershipList.AddRecentUpdate)
- Structured statuses: Alive → Suspected → Failed (utils.MemberStatus)
- Time-bounded retention of confirmed failures
- Local HTTP control server for safe automation (NewControlServer)
- Interactive CLI ticker summarizing membership health (StartCLI)
- Remote / local monitoring script: scripts/monitor_node.sh
- Network abstraction with handler registry & artificial drop-rate injection (utils.NetworkLayer)
| Component | Responsibility |
|---|---|
| Controller (Controller) | Orchestrates network, detector mode, suspicion manager |
| Network Layer (utils.NetworkLayer) | UDP messaging, handler dispatch, bounded queue |
| Membership Store (utils.MembershipList) | Thread-safe state, random peer sampling, update window |
| Failure Detection (detectors.PingAckManager) | Periodic direct + indirect probes, ACK correlation |
| Suspicion Manager (utils.SuspicionManager) | Aggregates suspicion reports, escalates to failure |
| CLI / Control Plane | HTTP endpoints + periodic status logging |
| Scripts | Cluster ops (log generation, monitoring) |
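The "update window" kept by the membership store can be pictured as a small bounded buffer of recent deltas. The following is only a sketch under assumptions: the `Update` type, its fields, `RecentUpdates`, and `Piggyback` are hypothetical names for illustration and are not taken from utils.MembershipList.

```go
package main

import (
	"fmt"
	"sync"
)

// Update is a hypothetical membership delta; the field names are assumptions.
type Update struct {
	Addr        string
	Incarnation uint64
	Status      string
}

// RecentUpdates holds at most capacity deltas, oldest first, one per address.
type RecentUpdates struct {
	mu       sync.Mutex
	capacity int
	items    []Update
}

func NewRecentUpdates(capacity int) *RecentUpdates {
	return &RecentUpdates{capacity: capacity}
}

// Add records u, replacing any older delta for the same address and
// evicting the oldest entries once the window is full.
func (r *RecentUpdates) Add(u Update) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for i, old := range r.items {
		if old.Addr == u.Addr {
			r.items = append(r.items[:i], r.items[i+1:]...)
			break
		}
	}
	r.items = append(r.items, u)
	if len(r.items) > r.capacity {
		r.items = r.items[len(r.items)-r.capacity:]
	}
}

// Piggyback returns a copy of the newest deltas, at most n of them.
func (r *RecentUpdates) Piggyback(n int) []Update {
	r.mu.Lock()
	defer r.mu.Unlock()
	if n > len(r.items) {
		n = len(r.items)
	}
	if n < 0 {
		n = 0
	}
	out := make([]Update, n)
	copy(out, r.items[len(r.items)-n:])
	return out
}

func main() {
	w := NewRecentUpdates(3)
	for _, addr := range []string{"a", "b", "c", "d"} {
		w.Add(Update{Addr: addr, Incarnation: 1, Status: "alive"})
	}
	w.Add(Update{Addr: "b", Incarnation: 2, Status: "suspected"})
	for _, u := range w.Piggyback(2) {
		fmt.Println(u.Addr, u.Incarnation, u.Status)
	}
}
```

Capping both the stored window and the number of deltas attached per message keeps the per-packet overhead bounded regardless of cluster churn.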
Data flow (Ping/Ack):
- Periodic selection of a random target
- Direct ping → wait for ACK
- On timeout: indirect probes via k helper nodes
- On continued silence: declare Suspected (PingAckManager.declareSuspicion)
- Suspicion reports aggregated → promote to Failed after timeout or quorum
- Updates piggybacked onto outgoing protocol messages
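The probe steps above can be sketched as a single function. This is an illustrative outline, not the code in PingAckManager: the `Prober` interface, `probeRound`, and the `fakeNet` test double are hypothetical, and helpers are queried one after another here for simplicity, whereas a real implementation would more likely send indirect probes concurrently.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Prober abstracts the network calls a probe round needs. Both methods are
// hypothetical stand-ins, not the project's NetworkLayer API.
type Prober interface {
	// Ping reports whether target acknowledged before the timeout.
	Ping(target string) bool
	// IndirectPing asks helper to ping target on our behalf.
	IndirectPing(helper, target string) bool
}

// Outcome is the result of one protocol period.
type Outcome int

const (
	AckedDirect Outcome = iota
	AckedIndirect
	Suspected
)

// probeRound pings target directly, falls back to at most k randomly chosen
// helpers, and reports Suspected if nobody obtains an ACK.
func probeRound(p Prober, target string, peers []string, k int) Outcome {
	if p.Ping(target) {
		return AckedDirect
	}
	helpers := make([]string, 0, len(peers))
	for _, peer := range peers {
		if peer != target {
			helpers = append(helpers, peer)
		}
	}
	rand.Shuffle(len(helpers), func(i, j int) {
		helpers[i], helpers[j] = helpers[j], helpers[i]
	})
	if len(helpers) > k {
		helpers = helpers[:k]
	}
	for _, h := range helpers {
		if p.IndirectPing(h, target) {
			return AckedIndirect
		}
	}
	return Suspected
}

// fakeNet is a test double: direct lists nodes that answer pings, relays
// lists "helper->target" pairs for which an indirect probe succeeds.
type fakeNet struct {
	direct map[string]bool
	relays map[string]bool
}

func (f fakeNet) Ping(t string) bool            { return f.direct[t] }
func (f fakeNet) IndirectPing(h, t string) bool { return f.relays[h+"->"+t] }

func main() {
	fn := fakeNet{
		direct: map[string]bool{"a": true},
		relays: map[string]bool{"a->b": true},
	}
	peers := []string{"a", "b", "c"}
	fmt.Println(probeRound(fn, "a", peers, 2) == AckedDirect)
	fmt.Println(probeRound(fn, "b", peers, 2) == AckedIndirect)
	fmt.Println(probeRound(fn, "c", peers, 2) == Suspected)
}
```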
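go build -o mp2-node .                             # build the node binary
./mp2-node -port 8080 -is-introducer               # start the seed (introducer) node
./mp2-node -port 8081 -introducer 127.0.0.1:8080   # start a second node and join via the introducer
./mp2-node -cmd list_mem                           # one-off control client commands
./mp2-node -cmd list_self
./mp2-node -cmd display_suspects
./scripts/monitor_node.sh localhost 8080           # monitor a node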
| Flag | Description |
|---|---|
| -port | UDP listen port (default 8080) |
| -introducer | Introducer ip:port to join |
| -is-introducer | Start as seed node |
| -mode | gossip (stub) or pingack |
| -cmd | One-off control client command |
| -control-port | Override local HTTP control port (defaults to port+10000) |
| -foreground | Skip daemonization |
| Endpoint | Purpose |
|---|---|
| /list_mem | Current membership view |
| /list_self | Local node identity |
| /display_suspects | Active suspicion entries |
| /join?introducer=IP:PORT | Force join |
| /leave | Voluntary leave (graceful) |
| /switch | Switch detection mode / suspicion toggle (future extension) |
| /display_protocol | Active mode |
Served on 127.0.0.1:<control-port>.
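For automation, the endpoints can be called from any HTTP client. The sketch below assumes the endpoints answer a plain GET with a text body, which the tables above do not state explicitly; `controlURL` and `query` are hypothetical helpers, and the only facts taken from this document are the loopback address and the default control port of port+10000.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// controlURL builds the control-plane URL for a node whose UDP port is
// udpPort, assuming the default control port (udpPort + 10000).
func controlURL(udpPort int, endpoint string) string {
	return fmt.Sprintf("http://127.0.0.1:%d/%s",
		udpPort+10000, strings.TrimPrefix(endpoint, "/"))
}

// query issues a GET request and returns the response body as a string.
func query(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("control server returned %s", resp.Status)
	}
	return string(body), nil
}

func main() {
	url := controlURL(8080, "list_mem")
	fmt.Println("querying", url)

	client := &http.Client{Timeout: 2 * time.Second}
	body, err := query(client, url)
	if err != nil {
		// Expected when no node is running locally.
		fmt.Println("control server not reachable:", err)
		return
	}
	fmt.Print(body)
}
```

If a node was started with -control-port, that value replaces the computed port.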
| State | Trigger | Notes |
|---|---|---|
| Alive | Heartbeats / pings observed | Normal operation |
| Suspected | Timeout & insufficient ACKs | SuspicionTimeout, quorum-based escalation |
| Failed | Suspicion confirmed | Retained temporarily for convergence |
See PingAckManager.performSWIMProtocolPeriod and SuspicionManager.loop.
- Incarnation numbering avoids stale overwrites
- Indirect probing reduces false positives under transient network loss
- Batching: recent updates window limits redundant payload growth
- Separation of suspicion vs. failure lowers incorrect failure declarations
- Deterministic local-only control API avoids exposing cluster mutation externally
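To make the incarnation point concrete, here is one common SWIM-style precedence rule: a higher incarnation always wins, and at equal incarnation the more severe status wins. This document does not spell out the exact rule used by MembershipList, so treat the rule, the `Member` fields, and the `supersedes` and `merge` helpers as assumptions. Some SWIM variants also let a confirmed failure override any incarnation, which is not modeled here.

```go
package main

import "fmt"

// Status is ordered by severity so that statuses can be compared directly.
type Status int

const (
	Alive Status = iota
	Suspected
	Failed
)

// Member is a hypothetical view of one node; the field names are assumptions.
type Member struct {
	Addr        string
	Incarnation uint64
	Status      Status
}

// supersedes reports whether update should replace current.
func supersedes(update, current Member) bool {
	if update.Incarnation != current.Incarnation {
		return update.Incarnation > current.Incarnation
	}
	return update.Status > current.Status
}

// merge applies update to view and returns true if the view changed.
func merge(view map[string]Member, update Member) bool {
	current, known := view[update.Addr]
	if known && !supersedes(update, current) {
		return false // stale or redundant: keep what we have
	}
	view[update.Addr] = update
	return true
}

func main() {
	view := map[string]Member{}
	fmt.Println(merge(view, Member{"n1", 3, Alive}))     // first sighting
	fmt.Println(merge(view, Member{"n1", 2, Failed}))    // stale incarnation, ignored
	fmt.Println(merge(view, Member{"n1", 3, Suspected})) // same incarnation, more severe
	fmt.Println(merge(view, Member{"n1", 4, Alive}))     // refutation with a newer incarnation
}
```

The last call shows how a suspected node can clear its own suspicion by announcing itself as Alive under a higher incarnation.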
- Activate full gossip-based dissemination engine
- Adaptive probe intervals based on recent stability
- Metrics / Prometheus exporter
- Encryption / Auth for control plane
- Pluggable transport (QUIC / TCP fallback)
go vet ./...
go test ./... # (Add tests; current suite minimal)
Simulate packet loss (future flag hook):
- Introduce adjustable drop rate in NetworkLayer.
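A drop-rate hook of this kind can be as small as a wrapper around the send function. The sketch below is an assumption about how such a hook might look, not a description of NetworkLayer: `LossySender` and its fields are hypothetical, and a seeded random source is used so that test runs are repeatable.

```go
package main

import (
	"fmt"
	"math/rand"
)

// LossySender wraps a send function and silently drops a fraction of
// messages, imitating lost UDP datagrams.
type LossySender struct {
	dropRate float64 // probability in [0, 1] that a message is dropped
	rng      *rand.Rand
	send     func(addr string, payload []byte) error
	Sent     int
	Dropped  int
}

func NewLossySender(dropRate float64, seed int64, send func(string, []byte) error) *LossySender {
	return &LossySender{
		dropRate: dropRate,
		rng:      rand.New(rand.NewSource(seed)),
		send:     send,
	}
}

// Send forwards the payload unless the random draw falls below dropRate.
// A dropped message returns nil, since the sender cannot observe UDP loss.
func (l *LossySender) Send(addr string, payload []byte) error {
	if l.rng.Float64() < l.dropRate {
		l.Dropped++
		return nil
	}
	l.Sent++
	return l.send(addr, payload)
}

func main() {
	delivered := 0
	deliver := func(addr string, payload []byte) error {
		delivered++
		return nil
	}

	s := NewLossySender(0.3, 42, deliver)
	for i := 0; i < 1000; i++ {
		_ = s.Send("127.0.0.1:8081", []byte("ping"))
	}
	fmt.Printf("sent=%d dropped=%d delivered=%d\n", s.Sent, s.Dropped, delivered)
}
```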
| Script | Description |
|---|---|
| scripts/log_generator.sh | Generate large logs across hosts |
| scripts/monitor_node.sh | Interactive suspicion dashboard |
- Controller startup: Controller.Start
- Membership lifecycle: MembershipList
- Failure detection loop: PingAckManager.performSWIMProtocolPeriod
- Suspicion escalation: SuspicionManager
- Types & statuses: Member, MemberStatus