Distributed system simulator for edge computing scenarios. This project demonstrates node selection algorithms, state synchronization, batch processing with failure handling, and consensus protocols relevant to edge and distributed systems.
- Node simulation: 20–50 nodes with labels, capacity, health (Online/Offline/Degraded), and configurable failure rate
- State management: Distributed key-value store with versioning, quorum replication, and conflict resolution (last-write-wins)
- Batch operations: Deploy, config updates, and health checks across node groups with worker pool, timeouts, and retries
- Algorithms: Least-loaded and label-affinity selection, spread across zones, gossip-based state sync, heartbeat failure detection, Raft-lite leader election
- Network simulation: Configurable latency (50–500 ms) and partition scenarios
cluster-sim/
├── pkg/
│ ├── node/ # Node representation, manager, selection strategies
│ ├── state/ # Distributed store, replication, consensus
│ ├── batch/ # Batch processor, worker pool, scheduler
│ ├── network/ # Latency and partition simulation
│ ├── protocol/ # Gossip, heartbeat, leader election
│ └── metrics/ # Metrics collection and reporting
├── scenarios/ # Batch deploy, node failure, partition, state sync
├── main.go
├── go.mod
└── README.md
- Go 1.21+
# Build
go build -o cluster-sim .
# or
make build
# Run a scenario
go run main.go simulate batch-deploy --nodes 30
go run main.go simulate node-failure --failure-rate 0.1
go run main.go simulate network-partition --partition-time 30s
go run main.go simulate state-sync
# Benchmark
go run main.go benchmark state-sync --iterations 100
go run main.go benchmark batch-deploy --iterations 50
# Visualize cluster state (refreshes every 1s)
go run main.go visualize --refresh 1s

Using the Makefile:
make build
make run-simulate-batch
make run-simulate-failure
make run-simulate-partition
make run-simulate-sync
make run-benchmark
make run-visualize
make test

- Least-loaded: Sort nodes by utilization (CPU + memory) and pick the N least loaded. Reduces hotspots.
- Label-affinity: Filter nodes that match all required labels (e.g. env=prod,zone=zone-a). Used for placement constraints.
- Spread: Round-robin across zones/regions so replicas are spread for availability.
- Random with constraints: Random choice among nodes that satisfy min capacity and optional labels.
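As a rough illustration of the first two strategies, the sketch below sorts nodes by combined CPU and memory utilization and filters on required labels. The `Node` struct, its fields, and the selection functions are illustrative assumptions, not the simulator's `pkg/node` API.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a simplified stand-in for the simulator's node type.
type Node struct {
	Name            string
	CPUUsed, CPUCap float64
	MemUsed, MemCap float64
	Labels          map[string]string
}

// utilization combines CPU and memory usage into a single load score.
func (n Node) utilization() float64 {
	return n.CPUUsed/n.CPUCap + n.MemUsed/n.MemCap
}

// selectLeastLoaded returns the k nodes with the lowest combined utilization.
func selectLeastLoaded(nodes []Node, k int) []Node {
	sorted := append([]Node(nil), nodes...) // copy so the caller's order is untouched
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].utilization() < sorted[j].utilization()
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

// matchesLabels reports whether a node carries every required label.
func matchesLabels(n Node, required map[string]string) bool {
	for k, v := range required {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	nodes := []Node{
		{Name: "node-01", CPUUsed: 2, CPUCap: 8, MemUsed: 4, MemCap: 16, Labels: map[string]string{"zone": "zone-a"}},
		{Name: "node-02", CPUUsed: 6, CPUCap: 8, MemUsed: 12, MemCap: 16, Labels: map[string]string{"zone": "zone-a"}},
		{Name: "node-03", CPUUsed: 1, CPUCap: 8, MemUsed: 2, MemCap: 16, Labels: map[string]string{"zone": "zone-b"}},
	}
	// Label-affinity first, then least-loaded among the matching candidates.
	var candidates []Node
	for _, n := range nodes {
		if matchesLabels(n, map[string]string{"zone": "zone-a"}) {
			candidates = append(candidates, n)
		}
	}
	for _, n := range selectLeastLoaded(candidates, 1) {
		fmt.Println("selected:", n.Name)
	}
}
```

Spread and random-with-constraints follow the same shape: filter the candidate set first, then apply the placement rule.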
- Versioned store: Each key has a logical version; writes increment it. Used for conflict detection.
- Quorum: Write succeeds when at least W = N/2+1 replicas accept. Reads can use ONE, QUORUM, or ALL.
- Conflict resolution: Last-write-wins using version (and timestamp) when merging state from gossip or after partition heal.
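A minimal sketch of the versioning and last-write-wins rules, assuming a simplified `Entry`/`Store` pair and a `quorum` helper; none of these names mirror the actual `pkg/state` types.

```go
package main

import "fmt"

// Entry is an illustrative versioned value; the real store's types may differ.
type Entry struct {
	Value     string
	Version   uint64 // logical version, incremented on every local write
	Timestamp int64  // tie-breaker when versions are equal
}

// Store is a minimal versioned key-value map.
type Store map[string]Entry

// Set writes a value and increments the key's logical version.
func (s Store) Set(key, value string, ts int64) {
	e := s[key]
	s[key] = Entry{Value: value, Version: e.Version + 1, Timestamp: ts}
}

// Merge applies last-write-wins: the higher version wins, with the timestamp
// as a tie-breaker when versions are equal.
func (s Store) Merge(key string, remote Entry) {
	local, ok := s[key]
	if !ok || remote.Version > local.Version ||
		(remote.Version == local.Version && remote.Timestamp > local.Timestamp) {
		s[key] = remote
	}
}

// quorum returns the minimum acknowledgements for a write: W = N/2 + 1.
func quorum(replicas int) int { return replicas/2 + 1 }

func main() {
	a, b := Store{}, Store{}
	a.Set("conflict-key", "value-a", 100)
	b.Set("conflict-key", "value-b", 200) // same version, later timestamp

	a.Merge("conflict-key", b["conflict-key"])
	fmt.Println(a["conflict-key"].Value) // value-b wins on timestamp
	fmt.Println(quorum(5))               // a 5-replica write needs 3 acks
}
```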
- Epidemic propagation: Periodically each node picks a small random set of peers (fanout) and exchanges state.
- Merge: Received state is merged locally; higher version wins. Ensures eventual consistency across the cluster.
- Convergence: With enough rounds, all participating nodes converge to the same set of keys (under no new writes and no partitions).
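The loop below sketches one synchronous gossip round under the assumption that state reduces to a key → version map; the project's implementation exchanges richer entries over the simulated network, so treat this only as the shape of the fanout-and-merge step.

```go
package main

import (
	"fmt"
	"math/rand"
)

// peerState is an illustrative per-node view of the store: key -> version.
type peerState map[string]uint64

// merge keeps the higher version per key, mirroring the rule above.
func merge(dst, src peerState) {
	for k, v := range src {
		if v > dst[k] {
			dst[k] = v
		}
	}
}

// gossipRound has every node exchange state with `fanout` random peers
// (push-pull), so both sides end the exchange with the newer versions.
func gossipRound(cluster []peerState, fanout int) {
	for i := range cluster {
		for f := 0; f < fanout; f++ {
			j := rand.Intn(len(cluster))
			if j == i {
				continue
			}
			merge(cluster[i], cluster[j])
			merge(cluster[j], cluster[i])
		}
	}
}

func main() {
	cluster := make([]peerState, 5)
	for i := range cluster {
		cluster[i] = peerState{}
	}
	cluster[0]["key-1"] = 3 // a write lands on a single node

	// A handful of rounds is typically enough for a small cluster to converge.
	for round := 0; round < 3; round++ {
		gossipRound(cluster, 2)
	}
	for i, s := range cluster {
		fmt.Printf("node-%02d sees key-1 at version %d\n", i+1, s["key-1"])
	}
}
```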
- Roles: Follower, Candidate, Leader.
- Election: On timeout, node becomes Candidate, increments term, votes for itself, and requests votes from others. Majority wins.
- Heartbeat: Leader sends heartbeats; followers reset timeout. Used to detect leader failure and trigger re-election.
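A compact sketch of the voting rule only (single term, no log comparison, no timers); `peer`, `requestVote`, and `runElection` are illustrative names, not the `pkg/protocol` API.

```go
package main

import "fmt"

// peer is an illustrative follower that grants at most one vote per term.
type peer struct {
	term     uint64
	votedFor string
}

// requestVote grants the vote if the candidate's term is at least as new and
// this peer has not already voted for someone else in that term.
func (p *peer) requestVote(candidate string, term uint64) bool {
	if term < p.term {
		return false
	}
	if term > p.term {
		p.term, p.votedFor = term, "" // a newer term resets the vote
	}
	if p.votedFor == "" || p.votedFor == candidate {
		p.votedFor = candidate
		return true
	}
	return false
}

// runElection performs the candidate step: vote for self, request votes from
// the other nodes, and win on a strict majority of the cluster.
func runElection(candidate string, term uint64, peers []*peer) bool {
	votes := 1 // the candidate votes for itself
	for _, p := range peers {
		if p.requestVote(candidate, term) {
			votes++
		}
	}
	clusterSize := len(peers) + 1
	return votes > clusterSize/2
}

func main() {
	peers := []*peer{{}, {}, {}, {}} // four followers + one candidate = 5 nodes
	if runElection("node-01", 1, peers) {
		fmt.Println("Leader elected: node-01")
	}
}
```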
- Worker pool: Fixed number of workers consume node jobs from a channel.
- Per-node timeout: Each node operation is bounded by a timeout; slow/failed nodes don’t block others.
- Retries: Configurable retries with exponential backoff for transient failures.
- Aggregation: Results are aggregated into succeeded/failed/timeout counts and per-node status.
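A condensed sketch of this pattern: a fixed worker pool drains a job channel, each attempt is bounded by a context timeout, and failures back off exponentially. `deployToNode` and all constants are placeholders rather than the `pkg/batch` API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// deployToNode stands in for a per-node operation; it fails randomly so the
// retry path is exercised.
func deployToNode(ctx context.Context, node string) error {
	select {
	case <-time.After(time.Duration(rand.Intn(100)) * time.Millisecond):
		if rand.Float64() < 0.3 {
			return errors.New("transient failure")
		}
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// withRetries bounds each attempt with a timeout and backs off exponentially
// between attempts.
func withRetries(node string, attempts int, timeout time.Duration) error {
	backoff := 50 * time.Millisecond
	var err error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err = deployToNode(ctx, node)
		cancel()
		if err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}

func main() {
	nodes := []string{"node-01", "node-02", "node-03", "node-04", "node-05"}
	jobs := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	succeeded, failed := 0, 0

	// Fixed pool of workers consuming node jobs from a channel.
	for w := 0; w < 3; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for node := range jobs {
				if err := withRetries(node, 3, 200*time.Millisecond); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				} else {
					mu.Lock()
					succeeded++
					mu.Unlock()
				}
			}
		}()
	}
	for _, n := range nodes {
		jobs <- n
	}
	close(jobs)
	wg.Wait()
	fmt.Printf("Succeeded: %d, Failed: %d\n", succeeded, failed)
}
```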
- Latency: Random delay between min and max (e.g. 50–500 ms) per logical message.
- Partitions: Nodes can be split into disjoint partitions; communication only within the same partition. Simulates split-brain and recovery.
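A toy version of both mechanisms, assuming a partition is just an integer label per node and latency is a uniform random delay; the real `pkg/network` package may model more detail.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// network is an illustrative simulator: each node has a partition label, and
// messages only pass between nodes in the same partition.
type network struct {
	minLatency, maxLatency time.Duration
	partitionOf            map[string]int
}

// send delivers one logical message after a random delay, or fails if the two
// nodes sit on opposite sides of a partition.
func (n *network) send(from, to string) error {
	if n.partitionOf[from] != n.partitionOf[to] {
		return fmt.Errorf("%s -> %s: partitioned", from, to)
	}
	delay := n.minLatency + time.Duration(rand.Int63n(int64(n.maxLatency-n.minLatency)))
	time.Sleep(delay)
	return nil
}

func main() {
	net := &network{
		minLatency: 50 * time.Millisecond,
		maxLatency: 500 * time.Millisecond,
		partitionOf: map[string]int{
			"node-01": 0, "node-02": 0, // partition A
			"node-03": 1, // partition B
		},
	}
	fmt.Println(net.send("node-01", "node-02")) // <nil>: delivered after a delay
	fmt.Println(net.send("node-01", "node-03")) // error: partitioned
}
```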
| Scenario | What it demonstrates |
|---|---|
| batch-deploy | Spread selection, workload deployment, simulated mid-run failure of two nodes, retries, and final status |
| node-failure | Random failures, leader election, workload migration from failed nodes |
| network-partition | Split cluster into two partitions, write in one, heal partition, state reconciliation |
| state-sync | Write 1000 keys, gossip to 5 replicas, convergence time, conflict resolution |
Batch Deployment Scenario
Nodes in cluster: 30, Selected (spread): 10
Simulated failure of node-10 and node-11 during deployment.
--- Result ---
Total time: 1.5s
Succeeded: 8
Failed: 2
Timeout: 0
Per-node status:
node-07: Succeeded
node-10: Failed (node not healthy)
...
Node Failure Scenario
Starting with 30 healthy nodes.
Failed nodes: [node-28 node-22 node-06]
Leader elected: node-01
State after failure: Online=27 Offline=3 Degraded=0
Workload migration: from failed nodes -> node-30 (example target)
State Synchronization Scenario
Replicas: 5, writing 1000 keys.
Wrote 1000 keys in 514µs.
Convergence: 5/5 stores have 1000 keys after 2.0s.
Conflict resolution (merge): conflict-key = value-b
go test ./pkg/... -v
make test

Unit tests cover:
- Node add/remove workload, utilization, state updates, capacity errors
- Selector: least-loaded, label-affinity, spread, random
- State store: set/get, versioning, merge
- Batch processor: success, failure handling, retries
The pkg/metrics package provides a collector and reporter for:
- Performance: Batch completion time, state-sync latency, gossip convergence, election duration
- Reliability: Batch success rate, consistency rate, failure detection and recovery time
- Resource: Utilization, memory, goroutine count (if wired in)
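For orientation only, a collector along these lines might record named durations and counters and report simple aggregates; this is a hypothetical sketch, not the `pkg/metrics` API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// collector records named durations and counters; purely illustrative.
type collector struct {
	mu        sync.Mutex
	durations map[string][]time.Duration
	counters  map[string]int
}

func newCollector() *collector {
	return &collector{
		durations: map[string][]time.Duration{},
		counters:  map[string]int{},
	}
}

// ObserveDuration records a latency sample under a metric name.
func (c *collector) ObserveDuration(name string, d time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.durations[name] = append(c.durations[name], d)
}

// Inc bumps a named counter (e.g. succeeded/failed batch operations).
func (c *collector) Inc(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counters[name]++
}

// Report prints sample counts, averages, and counter totals.
func (c *collector) Report() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for name, samples := range c.durations {
		var total time.Duration
		for _, d := range samples {
			total += d
		}
		fmt.Printf("%s: %d samples, avg %v\n", name, len(samples), total/time.Duration(len(samples)))
	}
	for name, n := range c.counters {
		fmt.Printf("%s: %d\n", name, n)
	}
}

func main() {
	c := newCollector()
	c.ObserveDuration("batch_completion", 1500*time.Millisecond)
	c.Inc("batch_succeeded")
	c.Report()
}
```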
- github.com/google/uuid – unique IDs for jobs/workloads
- github.com/olekukonko/tablewriter – CLI tables for visualize and reports
Use this project as needed for learning and as a portfolio demonstration of distributed systems concepts in Go.