cluster-sim

A distributed-system simulator for edge computing scenarios. The project demonstrates node selection algorithms, state synchronization, batch processing with failure handling, and consensus protocols relevant to distributed systems.

Features

  • Node simulation: 20–50 nodes with labels, capacity, health (Online/Offline/Degraded), and configurable failure rate
  • State management: Distributed key-value store with versioning, quorum replication, and conflict resolution (last-write-wins)
  • Batch operations: Deploy, config updates, and health checks across node groups with worker pool, timeouts, and retries
  • Algorithms: Least-loaded and label-affinity selection, spread across zones, gossip-based state sync, heartbeat failure detection, Raft-lite leader election
  • Network simulation: Configurable latency (50–500 ms) and partition scenarios

Project structure

cluster-sim/
├── pkg/
│   ├── node/          # Node representation, manager, selection strategies
│   ├── state/         # Distributed store, replication, consensus
│   ├── batch/         # Batch processor, worker pool, scheduler
│   ├── network/       # Latency and partition simulation
│   ├── protocol/      # Gossip, heartbeat, leader election
│   └── metrics/       # Metrics collection and reporting
├── scenarios/         # Batch deploy, node failure, partition, state sync
├── main.go
├── go.mod
└── README.md

Requirements

  • Go 1.21+

Build and run

# Build
go build -o cluster-sim .
# or
make build

# Run a scenario
go run main.go simulate batch-deploy --nodes 30
go run main.go simulate node-failure --failure-rate 0.1
go run main.go simulate network-partition --partition-time 30s
go run main.go simulate state-sync

# Benchmark
go run main.go benchmark state-sync --iterations 100
go run main.go benchmark batch-deploy --iterations 50

# Visualize cluster state (refreshes every 1s)
go run main.go visualize --refresh 1s

Using the Makefile:

make build
make run-simulate-batch
make run-simulate-failure
make run-simulate-partition
make run-simulate-sync
make run-benchmark
make run-visualize
make test

Algorithms and concepts

1. Node selection

  • Least-loaded: Sort nodes by utilization (CPU + memory) and pick the N least loaded; reduces hotspots. See the sketch after this list.
  • Label-affinity: Filter nodes that match all required labels (e.g. env=prod, zone=zone-a). Used for placement constraints.
  • Spread: Round-robin across zones/regions so replicas are spread for availability.
  • Random with constraints: Random choice among nodes that satisfy min capacity and optional labels.
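
A minimal sketch of the least-loaded strategy in Go; the Node type, its fields, and SelectLeastLoaded are illustrative names, not the repository's actual API:

package node

import "sort"

type Node struct {
	ID      string
	CPUUsed float64 // fraction of CPU capacity in use, 0..1
	MemUsed float64 // fraction of memory capacity in use, 0..1
}

// Utilization folds CPU and memory load into one score for ranking.
func (n Node) Utilization() float64 { return n.CPUUsed + n.MemUsed }

// SelectLeastLoaded returns the k nodes with the lowest utilization.
func SelectLeastLoaded(nodes []Node, k int) []Node {
	sorted := make([]Node, len(nodes))
	copy(sorted, nodes)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Utilization() < sorted[j].Utilization()
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}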

2. State management

  • Versioned store: Each key has a logical version; writes increment it. Used for conflict detection.
  • Quorum: A write succeeds when at least W = N/2 + 1 replicas accept (integer division, i.e. a strict majority); the arithmetic is sketched below. Reads can use ONE, QUORUM, or ALL.
  • Conflict resolution: Last-write-wins using version (and timestamp) when merging state from gossip or after partition heal.
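
The quorum check itself is small enough to show directly; a sketch assuming a plain ack count (function names are hypothetical):

package state

// quorumSize returns the write quorum for N replicas: W = N/2 + 1
// with integer division, i.e. a strict majority.
func quorumSize(replicas int) int {
	return replicas/2 + 1
}

// writeSucceeded reports whether enough replicas acknowledged a write.
func writeSucceeded(acks, replicas int) bool {
	return acks >= quorumSize(replicas)
}

For N = 5 this gives W = 3, so a write can commit even with two replicas down.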

3. Gossip protocol

  • Epidemic propagation: Periodically each node picks a small random set of peers (fanout) and exchanges state.
  • Merge: Received state is merged locally; the higher version wins (sketched below). Ensures eventual consistency across the cluster.
  • Convergence: With enough rounds, all participating nodes converge to the same set of keys (under no new writes and no partitions).
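
A sketch of the merge step under last-write-wins, with a hypothetical VersionedValue type standing in for the store's entries:

package protocol

type VersionedValue struct {
	Value   string
	Version uint64 // logical version; higher wins on conflict
}

// merge folds a peer's state into the local map. Taking the higher
// version is commutative and idempotent, so repeated gossip rounds
// converge regardless of the order in which peers exchange state.
func merge(local, remote map[string]VersionedValue) {
	for k, rv := range remote {
		if lv, ok := local[k]; !ok || rv.Version > lv.Version {
			local[k] = rv
		}
	}
}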

4. Leader election (Raft-lite)

  • Roles: Follower, Candidate, Leader.
  • Election: On timeout, a node becomes Candidate, increments its term, votes for itself, and requests votes from the others. A majority wins; see the sketch below.
  • Heartbeat: Leader sends heartbeats; followers reset timeout. Used to detect leader failure and trigger re-election.
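
A sketch of the follower-to-candidate transition on election timeout; the raftNode type and the requestVote callback are assumptions for illustration, not the project's actual types:

package protocol

type role int

const (
	follower role = iota
	candidate
	leader
)

type raftNode struct {
	id    string
	peers []string
	role  role
	term  uint64
	vote  string // who this node voted for in the current term
}

// onElectionTimeout fires when no heartbeat arrives in time: the node
// becomes a candidate, increments its term, votes for itself, and
// requests votes from its peers. A majority of the full cluster
// (peers plus itself) makes it leader.
func (n *raftNode) onElectionTimeout(requestVote func(peer string, term uint64) bool) {
	n.role = candidate
	n.term++
	n.vote = n.id
	votes := 1 // its own vote
	for _, p := range n.peers {
		if requestVote(p, n.term) {
			votes++
		}
	}
	if votes > (len(n.peers)+1)/2 {
		n.role = leader
	}
}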

5. Batch processing

  • Worker pool: A fixed number of workers consumes node jobs from a channel (sketched after this list).
  • Per-node timeout: Each node operation is bounded by a timeout; slow/failed nodes don’t block others.
  • Retries: Configurable retries with exponential backoff for transient failures.
  • Aggregation: Results are aggregated into succeeded/failed/timeout counts and per-node status.
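
A self-contained sketch of the worker pool with a per-node timeout; retries and backoff are omitted for brevity, and every name here is illustrative rather than the project's actual API:

package batch

import (
	"context"
	"sync"
	"time"
)

// Result records the outcome of one node operation.
type Result struct {
	NodeID string
	Err    error
}

// Run fans node IDs out to a fixed pool of workers. Each operation is
// handed a context with a deadline, so an op that honors cancellation
// cannot block the rest of the batch.
func Run(nodeIDs []string, workers int, timeout time.Duration,
	op func(ctx context.Context, nodeID string) error) []Result {

	jobs := make(chan string)
	results := make(chan Result, len(nodeIDs)) // buffered: workers never block
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				ctx, cancel := context.WithTimeout(context.Background(), timeout)
				results <- Result{NodeID: id, Err: op(ctx, id)}
				cancel()
			}
		}()
	}

	for _, id := range nodeIDs {
		jobs <- id
	}
	close(jobs)
	wg.Wait()
	close(results)

	out := make([]Result, 0, len(nodeIDs))
	for r := range results {
		out = append(out, r)
	}
	return out
}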

6. Network simulation

  • Latency: Random delay between min and max (e.g. 50–500 ms) per logical message; sketched below.
  • Partitions: Nodes can be split into disjoint partitions; communication only within the same partition. Simulates split-brain and recovery.
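
Uniform-random latency injection is essentially one line; a sketch assuming max > min (the function name is hypothetical):

package network

import (
	"math/rand"
	"time"
)

// Delay sleeps for a uniform-random duration in [min, max), e.g.
// 50–500 ms per logical message.
func Delay(min, max time.Duration) {
	time.Sleep(min + time.Duration(rand.Int63n(int64(max-min))))
}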

Scenarios

Scenario            What it demonstrates
batch-deploy        Spread selection, workload deployment, two nodes failing mid-run, retries, and final status
node-failure        Random failures, leader election, workload migration from failed nodes
network-partition   Cluster split into two partitions, write in one, partition heal, state reconciliation
state-sync          Write 1000 keys, gossip to 5 replicas, convergence time, conflict resolution

Example outputs

Batch deployment

Batch Deployment Scenario
Nodes in cluster: 30, Selected (spread): 10

Simulated failure of node-10 and node-11 during deployment.

--- Result ---
Total time: 1.5s
Succeeded: 8
Failed: 2
Timeout: 0
Per-node status:
  node-07: Succeeded
  node-10: Failed (node not healthy)
  ...

Node failure

Node Failure Scenario
Starting with 30 healthy nodes.
Failed nodes: [node-28 node-22 node-06]
Leader elected: node-01
State after failure: Online=27 Offline=3 Degraded=0
Workload migration: from failed nodes -> node-30 (example target)

State sync

State Synchronization Scenario
Replicas: 5, writing 1000 keys.
Wrote 1000 keys in 514µs.
Convergence: 5/5 stores have 1000 keys after 2.0s.
Conflict resolution (merge): conflict-key = value-b

Testing

go test ./pkg/... -v
make test

Unit tests cover:

  • Node add/remove workload, utilization, state updates, capacity errors
  • Selector: least-loaded, label-affinity, spread, random
  • State store: set/get, versioning, merge
  • Batch processor: success, failure handling, retries

Metrics (conceptual)

The pkg/metrics package provides a collector and reporter for:

  • Performance: Batch completion time, state-sync latency, gossip convergence, election duration
  • Reliability: Batch success rate, consistency rate, failure detection and recovery time
  • Resource: Utilization, memory, goroutine count (if wired in)

Dependencies

  • github.com/google/uuid – unique IDs for jobs/workloads
  • github.com/olekukonko/tablewriter – CLI tables for visualize and reports

License

Free to use for learning and as a portfolio demonstration of distributed-systems concepts in Go.
