A Kubernetes operator in Go that deploys workloads across multiple nodes in batches, with configurable concurrency, retries, and status reporting.
- Custom Resource (BatchWorkload): Define desired replicas, node selection (labels or explicit names), container image, command, batch size, and timeout.
- Reconciliation loop: Watches BatchWorkload resources and deploys Pods on selected nodes.
- Batch processing: Worker-pool pattern with configurable
spec.batchSizefor concurrent node processing. - Error handling: Exponential backoff retries, status conditions, and manual retry via annotation
batch.example.com/retry. - Status reporting: Phase (Pending/Running/Succeeded/Failed), per-node status, and Kubernetes Events.
- Go 1.21+
- Kubernetes cluster (e.g. kind, minikube) and
kubectlconfigured - Optional: controller-gen for regenerating CRD/manifests
batch-workload-operator/
├── api/v1alpha1/
│ └── batchworkload_types.go # CRD types and validation
├── controllers/
│ └── batchworkload_controller.go # Reconciliation logic
├── pkg/
│ ├── batch/processor.go # Batch processing with worker pool
│ ├── node/selector.go # Node selection (labels / names)
│ └── workload/deployer.go # Pod deployment and wait
├── config/
│ ├── crd/bases/ # CRD YAML
│ ├── rbac/ # RBAC (ClusterRole, RoleBinding)
│ └── samples/ # Example BatchWorkloads
├── main.go
├── Makefile
└── README.md
make installThis applies:
config/crd/bases/batch.example.com_batchworkloads.yamlconfig/rbac/role.yaml(ClusterRole + ClusterRoleBinding + namespace)config/samples/(example BatchWorkloads)
make runOr build and run:
make build
./bin/managerEnsure your kubeconfig targets the cluster where you installed the CRD (e.g. kubectl config use-context kind-kind).
Example using node labels:
apiVersion: batch.example.com/v1alpha1
kind: BatchWorkload
metadata:
name: test-workload
spec:
replicas: 5
nodeSelector:
region: us-west
type: edge
image: nginx:latest
command: ["nginx", "-g", "daemon off;"]
batchSize: 2
timeout: 60sOr with explicit node names:
apiVersion: batch.example.com/v1alpha1
kind: BatchWorkload
metadata:
name: my-workload
spec:
replicas: 3
nodeNames:
- node-1
- node-2
- node-3
image: busybox:latest
command: ["sleep", "3600"]
batchSize: 2
timeout: 120sApply:
kubectl apply -f config/samples/batch.example.com_v1alpha1_batchworkload.yamlkubectl get batchworkloads
kubectl get bwl
kubectl describe batchworkload test-workload
kubectl get podsStatus fields include:
status.phase: Pending | Running | Succeeded | Failedstatus.totalNodes,status.succeededNodes,status.failedNodes,status.pendingNodesstatus.nodeStatus[]: per-node phase, message, pod name
To retry a failed or succeeded workload:
kubectl annotate batchworkload test-workload batch.example.com/retry=$(date +%s) --overwriteThe controller will clear status and reconcile again.
Deleting a BatchWorkload removes the finalizer after cleaning up owned Pods:
kubectl delete batchworkload test-workload| Target | Description |
|---|---|
make manifests |
Generate CRD and RBAC (controller-gen) |
make generate |
Generate code (e.g. deepcopy) |
make install |
Apply CRD, RBAC, and samples |
make run |
Run the operator locally |
make build |
Build bin/manager |
make test |
Run tests |
make test-coverage |
Run tests and open coverage report |
| Field | Type | Description |
|---|---|---|
spec.replicas |
int32 | Number of instances (nodes) to deploy (1–1000). |
spec.nodeSelector |
map[string]string | Label selector for target nodes. |
spec.nodeNames |
[]string | Explicit node names (overrides nodeSelector). |
spec.image |
string | Container image to run. |
spec.command |
[]string | Command to execute. |
spec.args |
[]string | Arguments to the command. |
spec.batchSize |
int32 | Nodes processed concurrently (1–100). |
spec.timeout |
string | Per-node operation timeout (e.g. 60s, 5m). |
status.phase |
string | Pending / Running / Succeeded / Failed. |
status.nodeStatus |
[]NodeStatus | Per-node phase, message, pod name. |
Run unit tests:
go test ./...With coverage:
make test-coverageTests include:
- Node selector: Label selector, explicit names, limit, readiness filter.
- Batch processor: Concurrency, timeout, context cancel, retries.
- Workload deployer: List/delete pods, deploy (with fake client).
- Controller: Reconcile not found, finalizer addition, conditions, parseDuration.
- Metrics server:
:8080(Prometheus metrics from controller-runtime). - Health/ready probes:
:8081.
Owned by abhicodes11.