Batch Workload Operator

A Kubernetes operator in Go that deploys workloads across multiple nodes in batches, with configurable concurrency, retries, and status reporting.

Features

Custom Resource (BatchWorkload): Define desired replicas, node selection (labels or explicit names), container image, command, batch size, and timeout.
Reconciliation loop: Watches BatchWorkload resources and deploys Pods on selected nodes.
Batch processing: Worker-pool pattern with configurable spec.batchSize for concurrent node processing.
Error handling: Exponential backoff retries, status conditions, and manual retry via the batch.example.com/retry annotation.
Status reporting: Phase (Pending/Running/Succeeded/Failed), per-node status, and Kubernetes Events.
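
The worker-pool pattern mentioned under batch processing can be sketched as follows. This is a minimal illustration, not the code in pkg/batch/processor.go: the names Result, ProcessNodes, and runExample are hypothetical, and the only behaviour taken from this README is that at most spec.batchSize nodes are processed at the same time.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Result records the outcome of processing one node (hypothetical type).
type Result struct {
	Node string
	Err  error
}

// ProcessNodes runs fn for every node using at most batchSize concurrent
// workers and returns one Result per node, in input order.
// Hypothetical helper; the operator's real processor may look different.
func ProcessNodes(ctx context.Context, nodes []string, batchSize int,
	fn func(ctx context.Context, node string) error) []Result {
	if batchSize < 1 {
		batchSize = 1
	}
	results := make([]Result, len(nodes))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < batchSize; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handled by exactly one goroutine, so no lock is needed.
				results[i] = Result{Node: nodes[i], Err: fn(ctx, nodes[i])}
			}
		}()
	}

	for i := range nodes {
		select {
		case jobs <- i:
		case <-ctx.Done():
			// Nodes that were never handed to a worker report the context error.
			results[i] = Result{Node: nodes[i], Err: ctx.Err()}
		}
	}
	close(jobs)
	wg.Wait()
	return results
}

// runExample processes three nodes with a batch size of 2; node-2 fails.
func runExample() []Result {
	nodes := []string{"node-1", "node-2", "node-3"}
	return ProcessNodes(context.Background(), nodes, 2,
		func(ctx context.Context, node string) error {
			if node == "node-2" {
				return errors.New("simulated failure")
			}
			return nil
		})
}

func main() {
	for _, r := range runExample() {
		fmt.Printf("%s: err=%v\n", r.Node, r.Err)
	}
}
```

Writing each result into its own slice index keeps the output in input order without a mutex, which makes per-node status easy to report afterwards.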

Prerequisites

Go 1.21+
Kubernetes cluster (e.g. kind, minikube) and kubectl configured
Optional: controller-gen for regenerating CRD/manifests

Project Structure

batch-workload-operator/
├── api/v1alpha1/
│   └── batchworkload_types.go   # CRD types and validation
├── controllers/
│   └── batchworkload_controller.go # Reconciliation logic
├── pkg/
│   ├── batch/processor.go       # Batch processing with worker pool
│   ├── node/selector.go         # Node selection (labels / names)
│   └── workload/deployer.go     # Pod deployment and wait
├── config/
│   ├── crd/bases/               # CRD YAML
│   ├── rbac/                    # RBAC (ClusterRole, ClusterRoleBinding)
│   └── samples/                 # Example BatchWorkloads
├── main.go
├── Makefile
└── README.md

Setup and Usage

1. Install CRD and RBAC

make install

This applies:

config/crd/bases/batch.example.com_batchworkloads.yaml
config/rbac/role.yaml (ClusterRole + ClusterRoleBinding + namespace)
config/samples/ (example BatchWorkloads)

2. Run the operator locally

make run

Or build and run:

make build
./bin/manager

Ensure your kubeconfig targets the cluster where you installed the CRD (e.g. kubectl config use-context kind-kind).

3. Create a BatchWorkload

Example using node labels:

apiVersion: batch.example.com/v1alpha1
kind: BatchWorkload
metadata:
  name: test-workload
spec:
  replicas: 5
  nodeSelector:
    region: us-west
    type: edge
  image: nginx:latest
  command: ["nginx", "-g", "daemon off;"]
  batchSize: 2
  timeout: 60s

Or with explicit node names:

apiVersion: batch.example.com/v1alpha1
kind: BatchWorkload
metadata:
  name: my-workload
spec:
  replicas: 3
  nodeNames:
    - node-1
    - node-2
    - node-3
  image: busybox:latest
  command: ["sleep", "3600"]
  batchSize: 2
  timeout: 120s

Apply:

kubectl apply -f config/samples/batch.example.com_v1alpha1_batchworkload.yaml
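
The two targeting modes shown in the examples above (nodeSelector labels versus explicit nodeNames) could be resolved to a node list along these lines. This is a sketch under assumptions: Node is a simplified stand-in for the Kubernetes Node object, SelectNodes is a hypothetical function, and applying the readiness filter and the limit to both modes is a guess based on the test descriptions in this README, not confirmed behaviour of pkg/node/selector.go.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a simplified stand-in for a Kubernetes Node (hypothetical type).
type Node struct {
	Name   string
	Labels map[string]string
	Ready  bool
}

// SelectNodes returns the names of the target nodes, sorted by name.
// If nodeNames is non-empty it takes precedence over selector; otherwise a
// node matches when it carries every key/value pair in selector (an empty
// selector matches every node). Nodes that are not ready are skipped, and
// the result is capped at limit (limit <= 0 means no cap).
func SelectNodes(nodes []Node, selector map[string]string, nodeNames []string, limit int) []string {
	wanted := make(map[string]bool, len(nodeNames))
	for _, name := range nodeNames {
		wanted[name] = true
	}

	var out []string
	for _, n := range nodes {
		if !n.Ready {
			continue
		}
		if len(nodeNames) > 0 {
			if !wanted[n.Name] {
				continue
			}
		} else if !matches(n.Labels, selector) {
			continue
		}
		out = append(out, n.Name)
	}

	sort.Strings(out)
	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}

// matches reports whether labels contains every pair in selector.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	nodes := []Node{
		{Name: "node-1", Labels: map[string]string{"region": "us-west", "type": "edge"}, Ready: true},
		{Name: "node-2", Labels: map[string]string{"region": "us-west", "type": "core"}, Ready: true},
		{Name: "node-3", Labels: map[string]string{"region": "us-west", "type": "edge"}, Ready: false},
	}
	selector := map[string]string{"region": "us-west", "type": "edge"}
	fmt.Println(SelectNodes(nodes, selector, nil, 5))
	fmt.Println(SelectNodes(nodes, selector, []string{"node-2"}, 5))
}
```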

4. Check status

kubectl get batchworkloads
kubectl get bwl
kubectl describe batchworkload test-workload
kubectl get pods

Status fields include:

status.phase: Pending | Running | Succeeded | Failed
status.totalNodes, status.succeededNodes, status.failedNodes, status.pendingNodes
status.nodeStatus[]: per-node phase, message, pod name

5. Manual retry after failure

To re-run a workload that has already finished (Failed or Succeeded), set the retry annotation to a new value:

kubectl annotate batchworkload test-workload batch.example.com/retry=$(date +%s) --overwrite

The controller then clears the workload's status and reconciles it again.
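
The command above writes a fresh timestamp each time, which suggests the controller reacts to a changed annotation value. One way a controller could tell a new retry request from one it has already handled is sketched below. How this operator actually tracks handled retries is not documented here, so ShouldRetry and the lastHandled value are purely illustrative.

```go
package main

import "fmt"

// RetryAnnotation is the annotation key documented in this README.
const RetryAnnotation = "batch.example.com/retry"

// ShouldRetry reports whether the annotations carry a retry request that
// differs from the last one handled. It returns the value to remember.
// Hypothetical helper: the real controller may use another mechanism,
// for example removing the annotation once it has been processed.
func ShouldRetry(annotations map[string]string, lastHandled string) (string, bool) {
	v, ok := annotations[RetryAnnotation]
	if !ok || v == "" || v == lastHandled {
		return lastHandled, false
	}
	return v, true
}

func main() {
	annotations := map[string]string{RetryAnnotation: "1700000000"}

	handled, retry := ShouldRetry(annotations, "")
	fmt.Println(handled, retry) // first time the value is seen

	handled, retry = ShouldRetry(annotations, handled)
	fmt.Println(handled, retry) // same value again, nothing to do
}
```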

6. Delete

When a BatchWorkload is deleted, the controller cleans up the Pods it owns and then removes its finalizer so the deletion can complete:

kubectl delete batchworkload test-workload

Makefile Targets

Target	Description
`make manifests`	Generate CRD and RBAC (controller-gen)
`make generate`	Generate code (e.g. deepcopy)
`make install`	Apply CRD, RBAC, and samples
`make run`	Run the operator locally
`make build`	Build `bin/manager`
`make test`	Run tests
`make test-coverage`	Run tests and open coverage report

Custom Resource: BatchWorkload

Field	Type	Description
`spec.replicas`	int32	Number of instances (nodes) to deploy (1–1000).
`spec.nodeSelector`	map[string]string	Label selector for target nodes.
`spec.nodeNames`	[]string	Explicit node names; when set, takes precedence over `spec.nodeSelector`.
`spec.image`	string	Container image to run.
`spec.command`	[]string	Command to execute.
`spec.args`	[]string	Arguments to the command.
`spec.batchSize`	int32	Nodes processed concurrently (1–100).
`spec.timeout`	string	Per-node operation timeout (e.g. `60s`, `5m`).
`status.phase`	string	Pending / Running / Succeeded / Failed.
`status.nodeStatus`	[]NodeStatus	Per-node phase, message, pod name.
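
For readers who prefer code to tables, the spec fields above map naturally onto a Go struct. The sketch below is illustrative only: the real types live in api/v1alpha1/batchworkload_types.go and may differ, and the Validate and TimeoutOrDefault methods are hypothetical helpers that restate the documented ranges and the duration format. The project may well enforce these limits in the CRD schema instead.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// BatchWorkloadSpec mirrors the spec fields in the table above.
// Illustrative only; not copied from the project's API package.
type BatchWorkloadSpec struct {
	Replicas     int32             `json:"replicas"`
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	NodeNames    []string          `json:"nodeNames,omitempty"`
	Image        string            `json:"image"`
	Command      []string          `json:"command,omitempty"`
	Args         []string          `json:"args,omitempty"`
	BatchSize    int32             `json:"batchSize"`
	Timeout      string            `json:"timeout,omitempty"`
}

// Validate checks the ranges documented in the table
// (replicas 1 to 1000, batchSize 1 to 100).
func (s BatchWorkloadSpec) Validate() error {
	if s.Replicas < 1 || s.Replicas > 1000 {
		return errors.New("spec.replicas must be between 1 and 1000")
	}
	if s.BatchSize < 1 || s.BatchSize > 100 {
		return errors.New("spec.batchSize must be between 1 and 100")
	}
	return nil
}

// TimeoutOrDefault parses spec.timeout (for example "60s" or "5m") with
// time.ParseDuration and falls back to def when the field is empty.
// The fallback behaviour is an assumption.
func (s BatchWorkloadSpec) TimeoutOrDefault(def time.Duration) (time.Duration, error) {
	if s.Timeout == "" {
		return def, nil
	}
	d, err := time.ParseDuration(s.Timeout)
	if err != nil {
		return 0, fmt.Errorf("invalid spec.timeout %q: %w", s.Timeout, err)
	}
	if d <= 0 {
		return 0, errors.New("spec.timeout must be positive")
	}
	return d, nil
}

func main() {
	spec := BatchWorkloadSpec{Replicas: 5, Image: "nginx:latest", BatchSize: 2, Timeout: "60s"}
	fmt.Println("valid:", spec.Validate() == nil)

	timeout, err := spec.TimeoutOrDefault(5 * time.Minute)
	fmt.Println(timeout, err)
}
```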

Testing

Run unit tests:

go test ./...

With coverage:

make test-coverage

Tests include:

Node selector: Label selector, explicit names, limit, readiness filter.
Batch processor: Concurrency, timeout, context cancel, retries.
Workload deployer: List/delete pods, deploy (with fake client).
Controller: Reconcile not found, finalizer addition, conditions, parseDuration.
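
The batch processor tests listed above cover retries and context cancellation. A retry helper with exponential backoff that stops when the context is cancelled might look like the sketch below. Backoff, Retry, and runExample are hypothetical names, and the doubling schedule with a cap is an assumption about what "exponential backoff" means in this project.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Backoff returns the delay before retry number attempt (0-based):
// base doubled attempt times, capped at max.
func Backoff(base, max time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// Retry calls op up to attempts times, waiting Backoff(base, max, i)
// after the i-th failure. It returns early if ctx is cancelled while
// waiting, joining the context error with the last error from op.
func Retry(ctx context.Context, attempts int, base, max time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point waiting after the final attempt
		}
		timer := time.NewTimer(Backoff(base, max, i))
		select {
		case <-timer.C:
		case <-ctx.Done():
			timer.Stop()
			return errors.Join(ctx.Err(), err)
		}
	}
	return err
}

// runExample retries an operation that fails twice and then succeeds.
// It returns how many times the operation ran and the final error.
func runExample() (int, error) {
	calls := 0
	err := Retry(context.Background(), 5, time.Millisecond, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := runExample()
	fmt.Println("calls:", calls, "err:", err)
}
```

Keeping Backoff as a pure function makes the delay schedule easy to unit test without sleeping.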

Metrics and Health

Metrics server: :8080 (Prometheus metrics from controller-runtime).
Health/ready probes: :8081.

License

Owned by abhicodes11.
