abhicodes11/Batch-workload-operator

Batch Workload Operator

A Kubernetes operator in Go that deploys workloads across multiple nodes in batches, with configurable concurrency, retries, and status reporting.

Features

  • Custom Resource (BatchWorkload): Define desired replicas, node selection (labels or explicit names), container image, command, batch size, and timeout.
  • Reconciliation loop: Watches BatchWorkload resources and deploys Pods on selected nodes.
  • Batch processing: Worker-pool pattern with configurable spec.batchSize for concurrent node processing.
  • Error handling: Exponential backoff retries, status conditions, and manual retry via annotation batch.example.com/retry.
  • Status reporting: Phase (Pending/Running/Succeeded/Failed), per-node status, and Kubernetes Events.
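The batch-processing and retry features above can be illustrated with a small, self-contained sketch. It is not the code in pkg/batch/processor.go; the names processNodes, retryWithBackoff, and demo, the three-attempt limit, and the 10ms base delay are all assumptions chosen for illustration. It shows a fixed pool of batchSize workers draining a queue of node names, with each node's work retried using a doubling delay.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// retryWithBackoff calls fn up to attempts times, doubling the delay after
// each failure. It stops early if the context is cancelled while waiting.
func retryWithBackoff(ctx context.Context, attempts int, base time.Duration, fn func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
	return err
}

// processNodes runs work for every node with at most batchSize concurrent
// workers and returns the final error per node (nil means success).
func processNodes(ctx context.Context, nodes []string, batchSize int,
	work func(ctx context.Context, node string) error) map[string]error {
	if batchSize < 1 {
		batchSize = 1
	}
	jobs := make(chan string)
	results := make(map[string]error, len(nodes))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for w := 0; w < batchSize; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for node := range jobs {
				err := retryWithBackoff(ctx, 3, 10*time.Millisecond, func() error {
					return work(ctx, node)
				})
				mu.Lock()
				results[node] = err
				mu.Unlock()
			}
		}()
	}
	for _, n := range nodes {
		jobs <- n
	}
	close(jobs)
	wg.Wait()
	return results
}

// demo processes three hypothetical nodes, one of which always fails.
func demo() map[string]error {
	nodes := []string{"node-1", "node-2", "node-3"}
	return processNodes(context.Background(), nodes, 2, func(ctx context.Context, node string) error {
		if node == "node-3" {
			return errors.New("simulated failure")
		}
		return nil
	})
}

func main() {
	for node, err := range demo() {
		fmt.Printf("%s: %v\n", node, err)
	}
}
```

With a batch size of 2, at most two nodes are in flight at any moment; the failing node is attempted three times before its error is recorded.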

Prerequisites

  • Go 1.21+
  • Kubernetes cluster (e.g. kind, minikube) and kubectl configured
  • Optional: controller-gen for regenerating CRD/manifests

Project Structure

batch-workload-operator/
├── api/v1alpha1/
│   └── batchworkload_types.go   # CRD types and validation
├── controllers/
│   └── batchworkload_controller.go # Reconciliation logic
├── pkg/
│   ├── batch/processor.go        # Batch processing with worker pool
│   ├── node/selector.go         # Node selection (labels / names)
│   └── workload/deployer.go     # Pod deployment and wait
├── config/
│   ├── crd/bases/               # CRD YAML
│   ├── rbac/                    # RBAC (ClusterRole, ClusterRoleBinding)
│   └── samples/                 # Example BatchWorkloads
├── main.go
├── Makefile
└── README.md

Setup and Usage

1. Install CRD and RBAC

make install

This applies:

  • config/crd/bases/batch.example.com_batchworkloads.yaml
  • config/rbac/role.yaml (ClusterRole + ClusterRoleBinding + namespace)
  • config/samples/ (example BatchWorkloads)

2. Run the operator locally

make run

Or build and run:

make build
./bin/manager

Ensure your kubeconfig targets the cluster where you installed the CRD (e.g. kubectl config use-context kind-kind).

3. Create a BatchWorkload

Example using node labels:

apiVersion: batch.example.com/v1alpha1
kind: BatchWorkload
metadata:
  name: test-workload
spec:
  replicas: 5
  nodeSelector:
    region: us-west
    type: edge
  image: nginx:latest
  command: ["nginx", "-g", "daemon off;"]
  batchSize: 2
  timeout: 60s

Or with explicit node names:

apiVersion: batch.example.com/v1alpha1
kind: BatchWorkload
metadata:
  name: my-workload
spec:
  replicas: 3
  nodeNames:
    - node-1
    - node-2
    - node-3
  image: busybox:latest
  command: ["sleep", "3600"]
  batchSize: 2
  timeout: 120s

Apply:

kubectl apply -f config/samples/batch.example.com_v1alpha1_batchworkload.yaml
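As a rough illustration of how the two examples above could resolve to target nodes, the sketch below picks nodes either by explicit name or by label match, skips nodes that are not ready, and stops at a limit (for instance spec.replicas). The Node type and the selectNodes function are hypothetical stand-ins, not the contents of pkg/node/selector.go; a real implementation would list Node objects from the Kubernetes API.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for a Kubernetes Node.
type Node struct {
	Name   string
	Labels map[string]string
	Ready  bool
}

// matches reports whether labels contains every key/value pair in selector.
func matches(labels, selector map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// selectNodes returns up to limit ready node names. If names is non-empty it
// takes precedence over selector, mirroring "nodeNames overrides nodeSelector".
// A limit of 0 or less means no limit.
func selectNodes(all []Node, names []string, selector map[string]string, limit int) []string {
	wanted := make(map[string]bool, len(names))
	for _, n := range names {
		wanted[n] = true
	}
	var out []string
	for _, n := range all {
		if limit > 0 && len(out) >= limit {
			break
		}
		if !n.Ready {
			continue
		}
		if len(names) > 0 {
			if !wanted[n.Name] {
				continue
			}
		} else if !matches(n.Labels, selector) {
			continue
		}
		out = append(out, n.Name)
	}
	return out
}

func main() {
	cluster := []Node{
		{Name: "node-1", Labels: map[string]string{"region": "us-west", "type": "edge"}, Ready: true},
		{Name: "node-2", Labels: map[string]string{"region": "us-west", "type": "edge"}, Ready: false},
		{Name: "node-3", Labels: map[string]string{"region": "us-east", "type": "edge"}, Ready: true},
	}
	fmt.Println(selectNodes(cluster, nil, map[string]string{"region": "us-west", "type": "edge"}, 5))
	fmt.Println(selectNodes(cluster, []string{"node-1", "node-3"}, nil, 5))
}
```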

4. Check status

kubectl get batchworkloads
kubectl get bwl
kubectl describe batchworkload test-workload
kubectl get pods

Status fields include:

  • status.phase: Pending | Running | Succeeded | Failed
  • status.totalNodes, status.succeededNodes, status.failedNodes, status.pendingNodes
  • status.nodeStatus[]: per-node phase, message, pod name

5. Manual retry after failure

To re-run a workload that has already finished (Failed or Succeeded), set the retry annotation to a new value:

kubectl annotate batchworkload test-workload batch.example.com/retry=$(date +%s) --overwrite

The controller then clears the workload's status and reconciles it again.
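How the controller detects a retry request is not described here. A minimal sketch, assuming it remembers the last annotation value it acted on and treats any different non-empty value as a new request, might look like this. The function retryRequested and the idea of a stored "last handled" value are assumptions; only the annotation key comes from this README.

```go
package main

import "fmt"

// retryAnnotation is the annotation key documented for manual retries.
const retryAnnotation = "batch.example.com/retry"

// retryRequested reports whether the annotation carries a value that has not
// been handled yet, and returns the value to remember for next time.
func retryRequested(annotations map[string]string, lastHandled string) (string, bool) {
	v, ok := annotations[retryAnnotation]
	if !ok || v == "" || v == lastHandled {
		return lastHandled, false
	}
	return v, true
}

func main() {
	ann := map[string]string{retryAnnotation: "1700000000"}

	last, retry := retryRequested(ann, "")
	fmt.Println(last, retry) // first sighting of the value: retry

	last, retry = retryRequested(ann, last)
	fmt.Println(last, retry) // same value again: no retry
}
```

Using a timestamp as the annotation value, as in the kubectl command above, guarantees the value changes on every invocation.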

6. Delete

When a BatchWorkload is deleted, the controller cleans up the Pods it owns and then removes its finalizer so the deletion can complete:

kubectl delete batchworkload test-workload

Makefile Targets

| Target | Description |
| --- | --- |
| make manifests | Generate CRD and RBAC (controller-gen) |
| make generate | Generate code (e.g. deepcopy) |
| make install | Apply CRD, RBAC, and samples |
| make run | Run the operator locally |
| make build | Build bin/manager |
| make test | Run tests |
| make test-coverage | Run tests and open coverage report |

Custom Resource: BatchWorkload

| Field | Type | Description |
| --- | --- | --- |
| spec.replicas | int32 | Number of instances (nodes) to deploy (1–1000). |
| spec.nodeSelector | map[string]string | Label selector for target nodes. |
| spec.nodeNames | []string | Explicit node names (overrides nodeSelector). |
| spec.image | string | Container image to run. |
| spec.command | []string | Command to execute. |
| spec.args | []string | Arguments to the command. |
| spec.batchSize | int32 | Nodes processed concurrently (1–100). |
| spec.timeout | string | Per-node operation timeout (e.g. 60s, 5m). |
| status.phase | string | Pending / Running / Succeeded / Failed. |
| status.nodeStatus | []NodeStatus | Per-node phase, message, pod name. |
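The spec fields in the table could be modelled and checked roughly as follows. This is a sketch that mirrors the table, not the actual types in api/v1alpha1/batchworkload_types.go (which likely rely on generated CRD validation). The validateSpec function and the rule that image must be non-empty are assumptions; the numeric ranges come from the table, and the timeout is parsed with the standard library's time.ParseDuration, which accepts values such as 60s and 5m.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// BatchWorkloadSpec mirrors the spec fields from the table (illustrative only).
type BatchWorkloadSpec struct {
	Replicas     int32             `json:"replicas"`
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	NodeNames    []string          `json:"nodeNames,omitempty"`
	Image        string            `json:"image"`
	Command      []string          `json:"command,omitempty"`
	Args         []string          `json:"args,omitempty"`
	BatchSize    int32             `json:"batchSize"`
	Timeout      string            `json:"timeout"`
}

// validateSpec checks the documented ranges and returns the parsed timeout.
func validateSpec(s BatchWorkloadSpec) (time.Duration, error) {
	if s.Replicas < 1 || s.Replicas > 1000 {
		return 0, fmt.Errorf("replicas must be between 1 and 1000, got %d", s.Replicas)
	}
	if s.BatchSize < 1 || s.BatchSize > 100 {
		return 0, fmt.Errorf("batchSize must be between 1 and 100, got %d", s.BatchSize)
	}
	if s.Image == "" { // assumption: an image is mandatory
		return 0, errors.New("image is required")
	}
	d, err := time.ParseDuration(s.Timeout)
	if err != nil {
		return 0, fmt.Errorf("invalid timeout %q: %w", s.Timeout, err)
	}
	if d <= 0 {
		return 0, errors.New("timeout must be positive")
	}
	return d, nil
}

func main() {
	spec := BatchWorkloadSpec{
		Replicas:     5,
		NodeSelector: map[string]string{"region": "us-west", "type": "edge"},
		Image:        "nginx:latest",
		Command:      []string{"nginx", "-g", "daemon off;"},
		BatchSize:    2,
		Timeout:      "60s",
	}
	timeout, err := validateSpec(spec)
	fmt.Println(timeout, err)
}
```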

Testing

Run unit tests:

go test ./...

With coverage:

make test-coverage

Tests include:

  • Node selector: Label selector, explicit names, limit, readiness filter.
  • Batch processor: Concurrency, timeout, context cancel, retries.
  • Workload deployer: List/delete pods, deploy (with fake client).
  • Controller: Reconcile not found, finalizer addition, conditions, parseDuration.

Metrics and health

  • Metrics server: :8080 (Prometheus metrics from controller-runtime).
  • Health/ready probes: :8081.

License

Owned by abhicodes11.
