Add Kubernetes autoscaling support #418
Conversation
- Implement HPA resources for 8 services (fuzzer-bot, coverage-bot, build-bot, pov-reproducer, tracer-bot, patcher, seed-gen, task-downloader)
- Add autoscaling configuration to service values files (disabled by default)
- Update deployment templates to support conditional replica counts when HPA is enabled
- Adjust resource requests for better HPA accuracy (fuzzer-bot, coverage-bot, pov-reproducer, tracer-bot)
- Add autoscaling documentation to global values.yaml

Resolves trailofbits#348
dguido left a comment
PR Review: Add Kubernetes autoscaling support
Thanks for adding HPA support! The implementation is solid, but there are a few items to address before merging.
🔴 Blocking: Resource Request Increases Need Justification
The PR silently increases resource requests, which will affect cluster capacity even when HPA is disabled:
| Service | CPU Request | Memory Request |
|---|---|---|
| coverage-bot | 250m → 500m | 256Mi → 6Gi (24x increase) |
| fuzzer-bot | — | 256Mi → 1536Mi (6x) |
| pov-reproducer | 100m → 500m | 1Gi → 3Gi (3x) |
| tracer-bot | — | 256Mi → 1536Mi (6x) |
The PR description mentions "adjust resource requests for better HPA accuracy", but these are significant changes that need explicit justification.
Action required: Either:
- Add justification for these values (OOM observations, profiling data, etc.), or
- Split resource request changes into a separate PR so they can be reviewed independently
🟡 Suggestions
Consider a shared HPA template
All 8 hpa.yaml files are nearly identical (272 lines of duplication). Consider a helper template in _helpers.tpl:
{{- define "common.hpa" -}}
{{- if and .Values.enabled .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
...
{{- end }}
{{- end }}

Not blocking, but would reduce maintenance burden.
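For illustration only, a fuller helper could look like the sketch below; the values keys, chart layout, and use of `.Chart.Name` for the target Deployment are assumptions about this chart, not taken from the PR:

```yaml
{{/* Hypothetical shared HPA helper; keys and names are assumed, adjust to the chart's conventions */}}
{{- define "common.hpa" -}}
{{- if and .Values.enabled .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ .Chart.Name }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .Chart.Name }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
{{- end }}
{{- end }}
```

Each service's `templates/hpa.yaml` would then shrink to a single `{{ include "common.hpa" . }}`.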
Add memory-based scaling for memory-intensive services
Only pov-reproducer has targetMemoryUtilizationPercentage. Given fuzzer-bot, coverage-bot, and tracer-bot are memory-intensive, they might benefit from memory-based scaling too.
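For instance, a memory target could sit next to the CPU one in a service's autoscaling block; everything here except `targetMemoryUtilizationPercentage` (which the PR already uses for pov-reproducer) is an assumed key name with placeholder values:

```yaml
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 8
  targetCPUUtilizationPercentage: 80
  # Scale out when average memory usage across pods exceeds 75% of the request
  targetMemoryUtilizationPercentage: 75
```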
Document the scale-down timing differences
Some services use 600s stabilization (coverage-bot, tracer-bot, patcher), others use 300s. A brief comment in the values files explaining why would help future maintainers.
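Even a one-line comment next to the window value would be enough; a hypothetical example (the rationale given here is a guess and should be replaced with the real one):

```yaml
behavior:
  scaleDown:
    # 600s instead of the default 300s: coverage/trace runs are long, so wait
    # longer before scaling in to avoid churning pods mid-run (assumed rationale)
    stabilizationWindowSeconds: 600
```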
🟢 What looks good
- Uses `autoscaling/v2` (current stable API) ✓
- Correctly gates replica count with `{{- if not .Values.autoscaling.enabled }}` ✓
- Disabled by default (safe rollout) ✓
- Condition `and .Values.enabled .Values.autoscaling.enabled` is correct ✓
- Well-structured behavior policies with stabilization windows ✓
Optional improvements (not blocking)
- Consider adding PodDisruptionBudgets for critical services when scaling down (see the sketch after this list)
- Update deployment docs with metrics-server setup instructions for different cluster types
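A PDB would not change HPA scale-in itself, but it does protect critical pods from voluntary evictions (e.g. during node scale-down). A minimal sketch, assuming an `app: patcher` label on the Deployment's pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: patcher
spec:
  # Keep at least one patcher running through voluntary disruptions
  minAvailable: 1
  selector:
    matchLabels:
      app: patcher
```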
- Revert resource request increases to original values
- Extract shared HPA template into _helpers.tpl to reduce duplication
- Add memory-based scaling for memory-intensive services
- Document scale-down stabilization window choices
Hi @ArsalanAnwer0,

Could you please explain the use case you are targeting and how you've seen these changes benefit it? When we've talked about this in the past it was mostly as a way of scaling build-bot (and potentially patcher) up and down. We believe a metric based on pending work would more accurately reflect when scaling needs to happen. I.e. if there are three pending patch requests, we could scale to three patchers to work in parallel; when they complete, the patcher count could go back down to 1 (or 0?). Same for the builder: it is only used at the beginning of processing, but you could benefit from having more builders and scaling them down once all builds are complete.

Fuzzers could potentially benefit from the current change, but you would likely scale up to max as soon as you start processing (at least for well-written harnesses), so the main benefit there would be scaling down when idle. Also, when scaling down one would need to be careful not to kill a pod that's doing long-running useful work (e.g. patcher) but instead one of the idle ones.

Another thing that complicates this is that a lot of the work is done by docker-in-docker and not e.g. build-bot itself, so the CPU usage of the build-bot doesn't accurately reflect the load on the system.

Happy to hear your thoughts on this and your experience with auto-scaling for Buttercup.
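For reference, pending-work scaling of that kind is usually expressed with an external scaler rather than a resource-metric HPA. A minimal sketch using a KEDA ScaledObject with a Prometheus trigger; the metric name, Prometheus address, and threshold below are purely illustrative, not existing Buttercup metrics:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: patcher
spec:
  scaleTargetRef:
    name: patcher            # Deployment to scale
  minReplicaCount: 0         # scale to zero when there is no pending work
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # illustrative address
        query: sum(buttercup_pending_patch_requests)           # hypothetical metric
        threshold: "1"                                         # one patcher per pending request
```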
Implement HorizontalPodAutoscaler (HPA) for 8 services to enable dynamic scaling based on CPU/memory utilization.
Changes
- Shared HPA helper template (`buttercup.hpa`) in `_helpers.tpl` to eliminate duplication

Configuration
Autoscaling is disabled by default. Enable per service in values.yaml:
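A hypothetical excerpt (the per-service key and the values shown are placeholders; use the keys added by this PR):

```yaml
fuzzer-bot:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 8
```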
Stabilization Windows
Prerequisites
Requires the Kubernetes metrics-server to be installed in the cluster so the HPA can read pod CPU/memory usage.
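Some managed clusters ship it preinstalled; on others the upstream manifest can typically be applied directly:

```sh
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```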
Resolves #348