Conversation

@Liam-DeVoe
Collaborator

This doesn't actually incorporate estimator state right now; it just hardcodes an estimator of 1.0 for all nodes. But the groundwork and tests are all here for when we do plug in an estimator. (We weren't using estimator state before, so this doesn't change the status quo.)

@Liam-DeVoe Liam-DeVoe requested a review from Zac-HD June 16, 2025 05:18
@Zac-HD
Owner

Zac-HD commented Jun 16, 2025

Hmm. Given that this is non-stationary and we have little information at the start (who knows if old estimators are any good?), it feels like trying to split in advance might be the wrong call.

Since we only do this among worker processes on a single machine, what if we instead started with round-robin allocation, and then occasionally rebalanced via multiprocessing.Queue or similar? Something like "pick three processes; the one with the highest-value targets donates some not-quite-top targets to the one with the lowest value" (or two, if there are only two). That should be cheap-ish to keep things reasonably current with a multiprocessing queue, and it can be async in the sense of only checking when we'd have switched targets anyway. Bit annoying if this means an extra collection step, but we've wanted to make that incremental anyway to amortize startup costs...
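A minimal sketch of what that could look like, with all names hypothetical and a value_of callable standing in for whatever estimator we end up using:

import itertools
import random

# Hypothetical sketch: round-robin initial allocation, plus a
# "pick three; the richest donates to the poorest" rebalance step.

def round_robin(targets, n_workers):
    # Deal targets out one at a time, like cards.
    allocation = [[] for _ in range(n_workers)]
    for target, bucket in zip(targets, itertools.cycle(allocation)):
        bucket.append(target)
    return allocation

def rebalance_step(allocation, value_of):
    # Pick three workers (or two, if that's all we have); the one whose
    # targets have the highest total value donates a not-quite-top
    # target to the one with the lowest total value.
    chosen = random.sample(allocation, min(3, len(allocation)))
    chosen.sort(key=lambda bucket: sum(value_of(t) for t in bucket))
    poorest, richest = chosen[0], chosen[-1]
    if len(richest) > 1:
        donated = sorted(richest, key=value_of)[-2]
        richest.remove(donated)
        poorest.append(donated)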

@Liam-DeVoe
Collaborator Author

Liam-DeVoe commented Jun 16, 2025

I think we want the initial distribution to still be via distribute_nodes, and then we globally rebalance whenever:

  • a worker has no remaining targets (due to finding failures)
  • or the actual observed behaviors_per_second value of a worker (summed across its targets) drifts too far from its initial estimator

where a global rebalance means re-solving the distribute_nodes bin-packing problem, with an added constraint that we don't change the current distribution too much (once we have estimators for switching cost and worker lifetime, we can use those to quantify how much change is worthwhile). A rough sketch of the trigger check is below.
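The trigger check might look something like this; the attribute names and the drift tolerance are placeholders, not anything this PR defines:

DRIFT_TOLERANCE = 0.5  # arbitrary placeholder

def should_rebalance(worker) -> bool:
    # Trigger 1: the worker has nothing left to fuzz, e.g. because all of
    # its targets found failures and were retired.
    if not worker.remaining_targets:
        return True
    # Trigger 2: observed throughput has drifted too far from the estimate
    # we used when solving distribute_nodes.
    observed = sum(t.observed_behaviors_per_second for t in worker.remaining_targets)
    estimated = sum(t.estimated_behaviors_per_second for t in worker.remaining_targets)
    return abs(observed - estimated) > DRIFT_TOLERANCE * estimated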

@Zac-HD
Owner

Zac-HD commented Jun 17, 2025

Disagree.

  • a global balance implies that all workers need to stop at the same time, which means we're going to have a period of idle time while n-1 workers wait for the slowest one to stop.
  • like crash-only software, why add a clean startup mechanism when the runtime mechanism you need anyway could do the job?

I'm imagining a hub-and-spoke design here, where each worker occasionally sends the 'hub' process a dict[nodeid, fingerprints_per_second] for the tests it currently has, and the hub replies with a tuple[list[nodeid], list[nodeid]] of tests to start/stop fuzzing (or maybe start/suspend/stop). Once we've brought up all the estimators properly¹, we should be able to balance value-of-compute across workers to within a few percent pretty easily.

It also has the substantial benefit that there's a straightforward path to extend this from balancing processes on a single host, to working via the database for an entire fleet 🙂
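A minimal sketch of the report/reply messages that design implies; the class and field names are hypothetical, not from this PR:

from dataclasses import dataclass, field

NodeId = str

@dataclass
class WorkerReport:
    # Sent worker -> hub: current throughput per test node.
    worker_id: int
    fingerprints_per_second: dict[NodeId, float]

@dataclass
class HubReply:
    # Sent hub -> worker: which tests to start or stop fuzzing.
    start: list[NodeId] = field(default_factory=list)
    stop: list[NodeId] = field(default_factory=list)
    # A third list could distinguish "suspend for now" from "stop for good",
    # per the start/suspend/stop variant above.
    suspend: list[NodeId] = field(default_factory=list)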

Footnotes

  1. amortizing startup cost seems very important to me, and drops out of long-running workloads where this rebalancing matters anyway.

@Liam-DeVoe
Collaborator Author

Liam-DeVoe commented Jun 17, 2025

> a global balance implies that all workers need to stop at the same time

Not necessarily; each worker can check in with the hub (by reading from a shared multiprocessing.Manager dict or similar) for its new list of nodes whenever it would have switched targets anyway. We'll have the possibility of two workers fuzzing the same node simultaneously, but that's fine.
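A minimal sketch of that check-in pattern, assuming the hub publishes each worker's assignment into a multiprocessing.Manager().dict() keyed by worker id (names hypothetical):

import multiprocessing

def make_shared_assignments():
    # The hub owns the manager and writes worker_id -> list of node ids.
    manager = multiprocessing.Manager()
    return manager, manager.dict()

def maybe_refresh_targets(assignments, worker_id, current):
    # Called by a worker at the point where it was about to switch targets
    # anyway; no stop-the-world step. If two workers briefly fuzz the same
    # node, that's acceptable.
    return list(assignments.get(worker_id, current))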

> like crash-only software, why add a clean startup mechanism when the runtime mechanism you need anyway could do the job?

The runtime mechanism might take a while to balance, and we have the estimators right there! Might as well initialize a gradient descent with a good guess.

I'm in agreement with the hub-and-spoke design, but I'm not convinced about "pick 3 and redistribute among those 3" when we could do "pick all n and redistribute among those n". (And I now think the hub should just continuously rebalance on an interval, rather than waiting for some condition.)

@Zac-HD
Owner

Zac-HD commented Jun 17, 2025

> The runtime mechanism might take a while to balance, and we have the estimators right there! Might as well initialize a gradient descent with a good guess.

Synthesis: we start each worker with an empty set of tests, and whenever it has zero runnable tests (e.g. at startup, or after finding lots of failures) it asks the manager for a new set.
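From the worker's side, that might look roughly like this; the hub.request_targets call and the target methods are hypothetical stand-ins for whatever interface we settle on:

def worker_loop(hub, worker_id):
    # Hypothetical worker-side loop for the "start empty, ask when you
    # run out" synthesis above.
    targets = []  # every worker starts with an empty set of tests
    while True:
        if not targets:
            # At startup, or after retiring everything (e.g. lots of
            # failures found), ask the hub/manager for a fresh allocation.
            targets = hub.request_targets(worker_id)
            continue
        target = targets[0]       # stand-in for real target selection
        target.run_one_input()    # stand-in for the actual fuzzing step
        targets = [t for t in targets if not t.finished]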

> I'm in agreement with the hub-and-spoke design, but I'm not convinced about "pick 3 and redistribute among those 3" when we could do "pick all n and redistribute among those n". (And I now think the hub should just continuously rebalance on an interval, rather than waiting for some condition.)

Fair enough; I was probably leaning too heavily on "power of two random choices" and thinking about the multi-node future version - but "just works well for a single node" is the better goal for now.

@Liam-DeVoe
Collaborator Author

haven't quite filled in all the estimator details yet, but this structure is what I was thinking of

@Liam-DeVoe
Collaborator Author

I basically went all the way to just doing the full Bayes thing here, incorporating startup costs in our estimators, etc.

I'm going to split off a PR from this for just the hub and worker structure with no fancy estimators, and then keep this as a dependent (-ing?) PR which adds the estimators.
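For flavor only: one standard way to do the "full Bayes thing" for a rate such as behaviors-per-second is a conjugate Gamma-Poisson estimator. This is an illustrative sketch, not necessarily what either PR implements:

from dataclasses import dataclass

@dataclass
class RateEstimator:
    # Gamma(alpha, beta) prior over a rate; Poisson-distributed counts
    # over a known exposure time give the Gamma posterior updated below.
    alpha: float = 1.0  # prior pseudo-count of behaviors
    beta: float = 1.0   # prior pseudo-seconds of fuzzing

    def observe(self, behaviors: int, seconds: float) -> None:
        self.alpha += behaviors
        self.beta += seconds

    @property
    def behaviors_per_second(self) -> float:
        # Posterior mean of the rate.
        return self.alpha / self.beta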

Comment on lines +76 to +80
(
_targets(("a", (1, 1), 0), ("b", (2, 2), 0), ("c", (3, 3), 0)),
2,
{("a", "c"), ("b",)},
),
@Zac-HD
Owner

Hmm, the greedy solution seems pretty bad here - don't we want (a, b), (c,)?

Collaborator Author

@Liam-DeVoe Liam-DeVoe Jul 4, 2025


(a, b), (c,) would be ideal, yeah. This particular case will be helped by iterating highest -> lowest. In general I'm not too worried about suboptimal greedy solutions, since I'd expect the rebalancing to fix things up eventually. (Though if the rebalancing has failure modes in common configurations then that's bad of course, and I'm glad you're pointing this out.)
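For concreteness, a sketch of that greedy pass with the highest -> lowest ordering; the function name and the scalar values standing in for the per-target estimates are hypothetical:

import heapq

def distribute(values: dict[str, float], n_workers: int) -> list[set[str]]:
    # Iterate targets from highest to lowest estimated value, always
    # giving the next target to the currently least-loaded worker.
    buckets = [(0.0, i, set()) for i in range(n_workers)]
    heapq.heapify(buckets)
    for target in sorted(values, key=values.get, reverse=True):
        load, i, assigned = heapq.heappop(buckets)
        assigned.add(target)
        heapq.heappush(buckets, (load + values[target], i, assigned))
    return [assigned for _, _, assigned in buckets]

# With values like {"a": 1, "b": 2, "c": 3} and two workers this yields
# {"c"} and {"a", "b"} rather than the {"a", "c"} / {"b"} split above.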

@Liam-DeVoe Liam-DeVoe mentioned this pull request Jul 27, 2025