Conversation

@Liam-DeVoe
Collaborator

This doesn't actually incorporate estimator state right now; it just hardcodes an estimator of 1.0 for all nodes. But the groundwork and tests are all here for when we do plug in an estimator. (We weren't using estimator state before, so this doesn't change the status quo.)

@Liam-DeVoe Liam-DeVoe requested a review from Zac-HD June 16, 2025 05:18
@Zac-HD
Owner

Zac-HD commented Jun 16, 2025

Hmm. Given that this is non-stationary and we have little information at the start (who knows if old estimators are any good?), it feels like trying to split in advance might be the wrong call.

Since we only do this among worker processes on a single machine, what if we instead started with round-robin allocation, and then occasionally rebalanced via multiprocessing.Queue or similar? Something like "pick three processes; the one with the highest-value targets donates some not-quite-top targets to the one with the lowest value" (or two, if there are only two). That should be cheap-ish to keep things reasonably current with a multiprocessing queue, and it can be async in the sense of only checking when we'd have switched targets anyway. Bit annoying if this means an extra collection step, but we've wanted to make that incremental anyway to amortize startup costs...
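A minimal sketch of what that could look like, with all names hypothetical and a value_of callable standing in for whatever estimator we end up using:

import itertools
import random

# Hypothetical sketch: round-robin initial allocation, plus a
# "pick three; the richest donates to the poorest" rebalance step.

def round_robin(targets, n_workers):
    # Deal targets out one at a time, like cards.
    allocation = [[] for _ in range(n_workers)]
    for target, bucket in zip(targets, itertools.cycle(allocation)):
        bucket.append(target)
    return allocation

def rebalance_step(allocation, value_of):
    # Pick three workers (or two, if that's all we have); the one whose
    # targets have the highest total value donates a not-quite-top
    # target to the one with the lowest total value.
    chosen = random.sample(allocation, min(3, len(allocation)))
    chosen.sort(key=lambda bucket: sum(value_of(t) for t in bucket))
    poorest, richest = chosen[0], chosen[-1]
    if len(richest) > 1:
        donated = sorted(richest, key=value_of)[-2]
        richest.remove(donated)
        poorest.append(donated)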

@Liam-DeVoe
Collaborator Author

Liam-DeVoe commented Jun 16, 2025

I think we want the initial distribution to still be via distribute_nodes, and then we globally rebalance whenever:

  • a worker has no remaining targets (due to finding failures)
  • or the actual observed behaviors_per_second value of a worker (summed across its targets) drifts too far from its initial estimator

where a global rebalance means re-solving the distribute_nodes bin-packing problem, with an added constraint that we don't change the current distribution too much (once we have estimators for switching cost and worker lifetime, we can use those to quantify how much change is worthwhile). A rough sketch of the trigger check is below.
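The trigger check might look something like this; the attribute names and the drift tolerance are placeholders, not anything this PR defines:

DRIFT_TOLERANCE = 0.5  # arbitrary placeholder

def should_rebalance(worker) -> bool:
    # Trigger 1: the worker has nothing left to fuzz, e.g. because all of
    # its targets found failures and were retired.
    if not worker.remaining_targets:
        return True
    # Trigger 2: observed throughput has drifted too far from the estimate
    # we used when solving distribute_nodes.
    observed = sum(t.observed_behaviors_per_second for t in worker.remaining_targets)
    estimated = sum(t.estimated_behaviors_per_second for t in worker.remaining_targets)
    return abs(observed - estimated) > DRIFT_TOLERANCE * estimated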

@Zac-HD
Owner

Zac-HD commented Jun 17, 2025

Disagree.

  • a global balance implies that all workers need to stop at the same time, which means we're going to have a period of idle time while n-1 workers wait for the slowest one to stop.
  • like crash-only software, why add a clean startup mechanism when the runtime mechanism you need anyway could do the job?

I'm imagining a hub-and-spoke design here, where each worker occasionally sends the 'hub' process a dict[nodeid, fingerprints_per_second] for the tests it currently has, and the hub replies with a tuple[list[nodeid], list[nodeid]] of tests to start/stop fuzzing (or maybe start/suspend/stop). Once we've brought up all the estimators properly¹, we should be able to balance value-of-compute across workers to within a few percent pretty easily.

It also has the substantial benefit that there's a straightforward path to extend this from balancing processes on a single host, to working via the database for an entire fleet 🙂
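A minimal sketch of the report/reply messages that design implies; the class and field names are hypothetical, not from this PR:

from dataclasses import dataclass, field

NodeId = str

@dataclass
class WorkerReport:
    # Sent worker -> hub: current throughput per test node.
    worker_id: int
    fingerprints_per_second: dict[NodeId, float]

@dataclass
class HubReply:
    # Sent hub -> worker: which tests to start or stop fuzzing.
    start: list[NodeId] = field(default_factory=list)
    stop: list[NodeId] = field(default_factory=list)
    # A third list could distinguish "suspend for now" from "stop for good",
    # per the start/suspend/stop variant above.
    suspend: list[NodeId] = field(default_factory=list)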

Footnotes

  1. amortizing startup cost seems very important to me, and drops out of long-running workloads where this rebalancing matters anyway.

@Liam-DeVoe
Collaborator Author

Liam-DeVoe commented Jun 17, 2025

> a global balance implies that all workers need to stop at the same time

Not necessarily; each worker can check in with the hub (by reading from a shared multiprocessing.Manager dict or similar) for its new list of nodes whenever it would have switched targets anyway. We'll have the possibility of two workers fuzzing the same node simultaneously, but that's fine.
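A minimal sketch of that check-in pattern, assuming the hub publishes each worker's assignment into a multiprocessing.Manager().dict() keyed by worker id (names hypothetical):

import multiprocessing

def make_shared_assignments():
    # The hub owns the manager and writes worker_id -> list of node ids.
    manager = multiprocessing.Manager()
    return manager, manager.dict()

def maybe_refresh_targets(assignments, worker_id, current):
    # Called by a worker at the point where it was about to switch targets
    # anyway; no stop-the-world step. If two workers briefly fuzz the same
    # node, that's acceptable.
    return list(assignments.get(worker_id, current))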

> like crash-only software, why add a clean startup mechanism when the runtime mechanism you need anyway could do the job?

The runtime mechanism might take a while to balance, and we have the estimators right there! Might as well initialize a gradient descent with a good guess.

I'm in agreement with the hub-and-spoke design, but I'm not convinced about "pick 3 and redistribute among those 3" when we could do "pick all n and redistribute among those n". (And I now think the hub should just continuously rebalance on an interval, rather than waiting for some condition.)

@Zac-HD
Owner

Zac-HD commented Jun 17, 2025

> The runtime mechanism might take a while to balance, and we have the estimators right there! Might as well initialize a gradient descent with a good guess.

Synthesis: we start each worker with an empty set of tests, and whenever it has zero runnable tests (e.g. at startup, or after finding lots of failures) it asks the manager for a new set.
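From the worker's side, that might look roughly like this; the hub.request_targets call and the target methods are hypothetical stand-ins for whatever interface we settle on:

def worker_loop(hub, worker_id):
    # Hypothetical worker-side loop for the "start empty, ask when you
    # run out" synthesis above.
    targets = []  # every worker starts with an empty set of tests
    while True:
        if not targets:
            # At startup, or after retiring everything (e.g. lots of
            # failures found), ask the hub/manager for a fresh allocation.
            targets = hub.request_targets(worker_id)
            continue
        target = targets[0]       # stand-in for real target selection
        target.run_one_input()    # stand-in for the actual fuzzing step
        targets = [t for t in targets if not t.finished]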

> I'm in agreement with the hub-and-spoke design, but I'm not convinced about "pick 3 and redistribute among those 3" when we could do "pick all n and redistribute among those n". (And I now think the hub should just continuously rebalance on an interval, rather than waiting for some condition.)

Fair enough; I was probably leaning too heavily on "power of two random choices" and thinking about the multi-node future version - but "just works well for a single node" is the better goal for now.

@Liam-DeVoe
Collaborator Author

haven't quite filled in all the estimator details yet, but this structure is what I was thinking of

@Liam-DeVoe
Collaborator Author

I basically went all the way to just doing the full Bayes thing here, incorporating startup costs in our estimators, etc.

I'm going to split off a PR from this for just the hub and worker structure with no fancy estimators, and then keep this as a dependent (-ing?) PR which adds the estimators.
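For flavor only: one standard way to do the "full Bayes thing" for a rate such as behaviors-per-second is a conjugate Gamma-Poisson estimator. This is an illustrative sketch, not necessarily what either PR implements:

from dataclasses import dataclass

@dataclass
class RateEstimator:
    # Gamma(alpha, beta) prior over a rate; Poisson-distributed counts
    # over a known exposure time give the Gamma posterior updated below.
    alpha: float = 1.0  # prior pseudo-count of behaviors
    beta: float = 1.0   # prior pseudo-seconds of fuzzing

    def observe(self, behaviors: int, seconds: float) -> None:
        self.alpha += behaviors
        self.beta += seconds

    @property
    def behaviors_per_second(self) -> float:
        # Posterior mean of the rate.
        return self.alpha / self.beta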

Comment on lines +76 to +80
(
_targets(("a", (1, 1), 0), ("b", (2, 2), 0), ("c", (3, 3), 0)),
2,
{("a", "c"), ("b",)},
),
@Zac-HD
Owner

Hmm, the greedy solution seems pretty bad here - don't we want (a, b), (c,)?

Collaborator Author

@Liam-DeVoe Liam-DeVoe Jul 4, 2025


(a, b), (c,) would be ideal, yeah. This particular case will be helped by iterating highest -> lowest. In general I'm not too worried about suboptimal greedy solutions, since I'd expect the rebalancing to fix things up eventually. (Though if the rebalancing has failure modes in common configurations then that's bad of course, and I'm glad you're pointing this out.)
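For concreteness, a sketch of that greedy pass with the highest -> lowest ordering; the function name and the scalar values standing in for the per-target estimates are hypothetical:

import heapq

def distribute(values: dict[str, float], n_workers: int) -> list[set[str]]:
    # Iterate targets from highest to lowest estimated value, always
    # giving the next target to the currently least-loaded worker.
    buckets = [(0.0, i, set()) for i in range(n_workers)]
    heapq.heapify(buckets)
    for target in sorted(values, key=values.get, reverse=True):
        load, i, assigned = heapq.heappop(buckets)
        assigned.add(target)
        heapq.heappush(buckets, (load + values[target], i, assigned))
    return [assigned for _, _, assigned in buckets]

# With values like {"a": 1, "b": 2, "c": 3} and two workers this yields
# {"c"} and {"a", "b"} rather than the {"a", "c"} / {"b"} split above.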

@Liam-DeVoe Liam-DeVoe mentioned this pull request Jul 27, 2025