Description
As the demand for netperf scales, so does the number of concurrent jobs. However, Azure limits the number of VMs a pool can create.
Normally this isn't a problem with a pool model like the one 1ES has. A 1-machine pool can complete N jobs; it would just take a long time.
Here's the problem.
- Let X be the maximum number of machines a pool can create.
- Let Y be the number of workflows running concurrently.
- The API for integrating GitHub with Azure 1ES involves pushing M jobs in a workflow, and Azure randomly assigns machines to those jobs. If X >= M, every job can run concurrently. Otherwise, only a random subset of the jobs will get machines.
- Our networking perf jobs require a pair of machines, so if a workflow defines Z perf scenarios, netperf will generate M = 2 * Z jobs to request machines from Azure.
Right now, we have multiple pools, and 2 * Z < X for all pools and scenarios, so this isn't an issue.
But as we scale and multiple PRs are opened from multiple projects, we reach a point where Y * (2 * Z) > X.
We can get unlucky and reach a deadlock. And even without the possibility of deadlocks, we will have many jobs queued for a long time before a machine gets assigned, and they will fail because of timeout restrictions enforced by netperf.
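The capacity condition above can be written as a quick check. This is only a sketch of the arithmetic in this issue; the function and parameter names are made up for illustration and are not part of netperf or the 1ES API.

```python
def jobs_requested(workflows: int, scenarios_per_workflow: int) -> int:
    """Total jobs pushed to a pool: Y workflows, each with Z scenarios,
    and every scenario needs a client job and a server job (M = 2 * Z)."""
    return workflows * 2 * scenarios_per_workflow


def is_oversubscribed(pool_size: int, workflows: int, scenarios_per_workflow: int) -> bool:
    """True when Y * (2 * Z) > X, i.e. some jobs must wait for a machine."""
    return jobs_requested(workflows, scenarios_per_workflow) > pool_size
```

For example, with a hypothetical pool of X = 10 machines and Z = 2 scenarios per workflow, one workflow fits (4 jobs), but three concurrent workflows request 12 jobs and oversubscribe the pool.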
To illustrate the deadlock possibility:
Let's say we have a 2-machine pool, and a workflow has 2 perf scenarios (A and B), so netperf will generate 4 jobs to request 4 machines from Azure.
Generated Job for scenario A (client) - Assigned
Generated Job for scenario A (server) - Waiting...
Generated Job for scenario B (client) - Assigned
Generated Job for scenario B (server) - Waiting...
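In the assignment above, both machines are held by client jobs whose servers are still waiting, so neither scenario can make progress. The sketch below enumerates every way a pool could hand out its machines and counts how many of those outcomes are stuck in this way. It assumes that an assigned job holds its machine until its peer is also assigned, and that the pool is no larger than the number of jobs; the function names are hypothetical and this is not how netperf or 1ES is implemented.

```python
from itertools import combinations


def is_deadlocked(assigned: set[tuple[str, str]]) -> bool:
    """Jobs are (scenario, role) pairs. The pool is stuck when no scenario
    has both its client and its server assigned to a machine."""
    scenarios = {scenario for scenario, _ in assigned}
    return not any(
        (s, "client") in assigned and (s, "server") in assigned
        for s in scenarios
    )


def deadlock_outcomes(scenarios: list[str], pool_size: int) -> tuple[int, int]:
    """Return (stuck outcomes, total outcomes) over every choice of
    pool_size jobs out of the 2 * len(scenarios) generated jobs."""
    jobs = [(s, role) for s in scenarios for role in ("client", "server")]
    outcomes = list(combinations(jobs, pool_size))
    stuck = sum(is_deadlocked(set(outcome)) for outcome in outcomes)
    return stuck, len(outcomes)


print(deadlock_outcomes(["A", "B"], 2))  # (4, 6)
```

For the 2-machine pool with scenarios A and B, 4 of the 6 possible assignments leave both machines on jobs from different scenarios. If the assignment were uniformly random (an assumption; the issue only says it is random), that would be a 2-in-3 chance of getting stuck.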