Possibility of deadlock causing timeout errors #407

@ProjectsByJackHe

Description

As the demand for netperf scales, so does the number of concurrent jobs. However, Azure has a limited capacity for the number of VMs a pool can create.

Normally this isn't a problem with a pool model like the one 1ES has: a 1-machine pool can still complete N jobs, it just takes a long time.

Here's the problem.

  • Let X be the maximum number of machines a pool can create.
  • Let Y be the number of workflows running concurrently.
  • The API for integrating GitHub with Azure 1ES involves pushing M jobs in a workflow, and Azure will randomly assign machines to those jobs. If X >= M, every job can run concurrently; otherwise, only a random subset runs.
  • Our networking perf jobs require a pair of machines, so if a workflow defines Z perf scenarios, netperf will generate M = 2 * Z jobs to request machines from Azure.

Right now, we have multiple pools and 2 * Z < X for every pool and scenario set, so this isn't an issue yet.

But as we scale, with multiple PRs coming in from multiple projects, we reach a point where Y * (2 * Z) > X.
We can get unlucky and hit a deadlock. And even without a deadlock, many jobs will sit queued for a long time before a machine gets assigned, and they will fail because of the timeout restrictions netperf enforces.
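
As a purely illustrative example (hypothetical numbers): with a pool capped at X = 8 machines and workflows that each define Z = 3 scenarios (so 2 * Z = 6 machine requests per workflow), just Y = 2 concurrent workflows already ask for 12 machines, so Y * (2 * Z) = 12 > 8 and at least four jobs are left waiting.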

To illustrate the deadlock possibility:
Let's say we have a 2-machine pool and a workflow with 2 perf scenarios (A and B), so netperf generates 4 jobs to request 4 machines from Azure.

Generated Job for scenario A (client) - Assigned
Generated Job for scenario A (server) - Waiting...

Generated Job for scenario B (client) - Assigned
Generated Job for scenario B (server) - Waiting...
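
Both client jobs hold the pool's only two machines while they wait for their servers, and both server jobs wait for machines that will never free up, so neither scenario can make progress and every job eventually hits the netperf timeout.

Here's a minimal sketch of that failure mode (a simplified, hypothetical model of the scheduler, not the real 1ES assignment logic), assuming machines go to waiting jobs in random order and that an assigned job holds its machine while it waits for its peer:

```python
import random

def schedule_once(machines: int, scenarios: list[str]) -> bool:
    """Return True if the random assignment deadlocks (no scenario got both of its jobs)."""
    jobs = [(s, role) for s in scenarios for role in ("client", "server")]
    random.shuffle(jobs)               # the pool picks waiting jobs in arbitrary order
    assigned = jobs[:machines]         # only X jobs get a machine; the rest keep waiting
    per_scenario = {}
    for scenario, _ in assigned:
        per_scenario[scenario] = per_scenario.get(scenario, 0) + 1
    # Deadlock: no scenario has both its client and server on a machine, and the
    # assigned jobs never release their machines while waiting for their peers.
    return all(count < 2 for count in per_scenario.values())

runs = 10_000
deadlocks = sum(schedule_once(machines=2, scenarios=["A", "B"]) for _ in range(runs))
print(f"deadlocked in {deadlocks / runs:.0%} of runs")
```

Under this toy model and these example numbers, the bad interleaving shows up in roughly two out of three runs, so once Y * (2 * Z) > X this isn't a rare corner case.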

Metadata

Labels

P2, azure (Specific to Azure environment), bug (Something isn't working)
