Possibility of deadlock causing timeout errors #407

@ProjectsByJackHe

Description

As the demand for netperf scales, so does the number of concurrent jobs. However, Azure has a limited capacity for the number of VMs a pool can create.

Normally this isn't a problem with a pool model like the one 1ES has: a 1-machine pool can still complete N jobs, it just takes a long time.

Here's the problem.

  • Let X be the maximum number of machines a pool can create.
  • Let Y be the number of workflows running concurrently.
  • The API for integrating GitHub with Azure 1ES involves pushing M jobs in a workflow, and Azure will randomly assign machines to those jobs. If X >= M, every job can run concurrently; otherwise, only a random subset runs.
  • Our networking perf jobs require a pair of machines, so if a workflow defines Z perf scenarios, netperf will generate M = 2 * Z jobs to request machines from Azure.

Right now, we have multiple pools and 2 * Z < X for every pool and scenario set, so this isn't an issue yet.

But as we scale, with multiple PRs coming in from multiple projects, we reach a point where Y * (2 * Z) > X.
We can get unlucky and hit a deadlock. And even without a deadlock, many jobs will sit queued for a long time before a machine gets assigned, and they will fail because of the timeout restrictions netperf enforces.
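
As a purely illustrative example (hypothetical numbers): with a pool capped at X = 8 machines and workflows that each define Z = 3 scenarios (so 2 * Z = 6 machine requests per workflow), just Y = 2 concurrent workflows already ask for 12 machines, so Y * (2 * Z) = 12 > 8 and at least four jobs are left waiting.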

To illustrate the deadlock possibility:
Let's say we have a 2-machine pool and a workflow with 2 perf scenarios (A and B), so netperf generates 4 jobs to request 4 machines from Azure.

Generated Job for scenario A (client) - Assigned
Generated Job for scenario A (server) - Waiting...

Generated Job for scenario B (client) - Assigned
Generated Job for scenario B (server) - Waiting...
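
Both client jobs hold the pool's only two machines while they wait for their servers, and both server jobs wait for machines that will never free up, so neither scenario can make progress and every job eventually hits the netperf timeout.

Here's a minimal sketch of that failure mode (a simplified, hypothetical model of the scheduler, not the real 1ES assignment logic), assuming machines go to waiting jobs in random order and that an assigned job holds its machine while it waits for its peer:

```python
import random

def schedule_once(machines: int, scenarios: list[str]) -> bool:
    """Return True if the random assignment deadlocks (no scenario got both of its jobs)."""
    jobs = [(s, role) for s in scenarios for role in ("client", "server")]
    random.shuffle(jobs)               # the pool picks waiting jobs in arbitrary order
    assigned = jobs[:machines]         # only X jobs get a machine; the rest keep waiting
    per_scenario = {}
    for scenario, _ in assigned:
        per_scenario[scenario] = per_scenario.get(scenario, 0) + 1
    # Deadlock: no scenario has both its client and server on a machine, and the
    # assigned jobs never release their machines while waiting for their peers.
    return all(count < 2 for count in per_scenario.values())

runs = 10_000
deadlocks = sum(schedule_once(machines=2, scenarios=["A", "B"]) for _ in range(runs))
print(f"deadlocked in {deadlocks / runs:.0%} of runs")
```

Under this toy model and these example numbers, the bad interleaving shows up in roughly two out of three runs, so once Y * (2 * Z) > X this isn't a rare corner case.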

Metadata

Labels

P2, azure (Specific to Azure environment), bug (Something isn't working)
