Vine: Tasks go Idle on Large Runs (intermittent) #4325

@dthain

Transcribing discussion from earlier today:

@cmoore24-24 reports that when running DV5 with Coffea+TaskVine, a large number of workers will intermittently sit idle even when there are many tasks in the queue. If the workers are idle for too long, they (correctly) ask the manager to be released, and the manager grants the request.

@btovar is able to reproduce this for large workloads running MDV5 on VineReduce. The system will be idle for ~5 minutes at a time, and then tasks start to dispatch again. One difference is that this workflow involves a large number of temporary files. When idle workers (correctly) request to be released, the manager (correctly) denies the request because they are storing temp files.

Known facts:

  • always happens for MDV5 scale-up runs, but can take a few hours to develop.
  • gdb tells us that the manager is stuck in cleanup_worker at some point.
  • only occurs in workflows that cancel running tasks.

Hypothesis:

  • Cancellation of tasks in various states is not updating data structures correctly.
  • Possibly an intrinsic problem in cleanup_worker iteration/removal.
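To illustrate the second hypothesis in general terms, here is a minimal sketch of how removing entries from a collection while iterating over it can silently skip elements. It is a Python analogy only, not the actual cleanup_worker code; the function names and the task dictionaries are hypothetical.

```python
def cleanup_buggy(tasks, worker):
    """Remove a worker's tasks while iterating the same list (skips entries)."""
    for t in tasks:
        if t["worker"] == worker:
            # Removing shifts later items left, so the iterator steps over one.
            tasks.remove(t)
    return tasks


def cleanup_safe(tasks, worker):
    """Iterate over a snapshot so removals cannot disturb the traversal."""
    for t in list(tasks):
        if t["worker"] == worker:
            tasks.remove(t)
    return tasks


def make_tasks():
    # Hypothetical bookkeeping: four tasks, all assigned to worker "w1".
    return [{"id": i, "worker": "w1"} for i in range(4)]


leftover_buggy = cleanup_buggy(make_tasks(), "w1")
leftover_safe = cleanup_safe(make_tasks(), "w1")

print([t["id"] for t in leftover_buggy])  # tasks that were never cleaned up
print([t["id"] for t in leftover_safe])   # nothing left behind
```

In the buggy variant, every other task survives the cleanup, which is the kind of stale bookkeeping that could leave a manager believing work is still assigned to a worker.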

Ways forward:

  • valgrind a large run from the beginning.
  • create a test that aggressively creates/cancels a large number of tasks, and valgrind that.
  • track the state changes of tasks, and verify that correct transitions are occurring.
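The last item could be prototyped along these lines: a small tracker that records each task's state changes and flags any transition not in an allowed table. This is a sketch under assumptions; the state names and the ALLOWED table are simplified and hypothetical, not the manager's actual state machine, and TransitionTracker is not part of TaskVine.

```python
# Hypothetical, simplified state machine; adjust to the real task states.
ALLOWED = {
    "READY": {"RUNNING", "CANCELLED"},
    "RUNNING": {"WAITING_RETRIEVAL", "READY", "CANCELLED"},
    "WAITING_RETRIEVAL": {"RETRIEVED", "CANCELLED"},
    "RETRIEVED": {"DONE"},
    "DONE": set(),
    "CANCELLED": set(),
}


class TransitionTracker:
    def __init__(self, allowed=ALLOWED, initial="READY"):
        self.allowed = allowed
        self.initial = initial
        self.state = {}        # task_id -> last recorded state
        self.violations = []   # (task_id, old_state, new_state)

    def record(self, task_id, new_state):
        """Record a state change; return False if the transition is illegal."""
        old = self.state.get(task_id)
        if old is None:
            ok = new_state == self.initial
        else:
            ok = new_state in self.allowed.get(old, set())
        if not ok:
            self.violations.append((task_id, old, new_state))
        self.state[task_id] = new_state
        return ok

    def stuck(self, terminal=("DONE", "CANCELLED")):
        """Return ids of tasks that have not reached a terminal state."""
        return sorted(t for t, s in self.state.items() if s not in terminal)


tracker = TransitionTracker()
events = [
    (1, "READY"), (1, "RUNNING"), (1, "CANCELLED"),  # legal cancellation
    (2, "READY"), (2, "DONE"),                       # illegal jump
    (3, "READY"), (3, "RUNNING"),                    # never finishes
]
for task_id, new_state in events:
    tracker.record(task_id, new_state)

print(tracker.violations)
print(tracker.stuck())
```

Feeding the tracker from the manager's log of state changes during a create/cancel stress test would show whether cancelled tasks ever take an unexpected path or are left in a non-terminal state.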

Labels: TaskVine, bug
