Description
Transcribing discussion from earlier today:
@cmoore24-24 reports that running DV5 with Coffea+TaskVine, (intermittently) a large number of workers will be idle even when there are many tasks in the queue. If the workers are idle for too long, then they (correctly) request to be released from the manager, which grants permission.
@btovar is able to reproduce this for large workloads running MDV5 on VineReduce. The system will be idle for ~5 minutes at a time, and then tasks start to dispatch again. One difference is that this workflow involves a large number of temporary files. When idle workers (correctly) request to be released, the manager (correctly) denies the request because they are storing temp files.
Known facts:
- always happens for MDV5 scale up runs, but can take a few hours to develop.
- `gdb` tells us that the manager is stuck in `cleanup_worker` at some point.
- only occurs in workflows that cancel running tasks.
Hypothesis:
- Cancellation of tasks in various states is not updating data structures correctly.
- Possibly an intrinsic problem in `cleanup_worker` iteration/removal.
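
As a generic illustration of the second hypothesis (this is not the actual `cleanup_worker` code, which is C), the sketch below shows how removing entries from a collection while iterating over it can silently skip items. The function names and task ids are hypothetical.

```python
def cleanup_unsafe(tasks):
    """Remove every task while iterating the same list (buggy)."""
    for t in tasks:
        tasks.remove(t)  # shifts later items left, so the iterator skips one
    return tasks


def cleanup_safe(tasks):
    """Iterate over a snapshot so removals cannot disturb the traversal."""
    for t in list(tasks):
        tasks.remove(t)
    return tasks


leftover_unsafe = cleanup_unsafe([1, 2, 3, 4])
leftover_safe = cleanup_safe([1, 2, 3, 4])
print(leftover_unsafe, leftover_safe)
```

If something similar were happening in the manager, some tasks attached to a worker could be left behind after cleanup, which would be consistent with (but does not prove) the stale state described above.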
Ways forward:
- valgrind a large run from the beginning.
- create a test that aggressively creates/cancels a large number of tasks, and run it under valgrind
- track the state changes of tasks, and verify correct transitions are occurring
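
The last item could be prototyped along these lines: a minimal sketch of a transition checker, assuming a simplified task state machine. The state names and the `ALLOWED` table are hypothetical placeholders, not TaskVine's real state set, and would need to be replaced with the manager's actual states.

```python
# Hypothetical states and allowed transitions, for illustration only.
ALLOWED = {
    "INITIAL": {"READY"},
    "READY": {"RUNNING", "CANCELLED"},
    "RUNNING": {"WAITING_RETRIEVAL", "READY", "CANCELLED"},
    "WAITING_RETRIEVAL": {"RETRIEVED", "CANCELLED"},
    "RETRIEVED": {"DONE"},
    "DONE": set(),
    "CANCELLED": set(),
}


class TransitionTracker:
    """Records each task's state and flags transitions not in ALLOWED."""

    def __init__(self):
        self.state = {}        # task_id -> last recorded state
        self.violations = []   # (task_id, old_state, new_state)

    def record(self, task_id, new_state):
        old = self.state.get(task_id, "INITIAL")
        if new_state not in ALLOWED.get(old, set()):
            self.violations.append((task_id, old, new_state))
        self.state[task_id] = new_state


tracker = TransitionTracker()
for state in ("READY", "RUNNING", "CANCELLED"):
    tracker.record(1, state)
tracker.record(1, "RUNNING")  # a cancelled task must not run again
print(tracker.violations)
```

Feeding such a tracker from the manager's log of state changes would flag any cancelled task that later reappears in a running or ready state.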