Open
Labels
TaskVine, bug (For modifications that fix a flaw in the code.)
Description
Transcribing discussion from earlier today:
@cmoore24-24 reports that some runs of DV5 using Coffea and TaskVine result in (some) workers not responding to keepalive messages. The manager eventually notices and (correctly) disconnects them. But why are they not responding?
Hypotheses:
- The machine itself could be under heavy load, so the worker gets swapped out. If this is the case, it is not a TaskVine problem, but we should still understand why.
- Workers running in HTCondor can be suspended and appear nonresponsive. This is also not a TaskVine problem, but the worker factory should be generating a submit file that turns suspension into preemption. This can be verified by looking at the condor user log file produced by the factory.
- The worker could be (incorrectly) stuck in a housekeeping task such as measuring task sandboxes, or in attempting to kill a process that refuses to be killed. The former shouldn't happen because the measuring code is time-bounded. The latter could happen if the kernel refuses to terminate a process that is (for example) unkillably stuck in distributed filesystem access.
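To check the HTCondor hypothesis, the condor user log can be scanned for suspension events. The following is a minimal sketch, assuming the standard text-format user log, in which each event begins with a three-digit code followed by the job ID in parentheses (004 for evicted, 010 for suspended, 011 for unsuspended). The function name `count_events` and the sample lines are illustrative only and do not come from the factory.

```python
import re
from collections import Counter

# HTCondor user-log event codes of interest (three-digit prefix of each event header).
EVENTS = {"004": "evicted", "010": "suspended", "011": "unsuspended"}

# Event header, e.g. "010 (1234.000.000) 07/10 12:00:00 Job was suspended."
HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")


def count_events(lines):
    """Count eviction/suspension events per job (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        m = HEADER.match(line)
        if m and m.group(1) in EVENTS:
            # Normalize "1234.000" to "1234.0" (cluster.proc).
            job = f"{int(m.group(2))}.{int(m.group(3))}"
            counts[(job, EVENTS[m.group(1)])] += 1
    return counts


# Made-up sample lines in the user-log layout, for illustration only.
sample = [
    "001 (1234.000.000) 07/10 11:58:02 Job executing on host: <10.0.0.1:9618>",
    "010 (1234.000.000) 07/10 12:00:00 Job was suspended.",
    "\tNumber of processes actually suspended: 1",
    "011 (1234.000.000) 07/10 12:05:00 Job was unsuspended.",
    "004 (1235.000.000) 07/10 12:06:00 Job was evicted.",
]

for (job, event), n in sorted(count_events(sample).items()):
    print(job, event, n)
```

On a real log, the same function can be given an open file object instead of `sample`. Any `suspended` entries for worker jobs would suggest that the submit file is not turning suspension into preemption as intended.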
Going forward:
@JinZhou5042 will configure the factory to send worker debug files to a known location in the shared filesystem, and then we should be able to see which are stuck and why.