Vine: Workers Fail Keepalive Check #4326

@dthain

Transcribing discussion from earlier today:

@cmoore24-24 reports that some runs of DV5 using Coffea and TaskVine leave some workers unresponsive to keepalive messages. The manager eventually notices and (correctly) disconnects them. But why do they stop responding?

Hypotheses:

  • The machine itself could be under heavy load, causing the worker to be swapped out. If so, it is not a TaskVine problem, but we should still understand why.
  • Workers running in HTCondor can be suspended and so appear nonresponsive. This is also not a TaskVine problem, but the worker factory should be generating a submit file that turns suspension into preemption. This can be verified by looking at the HTCondor user log file produced by the factory.
  • The worker could be (incorrectly) stuck in a housekeeping task such as measuring task sandboxes, or in attempting to kill a process that refuses to be killed. The former shouldn't happen because the measuring code is time-bounded. The latter could happen if the kernel refuses to terminate a process that is (for example) unkillably stuck in distributed filesystem access.
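For the suspension hypothesis, the user log check could be scripted roughly as follows. This is a minimal sketch, not part of the factory: it assumes the log uses the classic HTCondor user log text format, in which each event begins with a three-digit code and a `(cluster.proc.subproc)` job id, with code 010 meaning suspended, 011 unsuspended, and 004 evicted. The function names `count_events` and `suspended_jobs` are made up for illustration.

```python
import re
from collections import Counter

# Event codes of interest in the HTCondor user log (assumed classic text format).
EVENTS = {"004": "evicted", "010": "suspended", "011": "unsuspended"}

# Header line of an event, e.g. "010 (1234.005.000) 07/10 12:00:00 Job was suspended."
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")


def count_events(log_text):
    """Return {(cluster, proc): Counter of event names} for the events above."""
    per_job = {}
    for line in log_text.splitlines():
        m = EVENT_HEADER.match(line)
        if not m or m.group(1) not in EVENTS:
            continue  # detail lines, "..." separators, and other event types
        job = (int(m.group(2)), int(m.group(3)))
        per_job.setdefault(job, Counter())[EVENTS[m.group(1)]] += 1
    return per_job


def suspended_jobs(log_text):
    """Sorted list of (cluster, proc) ids that were suspended at least once."""
    return sorted(
        job for job, counts in count_events(log_text).items()
        if counts["suspended"] > 0
    )
```

If the submit file really does turn suspension into preemption, `suspended_jobs` should come back empty for a factory log, and any nonresponsive worker should instead show up with an eviction event.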

Going forward:
@JinZhou5042 will configure the factory to send worker debug files to a known location in the shared filesystem, and then we should be able to see which workers are stuck and why.
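Once the debug files are collected, one quick first pass might be to rank them by how long ago each was last written. The sketch below assumes one debug file per worker in a single directory and that a healthy worker keeps appending to its file; both are assumptions, as is the helper name `stale_logs`. Modification times on a shared filesystem can lag, so the result is only a hint about where to look first.

```python
import os
import time


def stale_logs(directory, max_age_seconds, now=None):
    """Return paths of regular files not modified within max_age_seconds,
    ordered from most stale to least stale."""
    now = time.time() if now is None else now
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue  # skip subdirectories and other non-file entries
            age = now - entry.stat().st_mtime
            if age > max_age_seconds:
                stale.append((age, entry.path))
    return [path for age, path in sorted(stale, reverse=True)]
```

The files at the front of the list belong to the workers that went quiet earliest, so their last few log lines are the natural place to start reading.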

Labels

TaskVine, bug

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions