Open
Description
I'm currently observing runners that get stuck after this output:
[INFO ] SLURM #9281: Running on gpuc2 in `/software-local/cirunner/slurmactiond/nvidia/3`
[INFO ] Runner 20 (slurm-nvidia-9281) connected for target nvidia
[DEBUG] SLURM #9281: Re-using existing Github Actions Runner installation
[INFO ] SLURM #9281: Generating runner registration token for celerity
[DEBUG] SLURM #9281: Sending GitHub API request POST https://api.github.com/orgs/celerity/actions/runners/registration-token
[INFO ] SLURM #9281: Unregistering previous runner if present
[DEBUG] SLURM #9281: config.sh:
[DEBUG] SLURM #9281: config.sh: # Runner removal
[DEBUG] SLURM #9281: config.sh:
[DEBUG] SLURM #9281: config.sh: Cannot connect to server, because config files are missing. Skipping removing runner from the server.
[DEBUG] SLURM #9281: config.sh: Does not exist. Skipping Removing .credentials
[DEBUG] SLURM #9281: config.sh: Does not exist. Skipping Removing .runner
[DEBUG] SLURM #9281: config.sh:
[INFO ] SLURM #9281: Registering runner slurm-nvidia-9281
[DEBUG] SLURM #9281: config.sh:
[DEBUG] SLURM #9281: config.sh: --------------------------------------------------------------------------------
[DEBUG] SLURM #9281: config.sh: |        ____ _ _   _   _       _          _        _   _                      |
[DEBUG] SLURM #9281: config.sh: |       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
[DEBUG] SLURM #9281: config.sh: |      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
[DEBUG] SLURM #9281: config.sh: |      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
[DEBUG] SLURM #9281: config.sh: |       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
[DEBUG] SLURM #9281: config.sh: |                                                                              |
[DEBUG] SLURM #9281: config.sh: |                       Self-hosted runner registration                        |
[DEBUG] SLURM #9281: config.sh: |                                                                              |
[DEBUG] SLURM #9281: config.sh: --------------------------------------------------------------------------------
[DEBUG] SLURM #9281: config.sh:
[DEBUG] SLURM #9281: config.sh: # Authentication
[DEBUG] SLURM #9281: config.sh:
slurmactiond could detect this with a timeout between runner_connected and runner_listening: a runner that connects but does not start listening within the limit is stuck in registration.
We should consider the failure mode of a runner getting stuck during job execution as well.
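To make the proposal concrete, here is a minimal sketch of such a watchdog, covering both the registration gap and the job-execution case. It is not slurmactiond's actual code: the `Phase`, `Timeouts` and `Watchdog` types, the runner ids and the timeout values are all hypothetical, and only the names runner_connected and runner_listening come from this issue. What to do with a stuck runner (for example, cancelling its SLURM job) is left to the caller.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical runner lifecycle phases (not slurmactiond's real types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Connected, // after runner_connected, before runner_listening
    Listening, // registered and idle, waiting for a job
    Running,   // executing a job
}

/// Hypothetical per-phase limits; `None` means "no limit".
struct Timeouts {
    registration: Duration,
    job: Option<Duration>,
}

struct Watchdog {
    timeouts: Timeouts,
    // runner id -> (current phase, time that phase was entered)
    runners: HashMap<u32, (Phase, Instant)>,
}

impl Watchdog {
    fn new(timeouts: Timeouts) -> Self {
        Watchdog { timeouts, runners: HashMap::new() }
    }

    /// Record that a runner entered a new phase at time `now`.
    fn transition(&mut self, runner: u32, phase: Phase, now: Instant) {
        self.runners.insert(runner, (phase, now));
    }

    /// Forget a runner that exited on its own.
    fn remove(&mut self, runner: u32) {
        self.runners.remove(&runner);
    }

    /// Ids of all runners that have overstayed their current phase, sorted.
    fn stuck(&self, now: Instant) -> Vec<u32> {
        let mut ids: Vec<u32> = self
            .runners
            .iter()
            .filter(|(_, (phase, since))| {
                let limit = match phase {
                    Phase::Connected => Some(self.timeouts.registration),
                    Phase::Listening => None, // idling is not a failure
                    Phase::Running => self.timeouts.job,
                };
                limit.map_or(false, |l| now.duration_since(*since) > l)
            })
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

Passing `now` explicitly instead of calling `Instant::now()` inside `stuck` keeps the check deterministic, so it can be driven from a periodic tick and tested without sleeping.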