Periodically check if a runner is stuck #9

@fknorr

I'm currently observing runners that get stuck after this output:

[INFO ] SLURM #9281: Running on gpuc2 in `/software-local/cirunner/slurmactiond/nvidia/3`
[INFO ] Runner 20 (slurm-nvidia-9281) connected for target nvidia
[DEBUG] SLURM #9281: Re-using existing Github Actions Runner installation
[INFO ] SLURM #9281: Generating runner registration token for celerity
[DEBUG] SLURM #9281: Sending GitHub API request POST https://api.github.com/orgs/celerity/actions/runners/registration-token
[INFO ] SLURM #9281: Unregistering previous runner if present
[DEBUG] SLURM #9281: config.sh:
[DEBUG] SLURM #9281: config.sh: # Runner removal
[DEBUG] SLURM #9281: config.sh:
[DEBUG] SLURM #9281: config.sh: Cannot connect to server, because config files are missing. Skipping removing runner from the server.
[DEBUG] SLURM #9281: config.sh: Does not exist. Skipping Removing .credentials
[DEBUG] SLURM #9281: config.sh: Does not exist. Skipping Removing .runner
[DEBUG] SLURM #9281: config.sh:
[INFO ] SLURM #9281: Registering runner slurm-nvidia-9281
[DEBUG] SLURM #9281: config.sh:
[DEBUG] SLURM #9281: config.sh: --------------------------------------------------------------------------------
[DEBUG] SLURM #9281: config.sh: |        ____ _ _   _   _       _          _        _   _                      |
[DEBUG] SLURM #9281: config.sh: |       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
[DEBUG] SLURM #9281: config.sh: |      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
[DEBUG] SLURM #9281: config.sh: |      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
[DEBUG] SLURM #9281: config.sh: |       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
[DEBUG] SLURM #9281: config.sh: |                                                                              |
[DEBUG] SLURM #9281: config.sh: |                       Self-hosted runner registration                        |
[DEBUG] SLURM #9281: config.sh: |                                                                              |
[DEBUG] SLURM #9281: config.sh: --------------------------------------------------------------------------------
[DEBUG] SLURM #9281: config.sh:
[DEBUG] SLURM #9281: config.sh: # Authentication
[DEBUG] SLURM #9281: config.sh:

This could be detected in slurmactiond through a timeout between runner_connected and runner_listening.

We should consider the failure mode of a runner getting stuck during job execution as well.
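
A minimal sketch of what such a periodic check might look like, written in Python for illustration only (this is not slurmactiond's actual code). It assumes each runner records when it entered its current state and that every state can carry an optional deadline. The state names mirror runner_connected and runner_listening from above; the `RUNNING` state, the `Runner` type, `find_stuck_runners`, and all timeout values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class RunnerState(Enum):
    CONNECTED = auto()  # runner_connected seen, still registering
    LISTENING = auto()  # runner_listening seen, idle and waiting for a job
    RUNNING = auto()    # hypothetical: a job is currently executing


# Hypothetical per-state deadlines in seconds; None disables the check.
STATE_TIMEOUTS = {
    RunnerState.CONNECTED: 300.0,
    RunnerState.LISTENING: None,
    RunnerState.RUNNING: 6 * 3600.0,
}


@dataclass
class Runner:
    name: str
    state: RunnerState
    state_since: float  # monotonic timestamp of the last state change

    def transition(self, new_state, now):
        """Enter a new state and restart its deadline."""
        self.state = new_state
        self.state_since = now


def find_stuck_runners(runners, now, timeouts=STATE_TIMEOUTS):
    """Return the runners that have exceeded the deadline of their current state."""
    stuck = []
    for runner in runners:
        limit = timeouts.get(runner.state)
        if limit is not None and now - runner.state_since > limit:
            stuck.append(runner)
    return stuck
```

A periodic task could call `find_stuck_runners(runners, time.monotonic())` and cancel the SLURM job behind each returned runner, for example with `scancel`. Using one table of per-state deadlines would cover both the registration hang shown in the log and a hang during job execution, at the cost of having to pick a sensible upper bound for job duration.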
