Pyxis does not remove data directory on job requeue #161

@shapovalovts

When a job finishes, the data of an unnamed container is removed automatically (pyxis calls "enroot remove ..."), and this works well. However, when a running job with an unnamed container is requeued, the deletion does not actually happen. The requeued job then simply fails, because enroot complains that the directory already exists.

According to the logs, "enroot remove" is called on requeue, but it fails and the directory remains:


taras@ts-tr-u24-enroot:~$ ssh node001 ls -l /home/taras/.local/share/enroot
ls: cannot access '/home/taras/.local/share/enroot': No such file or directory

taras@ts-tr-u24-enroot:~$ sbatch --container-image=busybox --wrap="sleep 3600"
Submitted batch job 2
taras@ts-tr-u24-enroot:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2      defq     wrap    taras  R       0:03      1 node001
taras@ts-tr-u24-enroot:~$ 

taras@ts-tr-u24-enroot:~$ ssh node001 ls -l /home/taras/.local/share/enroot/
total 0
drwxrwxr-x 14 taras taras 166 Mar 13 18:57 pyxis_2.4294967291
taras@ts-tr-u24-enroot:~$ 
taras@ts-tr-u24-enroot:~$ scontrol requeue 2
taras@ts-tr-u24-enroot:~$ ssh node001 ls -l /home/taras/.local/share/enroot/
total 0
drwxrwxr-x 4 taras taras 28 Mar 13 18:58 pyxis_2.4294967291
taras@ts-tr-u24-enroot:~$ 

From slurmd log:

[2025-03-13T18:58:20.616] [2.batch] pyxis: running enroot command: "enroot remove -f pyxis_2.4294967291"
...
[2025-03-13T18:58:21.180] [2.batch] error: pyxis: child 5795 failed with error code: 1
[2025-03-13T18:58:21.180] [2.batch] pyxis: failed to remove container filesystem: pyxis_2.4294967291
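
To check a node for directories left behind in this way, a small helper along the following lines could be used. This is only a sketch: `list_stale_pyxis_dirs` is a hypothetical function, not part of pyxis or enroot, and it assumes that unnamed containers are stored as `pyxis_<jobid>.<suffix>` directories under the enroot data directory, as in the listings above.

```shell
#!/bin/sh
# Hypothetical diagnostic helper, not part of pyxis or enroot.
# Prints the unnamed-container directories that exist for one job ID.
#   $1: enroot data directory (e.g. /home/taras/.local/share/enroot)
#   $2: Slurm job ID
# Assumption: unnamed containers are named pyxis_<jobid>.<suffix>,
# as seen in the directory listings of this report.
list_stale_pyxis_dirs() {
    data_dir=$1
    job_id=$2
    for dir in "$data_dir"/pyxis_"$job_id".*; do
        # An unmatched glob stays literal, so confirm the path is a directory.
        [ -d "$dir" ] && printf '%s\n' "$dir"
    done
    return 0
}
```

Running `list_stale_pyxis_dirs /home/taras/.local/share/enroot 2` on the compute node after the requeue lists any directory that "enroot remove" failed to delete. It only reports paths and does not remove anything.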

The config:

[root@node001 ~]# grep -v '^#\|^$' /etc/enroot/enroot.conf
ENROOT_SQUASH_OPTIONS      -noI -noD -noF -noX -no-duplicates
ENROOT_MOUNT_HOME          no
ENROOT_CONFIG_PATH         ${HOME}/.config/enroot
ENROOT_RESTRICT_DEV        yes
ENROOT_ROOTFS_WRITABLE     no
ENROOT_ZSTD_OPTIONS        -1
ENROOT_TRANSFER_RETRIES    5
ENROOT_CONNECT_TIMEOUT     60
ENROOT_TRANSFER_TIMEOUT    1200
ENROOT_MAX_CONNECTIONS     10
[root@node001 ~]# 

Reproduced on Ubuntu 24.04 and Rocky 9 with Slurm 24.05.6 and pyxis 0.20.0. A detailed slurmd log is attached.

slurmd.log