When a job finishes, the data of an unnamed container is removed automatically (pyxis calls "enroot remove ..."). This works well. But when we requeue a running job that uses an unnamed container, the deletion does not actually happen. The requeued job then simply fails, because enroot complains that the directory already exists.
According to the logs, "enroot remove" is called on requeue, but it fails and the directory remains:
taras@ts-tr-u24-enroot:~$ ssh node001 ls -l /home/taras/.local/share/enroot
ls: cannot access '/home/taras/.local/share/enroot': No such file or directory
taras@ts-tr-u24-enroot:~$ sbatch --container-image=busybox --wrap="sleep 3600"
Submitted batch job 2
taras@ts-tr-u24-enroot:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 defq wrap taras R 0:03 1 node001
taras@ts-tr-u24-enroot:~$
taras@ts-tr-u24-enroot:~$ ssh node001 ls -l /home/taras/.local/share/enroot/
total 0
drwxrwxr-x 14 taras taras 166 Mar 13 18:57 pyxis_2.4294967291
taras@ts-tr-u24-enroot:~$
taras@ts-tr-u24-enroot:~$ scontrol requeue 2
taras@ts-tr-u24-enroot:~$ ssh node001 ls -l /home/taras/.local/share/enroot/
total 0
drwxrwxr-x 4 taras taras 28 Mar 13 18:58 pyxis_2.4294967291
taras@ts-tr-u24-enroot:~$
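A quick way to check a node for this state is to look for leftover pyxis_<jobid>.* directories in the enroot data directory. Below is a minimal Python sketch of that check. The helper name leftover_containers is made up for illustration, and the default path assumes the ~/.local/share/enroot location shown in the listing above; a site that sets ENROOT_DATA_PATH would need to pass that path instead.

```python
from pathlib import Path


def leftover_containers(job_id, data_dir=None):
    """Return sorted names of pyxis_<job_id>.* directories found in data_dir.

    Hypothetical helper for illustration; not part of pyxis or enroot.
    """
    # Assumption: container filesystems live under ~/.local/share/enroot,
    # as in the listing above. Pass data_dir explicitly if your site differs.
    if data_dir is not None:
        root = Path(data_dir)
    else:
        root = Path.home() / ".local" / "share" / "enroot"

    # A missing data directory means there is nothing left over.
    if not root.is_dir():
        return []

    # The trailing dot keeps job 2 from matching job 20, 21, ...
    prefix = f"pyxis_{job_id}."
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and entry.name.startswith(prefix)
    )
```

Run on node001 after the requeue above, leftover_containers(2) would be expected to report the pyxis_2.4294967291 directory that is still present.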
From the slurmd log:
[2025-03-13T18:58:20.616] [2.batch] pyxis: running enroot command: "enroot remove -f pyxis_2.4294967291"
...
[2025-03-13T18:58:21.180] [2.batch] error: pyxis: child 5795 failed with error code: 1
[2025-03-13T18:58:21.180] [2.batch] pyxis: failed to remove container filesystem: pyxis_2.4294967291
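To find other affected jobs, the slurmd log can be scanned for the same failure message. This is a minimal sketch that assumes the line format shown in the excerpt above (timestamp, step, then the pyxis message). failed_removals is a hypothetical helper name, and where the log lives depends on the SlurmdLogFile setting.

```python
import re

# Matches only the "failed to remove container filesystem" line, in the
# format shown in the excerpt above. Other pyxis lines are ignored.
FAILED_REMOVE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<step>[^\]]+)\] "
    r"pyxis: failed to remove container filesystem: (?P<name>\S+)"
)


def failed_removals(lines):
    """Return (timestamp, step, container_name) for each failed removal."""
    found = []
    for line in lines:
        match = FAILED_REMOVE.match(line.strip())
        if match:
            found.append(
                (match["timestamp"], match["step"], match["name"])
            )
    return found
```

Fed the log lines quoted above, this returns a single entry for step 2.batch and container pyxis_2.4294967291. To scan a whole log, pass an open file object, since iterating a file yields its lines.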
The config:
[root@node001 ~]# grep -v '^#\|^$' /etc/enroot/enroot.conf
ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates
ENROOT_MOUNT_HOME no
ENROOT_CONFIG_PATH ${HOME}/.config/enroot
ENROOT_RESTRICT_DEV yes
ENROOT_ROOTFS_WRITABLE no
ENROOT_ZSTD_OPTIONS -1
ENROOT_TRANSFER_RETRIES 5
ENROOT_CONNECT_TIMEOUT 60
ENROOT_TRANSFER_TIMEOUT 1200
ENROOT_MAX_CONNECTIONS 10
[root@node001 ~]#
Reproduced on Ubuntu 24.04 and Rocky 9 with Slurm 24.05.6 and pyxis 0.20.0. A detailed slurmd log is attached.