Description
Hello,
I ran a container interactively using srun to do some development. Then I wanted to start a job based on that container using sbatch. The problem I now have is that the environment variables inside the container differ when the container is run via sbatch instead of srun.
This is the srun command I used:
#!/bin/bash -eux
srun \
--account students \
--container-image /data/pytorch.sqsh \
--container-mounts /data:/data \
--container-name test_container1 \
--container-writable \
--partition gpu-interactive \
--cpus-per-task 64 \
--mem 64gb \
--gpus 1 \
--time 8:00:00 \
--nodelist gx01 \
python training.py
The file pytorch.sqsh was created by running: enroot import -o pytorch.sqsh 'docker://nvcr.io#nvidia/pytorch:25.05-py3'
I created the following sbatch file based on the srun script:
#!/bin/bash -eux
#SBATCH --job-name=finetuning
#SBATCH --output=finetuning_%j.log
#SBATCH --error=finetuning_%j.err
#SBATCH --account=students
#SBATCH --container-image=/data/pytorch.sqsh
#SBATCH --container-mounts=/data:/data
#SBATCH --container-name=test_container1
#SBATCH --container-writable
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=64
#SBATCH --mem=64gb
#SBATCH --gpus=1
#SBATCH --time=48:00:00
#SBATCH --nodelist=gx01
python training.py
When running training.py, PyTorch complains that the environment variable WORLD_SIZE is not set. By running env in both scenarios, I could verify that WORLD_SIZE is indeed only set when using srun. What could be the reason for the different environment variables? The only thing that differs between the two runs is the partition I am running the job on. Could that be the reason for the difference in environment variables?
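For reference, this is roughly how I compared the two environments. It is only a sketch of the check described above; the dump file names under /data are placeholders I chose, and the srun/sbatch options are the same as in the scripts above.
# 1) In the interactive srun command, replace "python training.py" with:
#      env | sort > /data/env_srun.txt
# 2) In the sbatch script, replace "python training.py" with:
#      env | sort > /data/env_sbatch.txt
# 3) Afterwards, compare the two dumps, focusing on the distributed-training variables:
grep -E 'WORLD_SIZE|RANK|LOCAL_RANK|MASTER_ADDR|SLURM_' /data/env_srun.txt /data/env_sbatch.txt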