Question: Different environment variables set when using srun or sbatch #167

@Sohn123

Hello,
I ran a container interactively with srun to do some development. I then wanted to start a job with sbatch based on that container. The problem is that the environment variables inside the container differ depending on whether the container is run with sbatch or srun.

This is the srun command I used:

#!/bin/bash -eux
srun \
        --account students \
        --container-image /data/pytorch.sqsh \
        --container-mounts /data:/data \
        --container-name test_container1 \
        --container-writable \
        --partition gpu-interactive \
        --cpus-per-task 64 \
        --mem 64gb \
        --gpus 1 \
        --time 8:00:00 \
        --nodelist gx01 \
        python training.py

The file pytorch.sqsh was created by running: enroot import -o pytorch.sqsh 'docker://nvcr.io#nvidia/pytorch:25.05-py3'

I created the following sbatch file based on the srun script:

#!/bin/bash -eux
#SBATCH --job-name=finetuning
#SBATCH --output=finetuning_%j.log
#SBATCH --error=finetuning_%j.err
#SBATCH --account=students
#SBATCH --container-image=/data/pytorch.sqsh
#SBATCH --container-mounts=/data:/data
#SBATCH --container-name=test_container1
#SBATCH --container-writable
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=64
#SBATCH --mem=64gb
#SBATCH --gpus=1
#SBATCH --time=48:00:00
#SBATCH --nodelist=gx01

python training.py

When running training.py via sbatch, PyTorch complains that the environment variable WORLD_SIZE is not set. By running env in both scenarios, I verified that WORLD_SIZE is indeed only set when using srun. What could be the reason for the different environment variables? Apart from the job name, the log files, and the time limit, the only setting that differs is the partition the job runs on. Could that be the reason for the difference in environment variables?
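To make the env comparison described above reproducible, one option is to save the output of env from each scenario to a file and list the variable names that appear in only one of them. The following is a minimal sketch, not part of the original report: the helper name env_only_in and the file names in the usage note are hypothetical, and it assumes each variable occupies a single NAME=value line (multi-line values would produce spurious names).

```shell
# Hypothetical helper: print the names of variables that appear in the
# first env dump but not in the second. Each argument is a file produced
# by `env`, assumed to contain one NAME=value entry per line.
env_only_in() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    # Keep only the variable names, sorted and de-duplicated.
    cut -d= -f1 "$1" | LC_ALL=C sort -u > "$a"
    cut -d= -f1 "$2" | LC_ALL=C sort -u > "$b"
    # comm -23 keeps lines that are unique to the first file.
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, after running env > env_srun.txt inside the srun session and env > env_sbatch.txt inside the batch script, env_only_in env_srun.txt env_sbatch.txt would list the names set only under srun, which per the observation above should include WORLD_SIZE.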
