
How to run a Slurm job across multiple nodes #164

@GKarbon

Hi,

I'm trying to run a job on 2 nodes with Slurm and Pyxis; PMIx is available.
The container works well when running on a single node.

Submit script

#!/usr/bin/bash
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH -J job
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

export OMP_NUM_THREADS=1
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

srun  --container-mounts=./input \
        --container-workdir=./input \
        --container-image=runtime.sqsh \
        mpirun -np 8 bin_std

Error output

Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

After searching, I tried adding export OMPI_MCA_plm=^slurm; now the error is:

--------------------------------------------------------------------------
The SLURM process starter for OpenMPI was unable to locate a
usable "srun" command in its path. Please check your path
and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[7357,0],0] FORCE-TERMINATE AT (null):1 - error ../../../../../orte/mca/plm/slurm/plm_slurm_module.c(475)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

What is the best practice for running this kind of multi-node job?
Do I need to install srun inside the container?

Thanks!
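
For comparison, one commonly suggested alternative is to drop mpirun and let srun start all 16 ranks itself through PMIx, so that nothing inside the container needs ssh or its own srun. The following is only a sketch and has not been verified on this cluster. It assumes that Slurm here accepts --mpi=pmix and that the MPI library inside runtime.sqsh was built with PMIx support compatible with the host. The dry-run branch at the end is a hypothetical convenience for checking the command on a machine without Slurm.

```shell
#!/usr/bin/bash
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH -J job
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

export OMP_NUM_THREADS=1
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

# srun launches one MPI rank per Slurm task (16 in total, 8 per node),
# so mpirun is not called inside the container.
# Assumption: --mpi=pmix is supported by this Slurm installation.
launch=(srun --mpi=pmix
        --container-mounts=./input
        --container-workdir=./input
        --container-image=runtime.sqsh
        bin_std)

# Run the command when srun is available; otherwise only print it (dry run).
if command -v srun >/dev/null 2>&1; then
    "${launch[@]}"
else
    printf '%s\n' "${launch[*]}"
fi
```

Submitted with sbatch, this would run bin_std once per task instead of starting a separate mpirun -np 8 in each of the 16 tasks.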
