Hi,
I'm trying to run a job on 2 nodes with Slurm and Pyxis; PMIx is available.
The container works well when running on a single node.
Submit script
#!/usr/bin/bash
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH -J job
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none
export OMP_NUM_THREADS=1
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
srun --container-mounts=./input \
--container-workdir=./input \
--container-image=runtime.sqsh \
mpirun -np 8 bin_stdError output
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
After searching, I tried adding export OMPI_MCA_plm=^slurm; now the error is:
--------------------------------------------------------------------------
The SLURM process starter for OpenMPI was unable to locate a
usable "srun" command in its path. Please check your path
and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An internal error has occurred in ORTE:
[[7357,0],0] FORCE-TERMINATE AT (null):1 - error ../../../../../orte/mca/plm/slurm/plm_slurm_module.c(475)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
What is the best practice for running a multi-node MPI job in a container like this?
Do I need to install srun inside the container?
Thanks!
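
For reference, a pattern commonly suggested for Pyxis is to let srun start the MPI ranks directly through PMIx instead of calling mpirun inside the container. The following is only an untested sketch under several assumptions: that `srun --mpi=list` on the cluster reports pmix, that the Open MPI build inside the container has PMIx support, and that `./my_app` (a hypothetical placeholder, not the real binary name) is the MPI program. The container flags are copied unchanged from the submit script above.

```bash
#!/usr/bin/bash
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH -J job
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

export OMP_NUM_THREADS=1
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

# Assumption: "srun --mpi=list" on this cluster includes pmix.
# srun launches one MPI rank per Slurm task (16 here), so mpirun does not
# have to reach the second node over ssh or find srun inside the container.
# ./my_app is a hypothetical placeholder for the real MPI binary.
srun --mpi=pmix \
     --container-mounts=./input \
     --container-workdir=./input \
     --container-image=runtime.sqsh \
     ./my_app
```

In this layout the rank count comes from --ntasks rather than from mpirun -np, so if only 8 ranks in total are intended, --ntasks would need to be lowered to match.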