This is the README file for the
PyLauncher
by
Victor Eijkhout eijkhout@tacc.utexas.edu
The pylauncher is a Python-based parametric job launcher, that is, a utility for executing many small runs in one batch submission. On many batch-based cluster computers this is a better strategy than submitting many individual small jobs.
The latest version of the pylauncher is always available from the repository: https://github.com/TACC/pylauncher
The only sources required for running are pylauncher.py and hostlist.py (if the latter is already installed on your system, you do not even need that).
The pylauncher is used from inside a (Slurm or PBS or SGE or whatever) job script. In your job script make sure that pylauncher.py is where python can find it, then:
python my_pylauncher_script.py
where the script contains
import pylauncher
pylauncher.ClassicLauncher( commandlinesfile )
Here ClassicLauncher is the simplest launcher -- more sophisticated launchers are discussed below, and commandlinesfile is a file containing one commandline per line.
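Putting the pieces together, a job script might look like the following sketch for Slurm; the job name, resource values, the PYTHONPATH location, and the script name my_pylauncher_script.py are example values, not prescriptions:

```shell
#!/bin/bash
#SBATCH -J mysweep        # job name        (example value)
#SBATCH -N 2              # number of nodes (example value)
#SBATCH -t 01:00:00       # wall time       (example value)

# make sure pylauncher.py (and hostlist.py) can be found by python:
export PYTHONPATH=$PYTHONPATH:$HOME/pylauncher

python my_pylauncher_script.py
```

For PBS or SGE, replace the #SBATCH directives with the equivalents of your batch system.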
Your file of commandlines can be simple, containing a program invocation for a sequence of parameters:
./my_program 1
./my_program 2
./my_program 3
We will refer to these lines as "tasks".
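For a larger sweep, such a file is easily generated programmatically. A minimal sketch, in which "./my_program" and the parameter range are placeholders for your own setup:

```python
# Sketch: generate the commandlines file for a parameter sweep.
# "./my_program" and the range 1..3 are placeholders for your own setup.
with open("commandlines", "w") as f:
    for p in range(1, 4):
        f.write(f"./my_program {p}\n")
```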
Tasks can be more complicated:
mkdir run1 && cd run1 && ./my_program 1 > out1
mkdir run2 && cd run2 && ./my_program 2 > out2
# et cetera
(Blank lines and comment lines, recognizable by the hash symbol, are ignored in this file.)
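To illustrate that rule, here is a small filter that keeps only the active task lines. This mirrors the stated behavior but is not pylauncher's own parser:

```python
def active_tasks(lines):
    """Keep only lines that are neither blank nor comments (hash-prefixed)."""
    tasks = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            tasks.append(stripped)
    return tasks
```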
In the simplest case, pylauncher assigns one task per core; as cores finish their tasks, they receive new tasks until the file of commandlines is exhausted.
Launchers, including the classic launcher, can take several options.
By default, each task gets one core.
The cores option can be used to assign more cores per task.
For example:
ClassicLauncher( "commandlines",cores=4 )
assigns four cores per task. You can use that for multi-threaded tasks; alternatively, you can use it simply to give each task four times as much memory.
If you want each task to have all the cores (and the memory)
of a node, use cores="node".
If specifying a uniform core count is limiting,
you can specify cores="file".
In this case the commandlines have the core count as prefix:
1,./simple_program
5,./medium_program
16,./big_program
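Such a line splits at the first comma into a core count and a command. A sketch of that convention (again, not pylauncher's internal parser):

```python
def parse_corecount_line(line):
    # Split "5,./medium_program" into (5, "./medium_program");
    # maxsplit=1 so any commas inside the command itself survive.
    count, command = line.split(",", 1)
    return int(count), command
```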
You can limit the runtime of individual tasks: taskmaxruntime=60 allows each task at most one minute.
By default, pylauncher outputs some statistics at the end of the run.
For purposes of tracing and debugging you can add a debug option.
Minimally, debug="job" outputs job progress.
For more output, debug="job+host+exec".
The job also produces a file queuestate.
In cases where your batch job is killed for exceeding its time limit
this file can be used to restart the job.
Each launcher run also generates a work directory.
The name by default includes the job id,
giving something like pylauncher_tmp_1234567.
- The option workdir="my_own_tmp_name" can be used to specify a non-default name.
- The work directory contains (among much more) files out0, out1, out2, et cetera, that contain the standard output and error streams of the tasks.
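After a run you will typically want to inspect those per-task output files. A minimal sketch; the work-directory name is whatever your run produced:

```python
import glob
import os

def task_outputs(workdir):
    # Return the per-task output files (out0, out1, ...) from a
    # pylauncher work directory. Note: the sort is lexicographic,
    # so out10 sorts before out2.
    return sorted(glob.glob(os.path.join(workdir, "out*")))
```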
Let me first stress that in 95 percent of cases the ClassicLauncher is the right choice. Here are some exceptional use cases.
The ClassicLauncher, through the cores option,
can handle multi-threaded parallelism but not MPI parallelism.
If your tasks involve an MPI program, use:
pylauncher.IbrunLauncher( "commandlines",cores=10 )
For tasks that need a GPU, use the GPULauncher.
Since the number of GPUs is not easily detected by the launcher,
you need to specify it with an option such as gpuspernode=3, or whatever the number is on your system.
Pylauncher uses a short delay between starting tasks.
This prevents excessive file system activity.
You can shorten this delay with the option delay=.1.
Still, it takes some time to fill up all the cores.
For this there is the option schedule="block8",
which groups tasks in blocks of 8 that are started together.
If tasks are dynamically generated by another process, you can use
job = pylauncher.DynamicLauncher() # no commandline file!
job.append( "./my_program" ) # any number of times
job.finish() # declare no more tasks
job.tick() # delay, and process tasks
If your batch job was killed for exceeding runtime,
or because of a hardware failure,
you can use the queuestate file to restart the job
and execute only the tasks that did not finish.
If you are a TACC or XSEDE/ACCESS user, please submit a ticket in the respective ticket system. Otherwise, create an issue in the GitHub repo. You can also mail me, putting "pylauncher" somewhere in the subject.