Using srun on Avoca

srun

The srun command is used to execute applications on Avoca. The slurm module, which is loaded by default, ensures that srun is in your path. Once SLURM has assigned an allocation to your job (either a portion of a block, or a whole block if you request 512 or more nodes), the allocation is initialised and your SLURM submission script should then call srun to execute your program.

The simplest usage for srun is:

srun <your program> <your program args>

A more comprehensive usage for srun is:

srun --ntasks=<total number of MPI tasks> --ntasks-per-node=<num of MPI tasks per node> <your program> <your program args>

where:

  • <num of MPI tasks per node> is the number of MPI processes/tasks to place on each 16 CPU-core node in your allocation.
  • <total number of MPI tasks> is the total number of MPI processes/tasks that the application will run across. Use this if you don't want the application to use the maximum number of tasks available to the allocation, where the maximum is ntasks-per-node multiplied by the number of nodes you request at the top of your SLURM script.

If --ntasks-per-node is not specified, the default is 16 MPI processes per node, with each MPI process having 1GB of RAM and up to four threads. The value that you should use depends on the type of parallel code that you are executing. See below for further details.
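
As a rough illustration of how --ntasks relates to --ntasks-per-node and the node count, the sketch below computes the maximum number of MPI tasks for an allocation and checks a requested total against it. The function names max_tasks and check_ntasks are hypothetical helpers, not part of SLURM, and the numbers are examples only.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SLURM) illustrating the arithmetic above.

# Maximum MPI tasks = number of nodes x tasks per node.
# Arguments: <nodes> <ntasks-per-node>
max_tasks() {
    echo $(( $1 * $2 ))
}

# Succeeds if the requested total (--ntasks) fits in the allocation.
# Arguments: <ntasks> <nodes> <ntasks-per-node>
check_ntasks() {
    [ "$1" -le "$(max_tasks "$2" "$3")" ]
}

# Example: 32 nodes at the default of 16 tasks per node.
echo "maximum tasks: $(max_tasks 32 16)"
if check_ntasks 500 32 16; then
    echo "--ntasks=500 fits in this allocation"
fi
```

A request such as --ntasks=500 on this allocation would therefore leave some of the 512 available task slots unused.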

Parallel applications

MPI applications

Applications written using only MPI need a value for the --ntasks-per-node option that is determined by the amount of memory each MPI process of the job requires. All possible values of --ntasks-per-node are listed below, along with the corresponding amount of memory per MPI process/task:

Number of MPI processes per node    Memory per MPI process
64                                  256MB
32                                  512MB
16                                  1GB
8                                   2GB
4                                   4GB
2                                   8GB
1                                   16GB
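
The table follows from dividing the 16GB of node memory evenly between the MPI processes on that node. A minimal sketch of that arithmetic, using a hypothetical helper named mem_per_task_mb:

```shell
#!/bin/sh
# Each node has 16GB (16384MB) of RAM, shared evenly between its MPI tasks.
NODE_MEM_MB=16384

# Hypothetical helper: memory in MB available to each MPI task.
# Argument: <ntasks-per-node>
mem_per_task_mb() {
    echo $(( NODE_MEM_MB / $1 ))
}

# Reproduce the table above, one row per permitted value.
for tasks in 64 32 16 8 4 2 1; do
    echo "$tasks tasks per node -> $(mem_per_task_mb "$tasks")MB per task"
done
```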

As mentioned, the default is 16, and that is a sensible starting point. This arrangement is the closest equivalent to Virtual-Node mode on the Blue Gene/P. Moving to 32 tasks per node can quite often improve performance, but it leaves each MPI process with only 512MB of RAM. Some applications may improve further at 64 MPI tasks per node, but this halves the memory per process again, to 256MB. If you receive out-of-memory errors, you may need to decrease the number of MPI tasks per node, which increases the amount of memory available to each task. Unfortunately, the ideal value of --ntasks-per-node varies from application to application, and the same application may even require a different value when run on different datasets.
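
When the memory requirement per MPI process is known, the memory constraint can be turned into an upper bound on --ntasks-per-node. The sketch below uses a hypothetical helper named pick_tasks_per_node and considers memory only; whether the largest value that fits also gives the best performance still has to be measured for each application.

```shell
#!/bin/sh
# Hypothetical helper: print the largest --ntasks-per-node value that still
# leaves at least <required MB> for each MPI process on a 16GB node.
pick_tasks_per_node() {
    required_mb=$1
    for tasks in 64 32 16 8 4 2 1; do
        if [ $(( 16384 / tasks )) -ge "$required_mb" ]; then
            echo "$tasks"
            return 0
        fi
    done
    # A process needing more than 16GB does not fit on one node.
    return 1
}

pick_tasks_per_node 1024   # each process needs 1GB
pick_tasks_per_node 600    # each process needs about 600MB
```

Both calls print 16: a 600MB process does not fit in the 512MB available at 32 tasks per node, so 16 is the largest value that works.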

As an example, assume that we have a pure MPI application running with a dataset for which each MPI process requires 1GB of RAM. This forces us to use 16 MPI processes per node. Using more than 16 is not an option because it leaves too little memory per MPI process, and using fewer wastes resources: cores go unused on each node, so we consume more SUs than necessary for a given computation. Below is a sample SLURM script that could be used for this case:

#!/bin/sh
# Here we're asking for two days
#SBATCH --time=2-0
# We want 32 nodes (512 CPU cores)
#SBATCH --nodes=32
#SBATCH --job-name="purempi_example"
#SBATCH --output="purempi.out"

srun --ntasks-per-node=16 ~/bin/mpiheat/heat_2d ~/data/mpiheat/initial2d-512.in

Multithreaded applications (pthreads, OpenMP)

Applications that have been parallelised using POSIX Threads (pthreads) or OpenMP can be run on a single Blue Gene/Q node. A single node comprises 16 cores, each of which can execute four threads in hardware, so your multithreaded application can use up to 64 threads. The entire 16GB of memory on the node is also available to your multithreaded application. You must run with --ntasks-per-node=1 for this type of application. Below is a sample SLURM script that could be used for a multithreaded application:

#!/bin/sh
# Here we're asking for 2 hours
#SBATCH --time=2:0:0
# We want only one node because we've got a simple multi-threaded program
#SBATCH --nodes=1
#SBATCH --job-name="threaded_example"
#SBATCH --output="threaded.out"

srun --ntasks-per-node=1 ~/bin/threadedpi/calc
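
If the multithreaded program uses OpenMP, the thread count is commonly set through the OMP_NUM_THREADS environment variable before calling srun. The sketch below assumes, purely for illustration, that ~/bin/threadedpi/calc is an OpenMP program. The RUN variable is a hypothetical dry-run switch added for this example and is not a SLURM feature.

```shell
#!/bin/sh
#SBATCH --time=2:0:0
#SBATCH --nodes=1
#SBATCH --job-name="threaded_example"
#SBATCH --output="threaded.out"

# One process per node, so it may use all 16 cores x 4 hardware threads.
export OMP_NUM_THREADS=64

# RUN is a hypothetical dry-run switch: when it is unset, the srun command
# is only printed. Set RUN to an empty string to actually execute srun.
RUN=${RUN-echo}
$RUN srun --ntasks-per-node=1 ~/bin/threadedpi/calc
```

A pthreads program has no equivalent standard variable, so its thread count would have to be passed in whatever way that program expects.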

Hybrid applications: mixed MPI & multithreading

Hybrid applications are those that use both MPI and multithreading. They use MPI to communicate between nodes and some form of multithreading within nodes. This method of programming, although far less common today, is almost certainly going to be the way forward for future high performance computing.

Just as with pure MPI applications, the amount of memory that your application requires per MPI process constrains the possible values of --ntasks-per-node. For example, if your application and dataset require 4GB per multithreaded MPI process, you are limited to 1, 2 or 4 MPI processes per node. Within that limit, the appropriate value of --ntasks-per-node is application dependent and depends on how many threads the multithreaded component of the application can scale to. At the extreme, if the multithreaded component can scale to 64 threads, you can run with 1 MPI process per node. Your application may well sit somewhere in the middle and work best with 4 MPI processes per node, each running 8 (or perhaps 16) threads.

Once you have chosen the value of --ntasks-per-node, you also need to tell your application how many threads each MPI process should spawn. This may be done by setting the OMP_NUM_THREADS environment variable before calling srun, or it may be a command line argument to your application. Note that the total number of threads per node must be less than or equal to 64, so ntasks-per-node multiplied by OMP_NUM_THREADS can be at most 64.
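
The 64-thread limit per node can be checked before submitting a job. The following is a minimal sketch; max_threads_per_task and fits_on_node are hypothetical helper names used only for illustration.

```shell
#!/bin/sh
# A node offers 16 cores x 4 hardware threads = 64 threads in total.
MAX_THREADS_PER_NODE=64

# Hypothetical helper: the most threads each MPI process may spawn
# for a given --ntasks-per-node value.
max_threads_per_task() {
    echo $(( MAX_THREADS_PER_NODE / $1 ))
}

# Hypothetical helper: succeeds if tasks-per-node x threads-per-task <= 64.
# Arguments: <ntasks-per-node> <threads per MPI process>
fits_on_node() {
    [ $(( $1 * $2 )) -le "$MAX_THREADS_PER_NODE" ]
}

echo "4 tasks per node allows up to $(max_threads_per_task 4) threads each"
if fits_on_node 4 8; then
    echo "4 tasks x 8 threads fits on a node"
fi
```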

As an example, assume that we have a hybrid application that uses MPI and OpenMP, and that we know each MPI process requires 4GB of RAM because of the size of our dataset. This leaves us with the choice of 1, 2 or 4 MPI processes per node, because going higher would leave too little memory per process. Let us also assume that the multithreading in this application scales well to eight threads per MPI process. This leads us to choose 4 MPI processes per node, each using 8 threads. Below is a sample SLURM script that could be used for this case:

#!/bin/sh
# Here we're asking for one day and 12 hours
#SBATCH --time=1-12
# We want 16 nodes (256 CPU cores)
#SBATCH --nodes=16
#SBATCH --job-name="hybrid_example"
#SBATCH --output="hybrid.out"

export OMP_NUM_THREADS=8
srun --ntasks-per-node=4 ~/bin/hybridwave/calc_wave ~/data/hybridwave/wave128.in

As another example, let's take the application NAMD.... Below is a sample SLURM script that could be used for this case:

#!/bin/sh
# Here we're asking for one day
#SBATCH --time=1-0
# We want 4 nodes (64 CPU cores)
#SBATCH --nodes=4
#SBATCH --job-name="hybrid_example"
#SBATCH --output="hybrid.out"

srun --ntasks-per-node=XXX NAMD

Serial applications

You may run a single-threaded application on a single node, where it has access to the entire 16GB of RAM on that node. This may be a viable approach if your application requires a lot of memory (up to 16GB). Below is a sample SLURM script that could be used for such a case:

#!/bin/sh
# Here we're asking for one day
#SBATCH --time=1-0
# We want 1 node (16 CPU cores)
#SBATCH --nodes=1
#SBATCH --job-name="serial_example"
#SBATCH --output="singlethread.out"

srun --ntasks-per-node=1 ~/bin/serialcombiner/comb_huge ~/data/serialcombiner/huge128K.in

If you have single-threaded applications that do not need 16GB of RAM, you can still run them in the mode above, but it may not be the most efficient use of your SUs. At the moment there is no way to run multiple single-threaded programs on a single node. We are looking into this, however, and hope to have a solution in the future.