The x86 queueing systems
Background
To do work on Bruce and Merri you have to submit jobs to a queue. If resources are available then your job should run shortly afterwards, as long as you have quota available on your project. Whilst your jobs are running, the processors and memory you requested are dedicated to you for the amount of time you requested. Most of the time you will use a short script, traditionally called a pbs-script (and often so named), that contains many of these settings.
To submit the resource requests to the queue, the 'qsub' command is used. In conjunction with the pbs-script, the command is of the following form:
qsub [command line options] pbs-script
To view the job queue, either of the following commands can be used:
qstat
or
showq
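As a minimal sketch of how these commands fit together, the helper below submits a script and then lists your own jobs. The function name submit_and_list is hypothetical, and the sketch assumes that qsub prints the new job's identifier on standard output and that qstat accepts -u to filter by user.

```shell
# Submit a job script, report the job identifier, then list your own jobs.
# Assumption: qsub prints the identifier of the new job on standard output.
submit_and_list() {
    local job_id
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found: run this on a login node" >&2
        return 1
    fi
    # All arguments are passed straight through to qsub.
    job_id="$(qsub "$@")" || return 1
    echo "Submitted job ${job_id}"
    qstat -u "${USER:-$(id -un)}"
}

# Usage:
#   submit_and_list -l walltime=00:10:00 pbs-script
```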
Job script generator
To simplify the task of writing job submission scripts we provide an interactive job script generator.
Summary
Broadly speaking, you are likely to be running one of three sorts of jobs: single CPU, SMP (or multithreaded), or MPI parallel. Some scheduler directives belong in some types of job script but not others. While a number of extra directives are possible, here are some simple examples.
Single CPU
#!/bin/bash
#PBS -l procs=1
#PBS -l walltime=01:00:00
#PBS -l pvmem=1gb
cd $PBS_O_WORKDIR
my-app
SMP Job (also called multithreaded, OpenMP)
#!/bin/bash
#PBS -l walltime=01:00:00
#PBS -q smp
#PBS -l mem=22gb
cd $PBS_O_WORKDIR
my-app -n8
MPI Parallel Job
#!/bin/bash
#PBS -l procs=16
#PBS -l walltime=01:00:00
#PBS -l pvmem=1gb
cd $PBS_O_WORKDIR
mpiexec my-MPI-app
In all of the above, you may need to add some optional commands. Here are some that are commonly used:
| Option | Example | Notes |
|---|---|---|
| Set up the environment for the application | module load namd | Must come after all the #PBS directives |
| Name your job | #PBS -N MyJob | Can make keeping track of a number of jobs easier. If you leave it out, the scheduler will give your job a number. |
| Send emails advising of progress | #PBS -M name@example.com and #PBS -m abe (on separate lines) | -M determines where the email is sent; -m determines which events trigger an email: 'a' for an aborted job, 'b' when it begins to run and 'e' when it finishes. If you omit the -M option the email will be sent to the address that is registered with your VLSCI account (note: this may not work at other computing facilities). |
| Set the memory needs | | You can leave out the mem or pvmem directives, but the defaults are quite small and may not be appropriate for your job. There is a whole page devoted to telling the scheduler about your memory needs. |
| Wall time | #PBS -l walltime=01:00:00 | The format is hh:mm:ss and, sadly, you do need two digits in both the minutes and seconds fields. Set a time that you are reasonably sure will be enough for your job to run to completion; the scheduler will kill your job if it is still running when this time expires. If you leave this directive out, the scheduler will assume sixty minutes, so you almost certainly need it. |
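To show how the optional directives combine with the basic ones, here is a rough sketch of a function that prints a single CPU or MPI job script. It is not the site's job script generator; the function name make_job_script and its argument order are hypothetical, chosen only for illustration.

```shell
# Print a single CPU or MPI job script to standard output.
# Arguments: job name, processor cores, walltime (hh:mm:ss), pvmem, command...
make_job_script() {
    local name="$1" procs="$2" walltime="$3" pvmem="$4"
    shift 4
    # Directives come first; the command to run comes after them.
    cat <<EOF
#!/bin/bash
#PBS -N ${name}
#PBS -l procs=${procs}
#PBS -l walltime=${walltime}
#PBS -l pvmem=${pvmem}
#PBS -m abe
cd \$PBS_O_WORKDIR
$*
EOF
}

# Example: an MPI job on 16 cores for up to one hour.
make_job_script MyJob 16 01:00:00 1gb mpiexec my-MPI-app
```

Redirect the output to a file and pass that file to qsub. An SMP job would need a different template, since it uses the smp queue and mem rather than procs and pvmem.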
Special notes about SMP (or OpenMP or Multithreaded jobs)
| Issue | Notes |
|---|---|
| Don't use procs= or nodes= | SMP jobs always get all 8 cores on the node assigned. |
| #PBS -l mem=23gb | Don't use pvmem=; it is only for MPI or single core jobs. In practice, suitable values to assign to mem are 23, 43 and 143 gb. Other numbers are not really useful as our nodes have 24, 48 or 144 Gbytes. |
| Setting number of threads | Most OpenMP jobs can work out how many cores are assigned and use them. If you need to set it, you typically set the OMP_NUM_THREADS environment variable or pass an option such as -n 8 to the application. |
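If your application does not detect the core count itself, a minimal sketch for setting the thread count in a job script could look like this. The helper name set_omp_threads is hypothetical; the fallback of 8 assumes the 8 cores per SMP node stated in the table above.

```shell
# Set OMP_NUM_THREADS from an explicit argument, else keep any value already
# in the environment, else fall back to 8 (the cores on an SMP node).
set_omp_threads() {
    export OMP_NUM_THREADS="${1:-${OMP_NUM_THREADS:-8}}"
}

set_omp_threads
echo "Running with ${OMP_NUM_THREADS} threads"
# ./my-app    # hypothetical OpenMP program; uncomment in a real job script
```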
Further details of resource requests
As part of submitting a job you will need to specify the amount of various resources you will need. Whilst there are default values you will mostly want to pick appropriate ones for your jobs!
Walltime
| Default value | 1 hour |
|---|---|
| Maximum value | 720 hours (30 days) |
This tells the queuing system the maximum length of time your job will run for, as measured by your watch, and is specified in the form hours:minutes:seconds.
Through this you are making a bargain with the queuing system: you tell it that the job will last no longer than this amount of time, and the queuing system can then use the job to fill gaps in the schedule of that length, thus letting you jump the queue. The trade-off is that, should your job overrun this time, it will be killed by the queuing system to make sure the schedule is not disrupted, so it is very important to always build a margin of error into your estimate.
If you do find you have a job that is going to overrun through circumstances beyond your control please email help@vlsci.unimelb.edu.au and we will extend the job for you as long as we can get to it before it expires!
Examples:
| A ten minute test job | walltime=00:10:00 |
|---|---|
| A 15 hour job | walltime=15:00:00 |
| A two day job | walltime=48:00:00 |
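The conversion from a run time estimate to a walltime string, including a margin of error, can be sketched as follows. Both function names are hypothetical, and the margin percentage is whatever you judge suitable for your job.

```shell
# Convert a number of minutes into hh:mm:ss with two-digit fields.
walltime_from_minutes() {
    printf '%02d:%02d:00\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# Add a safety margin (in percent) to an estimated run time in minutes.
walltime_with_margin() {
    walltime_from_minutes "$(( $1 * (100 + $2) / 100 ))"
}

walltime_from_minutes 10       # a ten minute test job   -> 00:10:00
walltime_from_minutes 900      # a 15 hour job           -> 15:00:00
walltime_from_minutes 2880     # a two day job           -> 48:00:00
walltime_with_margin 600 20    # 10 hours plus 20%       -> 12:00:00
```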
Processors
| Default value | 1 core |
|---|---|
| Maximum value | 512 cores (more on request) |
| Maximum value | 320 cores (more on request) |
If you are only using a single-threaded (single processor) application then the default of a single CPU will be fine. However, people running parallel applications (either MPI or SMP/OpenMP) will need to adjust this to suit their needs. An MPI application can scale over the entire machine, whereas an SMP/OpenMP application can only run on a single node but can use multiple processors on that node.
- To request multiple CPUs for an MPI application use the procs request to specify the number of processor cores you want.
- To request a particular layout of those CPUs use the (separate) tpn request to say how many tasks per node are required.
- For a pure SMP program please submit a request to the "smp" queue without specifying any procs.
Examples:
| A single CPU job | procs=1 |
|---|---|
| An MPI job using 256 CPUs across the cluster | procs=256 |
| An MPI application using all the CPUs on 8 nodes | procs=64,tpn=8 |
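The arithmetic behind a layout request is simple: the number of nodes is the processor count divided by the tasks per node. A small sketch, with hypothetical function names, might be:

```shell
# Number of nodes implied by a procs/tpn layout, rounding up.
# Arguments: total processor cores, tasks per node.
nodes_for_layout() {
    echo "$(( ($1 + $2 - 1) / $2 ))"
}

# Print the matching resource request for use with '#PBS -l' or 'qsub -l'.
layout_request() {
    echo "procs=$1,tpn=$2"
}

echo "$(layout_request 64 8) uses $(nodes_for_layout 64 8) nodes"
echo "$(layout_request 256 8) uses $(nodes_for_layout 256 8) nodes"
```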
Caveats
- If you request a particular layout and the system is busy then you may need to wait longer for that configuration to appear than if you just specify the processors you want. Normal MPI programs are quite happy without a fixed layout.
- We do not recommend using the old nodes requests for CPUs and layouts as they are ambiguous.
Memory
| Default value | 1024MB (1GB) |
|---|---|
| Maximum value (Bruce) | 146944MB (143.5GB) |
| Maximum value (Merri) | 1024000MB (1000GB) |
To specify the amount of memory you need per process you need to specify the pvmem resource, either in megabytes (mb) or gigabytes (gb). The queuing system will set some kernel limits for your process that will mean it won't be able to allocate more memory than requested here so it is important to ensure this limit is large enough for the job.
It is important to realise that the nodes on Bruce and Merri have various memory sizes:
- Bruce has:
  - 110 nodes with 8 cores and 24GB (3GB per core)
  - 28 nodes with 8 cores and 48GB (6GB per core)
  - 6 nodes with 8 cores and 144GB (18GB per core)
- Merri has:
  - 44 nodes with 8 cores and 48GB (6GB per core)
  - 36 nodes with 8 cores and 96GB (12GB per core)
  - 3 nodes with 16 cores and 1024GB (64GB per core)
If you request too much memory per processor for too many nodes your job may not be able to run!
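To see why a large per-process request limits where a job can run, here is a rough sketch that estimates how many processes fit on each Bruce node type. The function name procs_per_node is hypothetical, and the sketch assumes the whole of a node's memory is available to jobs, which overestimates slightly because the system itself needs some.

```shell
# Largest number of processes that fit on one node.
# Arguments: node memory in GB, cores per node, pvmem in GB.
# Assumption: all of the node's memory is usable by jobs.
procs_per_node() {
    local fit=$(( $1 / $3 ))
    # A node can never host more processes than it has cores.
    if [ "$fit" -gt "$2" ]; then
        fit=$2
    fi
    echo "$fit"
}

# Bruce node types: 24GB, 48GB and 144GB, each with 8 cores.
for node_gb in 24 48 144; do
    echo "${node_gb}GB node: $(procs_per_node "$node_gb" 8 5) processes at pvmem=5gb"
done
```

With pvmem=5gb only four processes fit on a 24GB node, so half of its cores sit idle and a large job has fewer places it can run.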
Examples:
| A job needing 2GB of RAM per process | pvmem=2gb |
|---|---|
| A job needing 32GB of RAM per process | pvmem=32gb |
Caveat
- Do not specify pvmem for SMP jobs; please use the mem attribute as specified below.
SMP applications
SMP applications use only a single node, but all the processors on the node. Because of this we have a separate method of submitting them as they have to request memory differently.
To submit an SMP job you need to submit it to the "smp" queue. If your job needs less than 24GB of memory in total then you do not need to specify how much memory it needs in the script. However, if your job needs 24GB of memory or more, then you should specify how much you need in total using the "mem" attribute, as shown below.
#PBS -q smp
#PBS -l mem=40gb
or on the unix command line as:
$ qsub -q smp -l mem=40gb
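The rule above (request mem only when the job needs 24GB or more in total) can be sketched as a small helper that builds the qsub options. The function name smp_qsub_options is hypothetical.

```shell
# Print the qsub options for an SMP job given its total memory need in GB.
# Below 24GB the mem attribute is left out, as described above.
smp_qsub_options() {
    if [ "$1" -lt 24 ]; then
        echo "-q smp"
    else
        echo "-q smp -l mem=${1}gb"
    fi
}

smp_qsub_options 16    # prints: -q smp
smp_qsub_options 40    # prints: -q smp -l mem=40gb

# Usage (the substitution is deliberately unquoted so the options split):
#   qsub $(smp_qsub_options 40) pbs-script
```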
Job Scripts
Quick Example
Here is an example (it asks for 16 cores or processes, will run for up to one hour, needs 1 Gbyte of RAM per process, is named MyTestJob and runs, under MPI, an application called My-MPI-app):
#!/bin/bash
#PBS -l procs=16
#PBS -l walltime=01:00:00
#PBS -l pvmem=1gb
#PBS -N MyTestJob
cd $PBS_O_WORKDIR
module load app-name/version
mpiexec My-MPI-app
Annotated Script
Here is an annotated script with an explanation of the common options. Please copy, paste and edit to suit your needs.
#!/bin/bash
## Use bash shell so environment is set correctly when logging in to nodes.
## A PBS script is just a normal script that has the special keyword #PBS at
## the beginning of a line that sets PBS parameters.
## Anything after #PBS is a parameter that is passed to the 'qsub' command.
## The command line version of the parameter overrides the value in the script.
## E.g. #PBS -l walltime=100:00:00 is overridden by 'qsub -l walltime=1:00:00'
## An exception is when running in interactive mode, e.g. 'qsub -I'.
## In this case, the script #PBS parameters are ignored, so parameters
## must be given on the command line.
## To comment out #PBS parameters, just change the first 4 characters of the
## line to something other than #PBS. Generally it is easiest to insert a
## space, e.g. #PBS --> # PBS.
## Below are some common parameters.
## Default values are set (via '#PBS' ...), variants are provided in the commented out form ('# PBS' ...).
# To give your job a name, replace "MyJob" with an appropriate name
#PBS -N MyJob
# Select the number of processors.
# For Serial Jobs (the default setting):
#PBS -l procs=1
# For Parallel Jobs: i.e. to reserve 16 processes on any nodes.
# PBS -l procs=16
# For Parallel Jobs with a set number of tasks per node (tpn).
# (There are 8 processing cores per node).
# PBS -l procs=16,tpn=8
# For a symmetrical multiprocessing (SMP) job,
# there is a dedicated queue that takes care of the settings.
# PBS -q smp
# Memory requirements
# For non-SMP jobs: use memory per process (not the total memory)
# E.g. for 1 Gbyte per process (the default amount)
#PBS -l pvmem=1gb
# For SMP jobs: use the total memory required (default 24 Gbyte)
# PBS -l mem=24gb
# Set your maximum acceptable walltime=hours:minutes:seconds
# Default 1 hour
#PBS -l walltime=1:00:00
# To receive an email:
# - job is aborted: 'a'
# - job begins execution: 'b'
# - job terminates: 'e'
# Note: you may want to specify a -M option to provide an email address.
# PBS -m abe
## Rest of the script
# On the nodes, change the running script directory to your execution directory
# (Leave as is)
cd $PBS_O_WORKDIR
# Load modules.
# Do this in the script so paths are set correctly on each node.
# E.g. load (set paths for) gcc and gcc compiled open-mpi
module load gcc openmpi-gcc
# Command to run a job, either MPI or serial:
# For MPI it is mpiexec.
# For a serial or single CPU app, only the app name and any options are needed.
# Usage: mpiexec ./program
# Usage: ./program
# E.g. get the host name of each node of each process.
# Output will be stored in job_name.osequence (unless you've otherwise redirected it)
mpiexec hostname
By default this will create a file MyJob.oXXXX (where XXXX is the job's sequence number). With the default single processor, this file will contain a single name, that of the node the script ran on, e.g. bruce036.
To test on more processors, try the following (replace 'script-name' with the name you gave the above script):
qsub -l procs=20 script-name
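Because options on the command line override the values in the script, the same script can be submitted at several processor counts without editing it. A minimal sketch, with the hypothetical function name submit_sweep, could be:

```shell
# Submit the same script once per processor count. The -l option given on
# the command line overrides the procs value set inside the script.
# Arguments: script name, then one or more processor counts.
submit_sweep() {
    local script="$1" n
    shift
    for n in "$@"; do
        qsub -l procs="$n" "$script" || return 1
    done
}

# Usage:
#   submit_sweep script-name 8 16 32
```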