Topics dealt with here:
- Service charge for memory usage
- Telling the system how much memory you need (i.e. when to use pvmem= and when to use mem= — this is the most important part!)
- Working out how much memory you need
Service charge for memory usage on the x86 clusters
Memory in HPC systems is just as valuable (and expensive) as other vital resources such as CPU and disk space. Jobs which request a lot of memory per core can starve other cores of memory, making them unusable. Consequently, jobs which use more than a certain amount of memory are charged for that usage. Note that this charge only applies to the x86 clusters (merri and bruce) but not the Blue Gene (avoca).
The charge is incurred when your job requests an amount of memory per CPU core above a 16GB threshold. That is to say, if your job requests less than or equal to 16GB of memory per core then you will not incur the charge. However, if your job requests more than 16GB per core then a charge of 1/8 SU per additional GB, per hour, will be applied. Here is a formula which calculates the charge based on the number of cores requested, the amount of memory requested per core, and the number of hours for which the job runs:
mem_charge(cores, memory_per_core, hours) = (memory_per_core - 16GB) * cores * hours * 1/8
Here is a worked example: an x86 job requests 128 cores with 24GB per core and runs for three days:
mem_charge(128, 24, 3*24) = (24 - 16) * 128 * (3 * 24) * 1/8
= 9,216 SU
The base charge for that job is (128 * 3 * 24) = 9,216 SU, so the extra 9,216 SU doubles the cost of the job!
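The arithmetic above can be sketched as a small shell function (the function name `mem_charge` is ours for illustration, not a cluster utility):

```shell
#!/bin/sh
# mem_charge: extra SU charged for memory above the 16GB-per-core threshold.
# Arguments: cores, memory per core in GB, job duration in hours.
# Integer arithmetic, so it is exact when the excess divides evenly by 8.
mem_charge() {
    cores=$1
    mem_per_core=$2
    hours=$3
    if [ "$mem_per_core" -le 16 ]; then
        echo 0          # at or below the threshold: no extra charge
    else
        echo $(( (mem_per_core - 16) * cores * hours / 8 ))
    fi
}

mem_charge 128 24 72    # the worked example above: prints 9216
```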
How do I tell the system how much memory a job needs?
You must tell the (x86) systems if you need more than 1GByte of memory per process
Memory is a limited resource; if we tell the scheduler how much each job needs, it can fit jobs into slots on the system most efficiently. If you don't define your memory needs, the scheduler assumes you need the default figure, 1GB. That's not a lot, and many applications need more. If you actually try to use more than the scheduler thinks you should, it will kill your job. On the other hand, if you ask for (a lot) more than you need, you make it harder for the scheduler to squeeze you in and you waste resources. So it's important you get it right.
First, it's important to realize that memory can be defined in two different ways (i.e. pvmem=x and mem=y); they are not interchangeable.
- For MPI parallel jobs, you tell the system how much memory you need per process (or 'core' or 'cpu', depending on how the terms are used). That's because (good) MPI applications can spread the load across all the running processes. The job itself is almost always launched (at the end of the script) with "mpiexec". So, if your total job might need 100GB of RAM and you are using 50 processors, the number you need is 2000MBytes; let's call it 2500 to be safe, and use #PBS -l pvmem=2500mb. Here is another example -
# Here is a parallel job using 32 cores, a total of 64GBytes Ram or 2GBytes per core
#PBS -l procs=32
#PBS -l pvmem=2gb
#PBS -l walltime=10:00:00
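Putting the extract above into context, a complete MPI submission script might look like the following sketch. The executable name "my_mpi_app" is a placeholder for your own application, and any environment setup (module loads etc.) is site-specific:

```shell
#!/bin/bash
# Sketch of a complete MPI job script: 32 cores, 2GB per core (64GB total).
#PBS -l procs=32
#PBS -l pvmem=2gb
#PBS -l walltime=10:00:00

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Launch the MPI application across all allocated cores.
mpiexec ./my_mpi_app
```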
- Single cpu jobs work almost exactly the same way as the individual MPI processes above, but now, of course, you are talking about the total memory of the job. If the job still needs 100GB, it's #PBS -l pvmem=100gb, and we know that the scheduler will need to search a bit harder to find that for you. That translates to a possibly longer wait in the queue. Here is another example -
# Here is a serial or single cpu job using 4GBytes Ram
#PBS -l procs=1
#PBS -l pvmem=4gb
#PBS -l walltime=10:00:00
- SMP or multithreaded applications are a different thing altogether. These sorts of jobs must run on only one node and might use anywhere from two to eight cores. The job itself starts on just one core and is told (usually with a command line switch) how many cpus it should spread across; all processes share the one block of memory. In this model, you specify how much memory you need in total. Because this is a different model, you use a different PBS directive to define the memory: mem. Importantly, you also need to submit these jobs to a different queue, the smp queue, and you need to tell the scheduler how many cores to use on the node. Additionally, you usually need to tell the binary how many cores to use, and this had better match the number of cores you asked the scheduler for. So an extract from your PBS script may look like this (assuming the job needs 100GB and 8 processors) -
#PBS -l mem=100gb
#PBS -q smp
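A fuller sketch of the same SMP job might look like the following. The nodes=1:ppn=8 line is one common Torque way of requesting 8 cores on a single node (check your site's documentation for the exact form), and "my_threaded_app" with its -t switch is a placeholder for your own binary and its thread-count option:

```shell
#!/bin/bash
# Sketch of a complete SMP job script: 8 cores on one node, 100GB total.
#PBS -q smp
#PBS -l nodes=1:ppn=8
#PBS -l mem=100gb
#PBS -l walltime=10:00:00

cd $PBS_O_WORKDIR

# Tell the binary to use the same number of cores we asked the scheduler for.
./my_threaded_app -t 8
```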
- To summarise: pvmem is used with parallel or single CPU jobs and refers to the amount of memory allocated per processor. The alternative directive, mem, refers to the total memory allocated to the job and should be used only for SMP or multithreaded jobs, where all processors need to be on the one node and you submit to the smp queue. Please don't mix these directives; the results will almost certainly be disappointing!
How much memory do I need?
This is a much harder question in many cases. With some applications and some problems it is just a case of knowing how big a memory structure you are going to build, but more often than not some experimenting may be necessary. One approach is to run a small version of the job (or even a short version of the real thing) and see what memory is being used. As long as you expect it to remain stable throughout the job run, you can add a safety margin and away you go! While your job is running (on the x86 machines) there is a command you can use to see what's happening on the compute nodes. First you need to determine what node (or nodes) it's running on, using, for example,
checkjob [jobid]
and then use ssh to run the whatson command on that node. Suppose it's running on bruce036, for example:
ssh bruce036 whatson
Look in the column under VSZ and you will see the number of KBytes each process is actually using. Add something to be safe and express it in a pvmem=3gb style statement in your PBS script.
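That last conversion (a VSZ figure in KBytes, plus a safety margin, rounded up to whole GB for pvmem) can be sketched as a small helper. The function name and the 25% default margin are our own choices, not anything provided by the cluster:

```shell
#!/bin/sh
# suggest_pvmem: turn a VSZ figure in kBytes (as reported by whatson)
# into a pvmem request in whole GB, with a safety margin (default 25%).
suggest_pvmem() {
    vsz_kb=$1
    margin_pct=${2:-25}
    # Scale up by the margin, then round up to the next whole GB
    # (1GB = 1048576 kBytes).
    padded_kb=$(( vsz_kb * (100 + margin_pct) / 100 ))
    gb=$(( (padded_kb + 1048575) / 1048576 ))
    echo "pvmem=${gb}gb"
}

suggest_pvmem 2500000    # ~2.4GB in use -> prints pvmem=3gb
```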