Cheatsheet

PBS to SLURM

NCC is replacing the existing PBS solution with SLURM. While the two systems are very similar, the names of the commands and the job specification syntax are slightly different.

For instance, a PBS GPU job script such as:

#!/bin/bash
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -q debug
#PBS -N HiThere
#PBS -j oe

cd $PBS_O_WORKDIR
module load cuda/8.0

./hello_world

must be updated to SLURM format:

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu
#SBATCH -p gpu-small
#SBATCH --qos=debug
#SBATCH --job-name=HiThere

source /etc/profile
module load cuda/8.0

./hello_world

Note that cd $PBS_O_WORKDIR disappears in the SLURM script because, by default, SLURM starts your job script in the directory from which it was submitted.

For almost all commands and directives there is a one-to-one mapping between PBS and SLURM, as listed in the Reference below.

Partitions

In your job script, you must specify one of the following partitions:

Partition     | Defaults          | Limits per job         | Available GPUs | Restricted to
ug-gpu-small  | cpu=1, mem=2G/cpu | gpu≥1, cpu≤4, mem≤28G  | 40             | undergraduates
tpg-gpu-small | cpu=1, mem=2G/cpu | gpu≥1, cpu≤4, mem≤28G  | 40             | taught PG
res-gpu-small | cpu=1, mem=2G/cpu | gpu≥1, cpu≤4, mem≤28G  | 48             | research PG & staff
res-gpu-large | cpu=1, mem=2G/cpu | gpu≥1, cpu≤16, mem≤28G | 4              | research PG & staff
cpu           | cpu=1, mem=2G/cpu | none*                  | 0              | everyone

* On the CPU partition, a job can span multiple nodes; however, there is a limit of 32 cores and 60G of memory per node.
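As an illustration, the directives below sketch a job asking for two GPUs and four CPU cores on the res-gpu-small partition. The module and program names are carried over from the example above; the exact values are only an example, not a recommendation.

#!/bin/bash
#SBATCH -p res-gpu-small   # partition from the table above
#SBATCH --gres=gpu:2       # two GPUs
#SBATCH -c 4               # four CPU cores (the per-job maximum on this partition)
#SBATCH --mem=8G           # 8G of memory, within the 28G per-job limit

source /etc/profile
module load cuda/8.0

./hello_world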

QOS

In your job script, you can specify the following QOS:

QOS            | Max jobs per user | Default walltime | Max walltime | Comment
debug          |                   | 30 minutes       | 2 hours      |
short          |                   | 30 minutes       | 2 days       |
long-high-prio | 1                 | 30 minutes       | 7 days       |
long-low-prio  |                   | 30 minutes       | 7 days       | Job might be preempted
long-cpu       |                   | 30 minutes       | 14 days      | cpu partition only

With the move to SLURM, we are experimenting with preemption. Please read the preemption information if you are using the long-low-prio QOS to avoid data loss.
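For example, a job that needs more than the two-day limit of the short QOS could request long-low-prio together with an explicit walltime. The sketch below is purely illustrative and reuses the names from the earlier examples; remember that such a job may be preempted.

#!/bin/bash
#SBATCH -p res-gpu-small
#SBATCH --qos=long-low-prio   # up to 7 days, but the job may be preempted
#SBATCH -t 5-00:00:00         # ask for 5 days of walltime explicitly
#SBATCH --gres=gpu

source /etc/profile
module load cuda/8.0

./hello_world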

Reference

These instructions were heavily inspired by: https://genomedk.fogbugz.com/?W6

User commands   | PBS/Torque                                              | SLURM
Job submission  | qsub <job script>                                       | sbatch <job script>
Job submission  | qsub -q debug -l nodes=2:ppn=16 -l mem=64g <job script> | sbatch -p debug -N 2 -c 16 --mem=64g <job script>
Job deletion    | qdel <job_id>                                           | scancel <job_id>
Job deletion    | qdel ALL                                                | scancel -u <user>
List jobs       | qstat [-u user]                                         | squeue [-u user] [-l for long format]
Job status      | qstat -f <job_id>                                       | scontrol show job <job_id>
Job hold        | qhold <job_id>                                          | scontrol hold <job_id>
Job release     | qrls <job_id>                                           | scontrol release <job_id>
Node status     | pbsnodes -l                                             | sinfo -N -l
Interactive job | qsub -I -l nodes=1:ppn=1 /bin/bash                      | srun -N 1 -c 1 --pty /bin/bash
X GUI           |                                                         | sview
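The SLURM commands above might be used together roughly as follows; myjob.slurm and the job ID 12345 are made-up placeholders.

# submit a batch job; sbatch prints the job ID on success
sbatch myjob.slurm
# -> Submitted batch job 12345

# list your own jobs in the queue
squeue -u $USER

# inspect, hold, release or cancel a specific job
scontrol show job 12345
scontrol hold 12345
scontrol release 12345
scancel 12345

# start an interactive shell on a single core
srun -N 1 -c 1 --pty /bin/bash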
Environment                | PBS/Torque     | SLURM
Job ID                     | $PBS_JOBID     | $SLURM_JOBID
Node list (entry per core) | $PBS_NODEFILE  | $PBS_NODEFILE (still supported)
Slurm node list            |                | $SLURM_JOB_NODELIST (new format)
Submit directory           | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR
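A minimal sketch of how these variables could be inspected from inside a job script (the partition choice and output wording are illustrative):

#!/bin/bash
#SBATCH -p cpu
#SBATCH -N 2

# SLURM starts the script in the submission directory, so these two should match
echo "Submitted from:  $SLURM_SUBMIT_DIR"
echo "Running in:      $(pwd)"

echo "Job ID:          $SLURM_JOBID"
echo "Allocated nodes: $SLURM_JOB_NODELIST"   # compact SLURM format, e.g. node[01-02]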
Job Specification          | PBS/Torque                | SLURM
Script directive           | #PBS                      | #SBATCH
Queue                      | -q <queue>                | -p <partition>
Node count                 | -l nodes=<count>          | -N <min[-max]>
Cores (CPUs) per node      | -l ppn=<count>            | -c <count>
Memory size                | -l mem=16384              | --mem=16g OR --mem-per-cpu=2g
Wall clock limit           | -l walltime=<hh:mm:ss>    | -t <days-hh:mm:ss>
GPU count                  | -l gpus=X                 | --gres=gpu:X
Pascal GPU count           | -l gpus=X:PASCAL          | --gres=gpu:pascal:X
Kepler GPU count           | -l gpus=X:KEPLER          | --gres=gpu:kepler:X
Standard output file       | -o <file_name>            | -o <file_name>
Standard error file        | -e <file_name>            | -e <file_name>
Combine stdout/err         | -j oe                     | (use -o without -e) [standard behaviour]
Direct output to directory | -o <directory>            | -o "directory/slurm-%j.out"
Event notification         | -m abe                    | --mail-type=[BEGIN, END, FAIL, REQUEUE, or ALL]
Email address              | -M <address>              | --mail-user=<address>
Job name                   | -N <name>                 | --job-name=<name>
Node sharing               | only for same user        | for all users if not --exclusive
Node sharing               |                           | --exclusive OR --shared
Job dependency             | -W depend=afterok:<jobid> | --dependency=afterok:<jobid>
Node preference            |                           | --nodelist=<nodes> AND/OR --exclude=<nodes>
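Putting several of these directives together, a fuller GPU job script might look like the sketch below. The output path, email address, walltime and resource values are placeholders chosen for illustration only; the module and program names are reused from the earlier examples.

#!/bin/bash
#SBATCH --job-name=HiThere
#SBATCH -p res-gpu-small
#SBATCH --qos=short
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=16g                  # or --mem-per-cpu=4g
#SBATCH --gres=gpu:pascal:1        # one Pascal GPU
#SBATCH -t 0-06:00:00              # 6 hours of walltime
#SBATCH -o "results/slurm-%j.out"  # stdout and stderr are combined here
#SBATCH --mail-type=ALL
#SBATCH --mail-user=you@example.com

source /etc/profile
module load cuda/8.0

./hello_world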