Cheatsheet
PBS to SLURM
NCC is replacing the existing PBS solution with SLURM. While the two systems are very similar, the command names and the job specification syntax differ slightly.
For instance, a PBS GPU job script such as:
#!/bin/bash
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -q debug
#PBS -N HiThere
#PBS -j oe
cd $PBS_O_WORKDIR
module load cuda/8.0
./hello_world
must be updated to SLURM format:
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu
#SBATCH -p gpu-small
#SBATCH --qos=debug
#SBATCH --job-name=HiThere
source /etc/profile
module load cuda/8.0
./hello_world
Note that cd $PBS_O_WORKDIR disappears in the SLURM script because, by default, SLURM starts your job script in the directory from which it was submitted.
For almost all commands and directives there is a one-to-one mapping between PBS and SLURM, as listed in the Reference section below. A typical submit-and-monitor sequence is sketched after this paragraph.
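For example, assuming the SLURM script above has been saved as hello_gpu.slurm (an illustrative file name), the basic workflow is:
sbatch hello_gpu.slurm        # submit; prints "Submitted batch job <job_id>"
squeue -u $USER               # list your pending and running jobs
scontrol show job <job_id>    # show full details for one job
scancel <job_id>              # delete the job if required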
Partitions
In your job script, you must specify one of the following partitions:
Partition | Defaults | Limits per job | Available GPUs | Restricted to |
---|---|---|---|---|
ug-gpu-small | cpu=1,mem=2G/cpu | gpu≥1,cpu≤4,mem≤28G | 40 | undergraduates |
tpg-gpu-small | cpu=1,mem=2G/cpu | gpu≥1,cpu≤4,mem≤28G | 40 | taught PG |
res-gpu-small | cpu=1,mem=2G/cpu | gpu≥1,cpu≤4,mem≤28G | 48 | research PG & staff |
res-gpu-large | cpu=1,mem=2G/cpu | gpu≥1,cpu≤16,mem≤28G | 4 | research PG & staff |
cpu | cpu=1,mem=2G/cpu | none* | 0 | everyone |
* On the cpu partition, a job can span multiple nodes; however, there is a limit of 32 cores and 60G of memory per node.
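As an illustrative sketch (the partition, program name and resource values are examples within the limits above, not recommendations), a research GPU job might start like this:
#!/bin/bash
#SBATCH -p res-gpu-small      # partition from the table above
#SBATCH --gres=gpu:1          # at least one GPU must be requested on the gpu partitions
#SBATCH -c 4                  # at most 4 CPU cores per job on this partition
#SBATCH --mem=28G             # at most 28G of memory per job on this partition
./my_program                  # placeholder for your own executable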
QOS
In your job script, you can specify the following QOS:
QOS | Max jobs per user | Default walltime | Max walltime | Comment |
---|---|---|---|---|
debug | | 30 minutes | 2 hours | |
short | | 30 minutes | 2 days | |
long-high-prio | 1 | 30 minutes | 7 days | |
long-low-prio | | 30 minutes | 7 days | Job might be preempted |
long-cpu | | 30 minutes | 14 days | cpu partition only |
With the move to SLURM, we are experimenting with preemption. If you are using the long-low-prio QOS, please read the preemption information to avoid data loss.
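As a sketch, a long-running job that accepts possible preemption could combine a partition, the long-low-prio QOS and an explicit wall clock limit (the values and program name here are illustrative):
#!/bin/bash
#SBATCH -p res-gpu-small
#SBATCH --gres=gpu:1
#SBATCH --qos=long-low-prio   # may be preempted; checkpoint your work regularly
#SBATCH -t 7-00:00:00         # days-hh:mm:ss, within the 7 day maximum
./my_long_running_program     # placeholder for your own executable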
Reference
These instructions were heavily inspired by: https://genomedk.fogbugz.com/?W6
User commands | PBS/Torque | SLURM |
---|---|---|
Job submission | qsub <job script> | sbatch <job script> |
Job submission | qsub -q debug -l nodes=2:ppn=16 -l mem=64g <job script> | sbatch -p debug -N 2 -c 16 --mem=64g <job script> |
Job deletion | qdel <job_id> | scancel <job_id> |
Job deletion | qdel ALL | scancel -u <user> |
List jobs | qstat [-u user] | squeue [-u user] [-l for long format] |
Job status | qstat -f <job_id> | scontrol show job <job_id> |
Job hold | qhold <job_id> | scontrol hold <job_id> |
Job release | qrls <job_id> | scontrol release <job_id> |
Node status | pbsnodes -l | sinfo -N -l |
Interactive job | qsub -I -l nodes=1:ppn=1 /bin/bash | srun -N 1 -c 1 --pty /bin/bash |
X GUI | | sview |
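For instance, the interactive-job mapping in the table above translates into a command such as the following (the partition and GPU request are added for illustration; pick a partition you are entitled to use):
srun -N 1 -c 1 --gres=gpu -p res-gpu-small --pty /bin/bash   # interactive shell on one node with one core and one GPU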
Environment | PBS/Torque | SLURM |
---|---|---|
Job ID | $PBS_JOBID | $SLURM_JOBID |
Node list (entry per core) | $PBS_NODEFILE | $PBS_NODEFILE (still supported) |
Slurm node list | | $SLURM_JOB_NODELIST (new format) |
Submit directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
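A small sketch showing how these variables can be used inside a job script:
#!/bin/bash
#SBATCH --job-name=env-demo
# SLURM starts the script in the submission directory, but the path is also
# available explicitly through $SLURM_SUBMIT_DIR if you need it
echo "Job ID:            $SLURM_JOBID"
echo "Submit directory:  $SLURM_SUBMIT_DIR"
echo "Allocated nodes:   $SLURM_JOB_NODELIST"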
Job Specification | PBS/Torque | SLURM |
---|---|---|
Script directive | #PBS | #SBATCH |
Queue | -q <queue> | -p <partition> |
Node count | -l nodes=<count> | -N <min[-max]> |
Cores (CPUs) per node | -l ppn=<count> | -c <count> |
Memory size | -l mem=16384 | --mem=16g OR --mem-per-cpu=2g |
Wall clock limit | -l walltime=<hh:mm:ss> | -t <days-hh:mm:ss> |
GPU count | -l gpus=X | --gres=gpu:X |
Pascal GPU count | -l gpus=X:PASCAL | --gres=gpu:pascal:X |
Kepler GPU count | -l gpus=X:KEPLER | --gres=gpu:kepler:X |
Standard output file | -o <file_name> | -o <file_name> |
Standard error file | -e <file_name> | -e <file_name> |
Combine stdout/err | -j oe | (use -o without -e) [standard behaviour] |
Direct output to directory | -o <directory> | -o "directory/slurm-%j.out" |
Event notification | -m abe | --mail-type=[BEGIN, END, FAIL, REQUEUE, or ALL] |
Email address | -M <address> | --mail-user=<address> |
Job name | -N <name> | --job-name=<name> |
Node sharing | only for same user | for all users if not --exclusive |
Node sharing | | --exclusive OR --shared |
Job dependency | -W depend=afterok:<jobid> | --dependency=afterok:<jobid> |
Node preference | | --nodelist=<nodes> AND/OR --exclude=<nodes> |
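Putting several of these directives together, a hedged example (the job name, output path, email address and program name are placeholders) might look like:
#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH -o results/slurm-%j.out           # %j expands to the job ID; the results/ directory should already exist
#SBATCH -t 0-02:00:00                     # 2 hour wall clock limit, days-hh:mm:ss
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=first.last@example.com
./analyse_data                            # placeholder for your own executable
A dependent job can then be chained at submission time with, for example, sbatch --dependency=afterok:<jobid> next_step.slurm (next_step.slurm being a second, hypothetical job script).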