Cheatsheet
PBS to SLURM
NCC is replacing the existing PBS solution with SLURM. While the two systems are very similar, the command names and the job specification syntax differ slightly.
For instance, a PBS GPU job script such as:
#!/bin/bash
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -q debug
#PBS -N HiThere
#PBS -j oe
cd $PBS_O_WORKDIR
module load cuda/8.0
./hello_world
must be updated to SLURM format:
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu
#SBATCH -p gpu-small
#SBATCH --qos=debug
#SBATCH --job-name=HiThere
source /etc/profile
module load cuda/8.0
./hello_world
Note that cd $PBS_O_WORKDIR disappears in the SLURM script because, by default, SLURM starts your job script in the directory from which it was submitted.
For almost all commands and directives there is a one-to-one mapping between PBS and SLURM, as listed in the Reference section below. A typical submit-and-monitor sequence is sketched after this paragraph.
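For example, assuming the SLURM script above has been saved as hello_gpu.slurm (an illustrative file name), the basic workflow is:
sbatch hello_gpu.slurm        # submit; prints "Submitted batch job <job_id>"
squeue -u $USER               # list your pending and running jobs
scontrol show job <job_id>    # show full details for one job
scancel <job_id>              # delete the job if required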
Partitions
In your job script, you must specify one of the following partitions:
Partition | Defaults | Limits per job | Available GPUs | Restricted to |
---|---|---|---|---|
ug-gpu-small | cpu=1,mem=2G/cpu | gpu≥1,cpu≤4,mem≤28G | 40 | undergraduates |
tpg-gpu-small | cpu=1,mem=2G/cpu | gpu≥1,cpu≤4,mem≤28G | 40 | taught PG |
res-gpu-small | cpu=1,mem=2G/cpu | gpu≥1,cpu≤4,mem≤28G | 48 | research PG & staff |
res-gpu-large | cpu=1,mem=2G/cpu | gpu≥1,cpu≤16,mem≤28G | 4 | research PG & staff |
cpu | cpu=1,mem=2G/cpu | none* | 0 | everyone |
* On the cpu partition, a job can span multiple nodes; however, there is a limit of 32 cores and 60G of memory per node.
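As an illustrative sketch (the partition, program name and resource values are examples within the limits above, not recommendations), a research GPU job might start like this:
#!/bin/bash
#SBATCH -p res-gpu-small      # partition from the table above
#SBATCH --gres=gpu:1          # at least one GPU must be requested on the gpu partitions
#SBATCH -c 4                  # at most 4 CPU cores per job on this partition
#SBATCH --mem=28G             # at most 28G of memory per job on this partition
./my_program                  # placeholder for your own executable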
QOS
In your job script, you can specify the following QOS:
QOS | Max jobs per user | Default walltime | Max walltime | Comment |
---|---|---|---|---|
debug | | 30 minutes | 2 hours | |
short | | 30 minutes | 2 days | |
long-high-prio | 1 | 30 minutes | 7 days | |
long-low-prio | | 30 minutes | 7 days | Job might be preempted |
long-cpu | | 30 minutes | 14 days | cpu partition only |
With the move to SLURM, we are experimenting with preemption. If you are using the long-low-prio QOS, please read the preemption information to avoid data loss.
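As a sketch, a long-running job that accepts possible preemption could combine a partition, the long-low-prio QOS and an explicit wall clock limit (the values and program name here are illustrative):
#!/bin/bash
#SBATCH -p res-gpu-small
#SBATCH --gres=gpu:1
#SBATCH --qos=long-low-prio   # may be preempted; checkpoint your work regularly
#SBATCH -t 7-00:00:00         # days-hh:mm:ss, within the 7 day maximum
./my_long_running_program     # placeholder for your own executable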
Reference
These instructions were heavily inspired by: https://genomedk.fogbugz.com/?W6
User commands | PBS/Torque | SLURM |
---|---|---|
Job submission | qsub <job script> | sbatch <job script> |
Job submission | qsub -q debug -l nodes=2:ppn=16 -l mem=64g <job script> | sbatch -p debug -N 2 -c 16 --mem=64g <job script> |
Job deletion | qdel <job_id> | scancel <job_id> |
Job deletion | qdel ALL | scancel -u <user> |
List jobs | qstat [-u user] | squeue [-u user] [-l for long format] |
Job status | qstat -f <job_id> | scontrol show job <job_id> |
Job hold | qhold <job_id> | scontrol hold <job_id> |
Job release | qrls <job_id> | scontrol release <job_id> |
Node status | pbsnodes -l | sinfo -N -l |
Interactive job | qsub -I -l nodes=1:ppn=1 /bin/bash | srun -N 1 -c 1 --pty /bin/bash |
X GUI | | sview |
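For instance, the interactive-job mapping in the table above translates into a command such as the following (the partition and GPU request are added for illustration; pick a partition you are entitled to use):
srun -N 1 -c 1 --gres=gpu -p res-gpu-small --pty /bin/bash   # interactive shell on one node with one core and one GPU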
Environment | PBS/Torque | SLURM |
---|---|---|
Job ID | $PBS_JOBID | $SLURM_JOBID |
Node list (entry per core) | $PBS_NODEFILE | $PBS_NODEFILE (still supported) |
Slurm node list | | $SLURM_JOB_NODELIST (new format) |
Submit directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
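A small sketch showing how these variables can be used inside a job script:
#!/bin/bash
#SBATCH --job-name=env-demo
# SLURM starts the script in the submission directory, but the path is also
# available explicitly through $SLURM_SUBMIT_DIR if you need it
echo "Job ID:            $SLURM_JOBID"
echo "Submit directory:  $SLURM_SUBMIT_DIR"
echo "Allocated nodes:   $SLURM_JOB_NODELIST"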
Job Specification | PBS/Torque | SLURM |
---|---|---|
Script directive | #PBS | #SBATCH |
Queue | -q <queue> | -p <partition> |
Node count | -l nodes=<count> | -N <min[-max]> |
Cores (CPUs) per node | -l ppn=<count> | -c <count> |
Memory size | -l mem=16384 | --mem=16g OR --mem-per-cpu=2g |
Wall clock limit | -l walltime=<hh:mm:ss> | -t <days-hh:mm:ss> |
GPU count | -l gpus=X | --gres=gpu:X |
Pascal GPU count | -l gpus=X:PASCAL | --gres=gpu:pascal:X |
Kepler GPU count | -l gpus=X:KEPLER | --gres=gpu:kepler:X |
Standard output file | -o <file_name> | -o <file_name> |
Standard error file | -e <file_name> | -e <file_name> |
Combine stdout/err | -j oe | (use -o without -e) [standard behaviour] |
Direct output to directory | -o <directory> | -o "directory/slurm-%j.out" |
Event notification | -m abe | --mail-type=[BEGIN, END, FAIL, REQUEUE, or ALL] |
Email address | -M <address> | --mail-user=<address> |
Job name | -N <name> | --job-name=<name> |
Node sharing | only for same user | for all users if not --exclusive |
Node sharing | | --exclusive OR --shared |
Job dependency | -W depend=afterok:<jobid> | --dependency=afterok:<jobid> |
Node preference | | --nodelist=<nodes> AND/OR --exclude=<nodes> |
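Putting several of these directives together, a hedged example (the job name, output path, email address and program name are placeholders) might look like:
#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH -o results/slurm-%j.out           # %j expands to the job ID; the results/ directory should already exist
#SBATCH -t 0-02:00:00                     # 2 hour wall clock limit, days-hh:mm:ss
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=first.last@example.com
./analyse_data                            # placeholder for your own executable
A dependent job can then be chained at submission time with, for example, sbatch --dependency=afterok:<jobid> next_step.slurm (next_step.slurm being a second, hypothetical job script).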