The NVIDIA CUDA Centre (NCC) GPU system is a shared computing facility provided by the Department of Computer Science. We primarily support staff and students in Computer Science; however, we will consider requests for access from staff (and students with staff support) in other departments. Support requests should be sent to [email protected], and again we will prioritise support for Computer Science. NCC now serves two purposes: first and foremost it is a research GPU cluster, and it also serves as a teaching resource by offering Jupyter notebooks. If you are only using Jupyter, most of this guide is irrelevant to you, so skip down to the section on Jupyter notebooks.
The system relies on the SLURM scheduling system and all activity must go through SLURM. If you don’t use SLURM to run your code, your account will be blocked by an administrator. This guide explains how to use SLURM on NCC. Please note that this system does not come with large persistent storage: files should be transferred in when starting a job and out again once the job has finished. The file store is not backed up. Some of the compute servers are currently not on UPS (all head nodes and storage are), so we recommend that you implement some kind of regular (e.g. hourly) checkpoint/resume mechanism in any long job.
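As a minimal sketch of the checkpoint/resume idea recommended above, the loop below records the last completed step in a file and picks up from there when the script is started again. The file name `checkpoint.txt` and the five-step workload are hypothetical placeholders; a real job would save whatever state its own program needs.

```shell
#!/bin/bash
# Minimal checkpoint/resume loop. The file name and step count are
# hypothetical; adapt them to whatever state your own program saves.
CHECKPOINT="checkpoint.txt"
TOTAL_STEPS=5

# Resume after the last completed step if a checkpoint exists.
last_done=0
if [ -f "$CHECKPOINT" ]; then
    last_done=$(cat "$CHECKPOINT")
fi

for ((step = last_done + 1; step <= TOTAL_STEPS; step++)); do
    echo "running step $step"   # replace with one unit of real work

    # Write to a temporary file and rename it, so an interruption
    # never leaves a half-written checkpoint behind.
    echo "$step" > "$CHECKPOINT.tmp"
    mv "$CHECKPOINT.tmp" "$CHECKPOINT"
done
```

If the script is interrupted and run again, steps that were already recorded in the checkpoint file are skipped.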
We recently held a workshop on how to use NCC. The presentation slides are available here for reference (note that the slides are only visible on the university network).
The system is comprised of two head nodes, twelve GPU compute servers and six CPU blades. In total, we have 82 GPUs across the GPU compute servers and 246 CPU threads available on the CPU blades.
Requesting Access to the System
In order to connect to the system, you must have an account. If you do not already have an account, you can request one by email to [email protected]. When requesting an account, please provide:
- your name
- your CIS username
- your email address
- supervisor name
- your status (e.g. L3 student, MSc taught student, MSc research student, PhD student, visitor, staff).
If you are a member of staff or student not in Computer Science, please provide rationale as to the nature of your research needs. Such students will also need support of their supervisor in their application, which can be provided by copying the supervisor into the account request email.
If you publish data from or research made possible through NCC, we would appreciate the following acknowledgement:
- This work has used Durham University’s NCC cluster. NCC has been purchased through Durham University’s strategic investment funds, and is installed and maintained by the Department of Computer Science.
We are hosting a Jupyterhub server on ncc1, which will be used for teaching in a variety of modules. If you are a student, your lecturers will inform you if they are using Jupyter notebooks, and will provide you with a URL for that module if appropriate. In order to access the server you either need to be on the University network or using the VPN (see connection details below). Students should request VPN access and provide the details of the module and lecturer. You only need to do this once, not for every module that is using Jupyter. Note that the VPN is secured with Multi Factor Authentication. Alternatively you can access the server via a remote browser launched from AppsAnywhere.
If researchers are interested in using Jupyter notebooks for research work that is unrelated to teaching there is a test server currently running at ncc1.clients.dur.ac.uk/COMP0000 which will work with Python 3.6, C, Haskell and R. This server allows for the use of GPU backed notebooks and provides a similar level of compute to Google Colab.
It is now possible to configure your own Python environment for use with Jupyter notebooks on NCC. There is a script available in the assignments tab of the Jupyterhub landing page. Click “Assignments” and “Fetch” the released assignment. This adds a new folder to your files tab called “Create Jupyter Environment Script”, and inside that folder is a Jupyter notebook that can create your new environment and make it available within a Jupyter notebook. Follow the instructions in the provided setup script carefully, and only edit the variables that you are told to. If you are intending to use PyTorch on a GPU notebook you should ensure that you don’t request any CUDA modules in the environment creation script. PyTorch is shipped with its own CUDA libraries and loading an additional CUDA module will cause conflicts.
Connecting to the System
The system is configured with Ubuntu Linux 18.04 and is accessible via SSH through two head nodes:
ncc1.clients.dur.ac.uk: our new head node;
ncc.clients.dur.ac.uk: our old head node, restricted to research students and staff.
To connect onto the system from a Linux machine, open a terminal and type:
ssh [email protected]
On a Windows 10 machine, you can connect to NCC using the same method as on Linux by using the Windows Subsystem for Linux (WSL). This is the best method to access NCC from Windows. Have a look at this link to install WSL. See this link for an introduction to WSL. Once you have a Linux-like terminal running on Windows using WSL, you can connect to NCC using:
ssh [email protected]
On other versions of Windows, you can access the system using PuTTY. From the PuTTY GUI, connect to ncc1.clients.dur.ac.uk and then type your username and password when prompted. On Windows 10, we recommend WSL instead of PuTTY because it is a lot easier to transfer files and open new terminals with WSL than with PuTTY.
Outside the University
NCC is not directly accessible from outside the university for security reasons. You can either ssh to the university gateway called
mira.dur.ac.uk or use the VPN. You have to request access to both of these services via forms on the CIS webpages, and both are secured with Multi Factor Authentication.
https://durhamuniversity.sharepoint.com/Sites/MyDigitalDurham/SitePages/ServicePage.aspx?Service=%22Linux%20Timeshare%20(MIRA)%22 for information on Mira and
https://durhamuniversity.sharepoint.com/Sites/MyDigitalDurham/SitePages/ServicePage.aspx?Service=%22Access%20VPN%20-%20Staff%22 for information on the VPN. The direct link to the form to request VPN access is here.
In practice, it seems that Mira is actually faster than the VPN if you have a fast fibre broadband connection. For general usage, though, we recommend the VPN, as this has the advantage of letting you copy files directly to and from NCC.
From a Linux client or Windows WSL, it is possible to transparently connect to NCC using Mira as a proxy. To do that, add the following to your SSH client configuration file (create the file if it does not exist):
Host *.clients.dur.ac.uk
    ProxyCommand ssh email@example.com nc %h %p
then connect to NCC as if from inside the university. You will be prompted twice for the password (once for Mira, once for NCC). This is advantageous because it means that you can copy files using scp directly to/from NCC without having to copy them first to Mira.
If you have never used Linux before
The system is based on Ubuntu Linux 20.04 and does not have any graphical interface, so you must get used to the Linux command line in order to use the system. You can get familiar with the Linux terminal using one of the many tutorials available on the Internet (for instance, this one).
Getting your Data on the System
NCC is available through SSH at
ncc1.clients.dur.ac.uk. Therefore you can copy your programs and input data to and from the system using scp, rsync or sshfs. While the command line is the recommended way of doing it, you can also use a GUI tool such as FileZilla, which is available on both Windows and Linux. For your code, Git (using Bitbucket or GitHub for instance) is the best way to keep everything synchronized.
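As an illustration of the rsync route mentioned above, the helper below is a small sketch that copies a local directory to NCC. The function name `push_to_ncc` and its argument order are hypothetical; only the host name comes from this guide, and the command assumes you are on the university network, the VPN, or using the Mira proxy configuration.

```shell
# Hypothetical helper: copy a local directory to a directory on NCC.
# Usage: push_to_ncc <username> <local_dir> <remote_dir>
NCC_HOST="ncc1.clients.dur.ac.uk"

push_to_ncc() {
    local user="$1" src="$2" dest="$3"
    # -a preserves permissions and timestamps, -v lists transferred files,
    # --partial keeps partially transferred files so a retry can resume.
    rsync -av --partial "${src%/}/" "${user}@${NCC_HOST}:${dest%/}/"
}
```

For example, `push_to_ncc yourusername ./data project/data` would mirror the contents of `./data` into `project/data` under your NCC home directory. Swapping the two rsync arguments copies results back out.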
NCC is not a storage system and is not backed up. You must not leave large amounts of unused data on NCC for extended periods of time. Please copy your results out of NCC and delete any temporary files or job output from the system as soon as possible.
Please note that, apart from a few exceptions, your NCC account and data will be deleted once your CIS account ceases to exist. Make sure that all important data has been copied out of the system if you are leaving the university or graduating.
Your storage space on NCC is limited by a quota. The default quota is 100GB if your account is on the old storage node (/home2). Accounts created after July 2023 have a default quota of 250GB and reside on our new storage node (/home3). Your account can be moved from /home2 to /home3 on request, but this will break any Python virtual environments that you have created and you will need to rebuild them.
You can check your quota and current utilization using the command
quota. If you are using the old storage, this command will not work and you must use
df -h $HOME.
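When you are close to your quota, it helps to see where the space has gone. The function below is a small sketch using standard GNU tools (`du`, `sort`, `head`); the name `largest_entries` is hypothetical and not an NCC command.

```shell
# Print the ten largest files or directories directly under a directory
# (defaults to your home directory), biggest first.
# Note: hidden entries (names starting with a dot) are not matched by
# the * glob, so caches such as ~/.cache are not listed here.
largest_entries() {
    local dir="${1:-$HOME}"
    du -sh -- "$dir"/* 2>/dev/null | sort -rh | head -n 10
}
```

Running `largest_entries` with no argument inspects your home directory; pass a path to inspect a specific project folder.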
If you need to use more than your current quota, please send a request by email to [email protected].
Once you have copied your data and code on NCC, half of the work is done to get you started on NCC.
NCC uses SLURM to schedule jobs on the system which is also used on the university-wide Hamilton HPC. The role of the SLURM scheduling system is to divide the cluster resources fairly between all the jobs submitted by the users. Occasionally, this means that your job might stay a little while in a queue while the system is busy processing other jobs. Once SLURM allocates resources to your job, the job is guaranteed to have full usage of the resources until it terminates.
To use SLURM two major components need to be dealt with:
- the batch scripting language which tells SLURM how to run your job;
- the command line tools to start and control your job.
ALL ACTIVITY MUST BE THROUGH THE SLURM SYSTEM. THIS ENSURES FAIR USE.
When scheduling a job/simulations/render through SLURM a script needs to be written which the job scheduler parses for instructions to itself and then passes the remaining instructions to the command line for execution once the job starts. SLURM batch scripts can be written in any terminal scripting language. As ‘BASH’ is the default shell environment on NCC, this tutorial will use bash syntax.
Below is a sample SLURM job script. The comments explain what each statement does.
#!/bin/bash
# This line is required to inform the Linux
# command line to parse the script using
# the bash shell

# Instructing SLURM to locate and assign
# X number of nodes with Y number of
# cores in each node.
# X,Y are integers. Refer to table for
# various combinations
#SBATCH -N X
#SBATCH -c Y

# Governs the run time limit and
# resource limit for the job. Please pick values
# from the partition and QOS tables below
# for various combinations
#SBATCH -p "partitionname"
#SBATCH --qos="qosname"
#SBATCH -t DD-HH:MM:SS

# Source the bash profile (required to use the module command)
source /etc/profile

# Run your program (replace this with your program)
./hello-world
The lines in the script above are required to run any job. The first line creates the user environment for the job. The next two are used to schedule and allocate resources to the job. The next two lines control the resource limits applied to a job. The partition indicates the set of nodes that the job can run on. Each partition has restrictions on the amount of resources that can be used by a job. Pick the partition that suits your needs. The QOS specifies how long the job can run at most. Longer QOS have stricter restrictions on the number of jobs that can be run at once. Please refer to the next two tables for a description of all the available partitions and QOS.
The final line is the maximum walltime that the job can run for. The default walltime is 30 minutes (regardless of QOS and partitions).
#SBATCH -t HH:MM:SS or
#SBATCH -t DD-HH
-N and -c specify the resources (nodes and CPU cores) needed by the job. Once your job is running, those resources will be allocated exclusively to your script, so specify the exact amount needed. The default amount of memory allocated to a job is 4GB per allocated CPU core. To allocate more (or less) memory, use the --mem option.
When the system is busy, it is in your interest to request as few resources as possible. Smaller jobs will start much more quickly than bigger jobs. The current scheduler greedily starts jobs as soon as resources are available for them, so large jobs may never start even when they are the oldest jobs in the queue.
Please request the exact amount of resources required by your job. It is not okay to request an entire node for a single-threaded program.
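As a concrete illustration of the placeholders used so far, the header below sketches a small CPU-only request. The values (2 cores, 8G of memory, 6 hours) are arbitrary assumptions chosen for the example; the `cpu` partition and `short` QOS names are the ones described in this guide.

```shell
#!/bin/bash
# One node
#SBATCH -N 1
# Two CPU cores
#SBATCH -c 2
# Memory for the job on that node
#SBATCH --mem=8G
# Partition and QOS (see the partition and QOS tables)
#SBATCH -p cpu
#SBATCH --qos=short
# Walltime of 6 hours, in DD-HH:MM:SS format
#SBATCH -t 0-06:00:00
```

The commands that run your program would follow these header lines, as in the sample job script above.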
The cluster is split into several overlapping partitions of nodes and you should use the most appropriate one for your job. There are three partitions. For CPU-only jobs, use the cpu partition, which will give you access to a pool of 134 CPU cores. For GPU jobs, use either the gpu-small or the gpu-large partition. The gpu-large partition has a restricted set of nodes, excluding nodes that are not able to accommodate large jobs.
The limits are purposefully set to maximise the number of jobs that can be scheduled at once. If you require more memory or CPU than available through the
gpu-large partition, please talk to an NCC administrator to get an exemption.
|Partition||Defaults||Limits per job||Available GPUs||Restricted to|
|res-gpu-small||cpu=1,mem=2G/cpu||gpu≥1,cpu≤4,mem≤28G||64***||research PG & staff|
|res-gpu-large**||cpu=1,mem=2G/cpu||gpu≥1,cpu≤16,mem≤28G||4||research PG & staff|
* On the CPU partition, a job can span multiple nodes however there is a limit of 32 cores and 60G memory per node.
** mem refers to system RAM, and not GPU VRAM. So the partitions res-gpu-large and gpu-bigmem just give access to additional CPU cores and system RAM, not larger GPUs.
*** Only 64 of the 82 GPUs we have are directly available for SLURM jobs. There are 18 80GB A100 cards that have been virtualised into 126 10GB cards for use with Jupyter. You can learn more about the technology behind this here.
You must also choose a QOS that defines how many jobs you can run simultaneously and for how long.
In your job script, you must specify one of the following QOS:
|QOS||Max jobs per user||Default walltime||Max walltime||Comment|
|debug|| ||30 minutes||2 hours|| |
|short|| ||30 minutes||2 days|| |
|long-high-prio||4||30 minutes||7 days|| |
|long-low-prio||5||30 minutes||7 days||Job might be preempted|
|long-cpu|| ||30 minutes||14 days||cpu partition only|
In addition to the QOS cap on the number of jobs, there is a cap of 4 simultaneously used GPUs for each user across all their jobs and 1 GPU per job. This cap might be raised or lowered without notice based on overall activity on the cluster. If you need to use multiple GPUs in the same job, please read the paragraph “Unusual requirements” below.
Please note that the limits in the QOS and partition tables are the maximum values that you can request; they are not defaults. Even with the long-high-prio QOS, your job will be killed after 30 minutes unless you have explicitly stated a larger walltime in your script (using -t). Similarly, in the gpu-large partition, your job will be killed if it attempts to use more than 2GB of RAM unless you have explicitly asked for a higher memory limit (using --mem).
With the move to SLURM, we are experimenting with preemption. Please read the preemption information if you are using the long-low-prio QOS to avoid data loss.
A note on memory and swap
Each NCC node has a swap partition. Swap is used by the system to store some of your program memory if you run above the allowed amount of RAM (similar to Windows’ paging system). This process is entirely transparent to you and the amount of swap that you use is not restricted; however, it is not guaranteed either.
Therefore your program might be allowed to continue running even if it uses more memory than the amount of RAM requested (using --mem). This is particularly useful if your program needs more memory than available on NCC. In practice, there are currently some shortcomings: disk read/write buffers count toward the maximum amount of RAM and can cause your program to run over the RAM limit and be killed, so swapping is not guaranteed to work well while reading or writing files. Those shortcomings will be addressed in a future upgrade of NCC.
If your job does not fit into any of the partitions/QOS above or you need to use multiple GPUs, please contact [email protected] outlining your requirements to find an appropriate solution. We will try to accommodate any reasonable, non-frivolous request within the capacity of the system.
Please consider alternative solutions before contacting us. Out of memory is often caused by bugs and design problems rather than real needs. Multi-GPU is never needed to increase the batch size (because gradients can be accumulated over multiple iterations, e.g. the Caffe iter_size parameter). Besides, writing code that scales well with multiple GPUs is hard. If a task will last more than 2 days, consider splitting it into multiple sequential jobs; you can even submit a new job using sbatch from inside the batch script of another job.
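One way to run a long task as several sequential jobs is sketched below. Note that this is a different technique from resubmitting inside the batch script: it queues every part up front and uses sbatch's `--dependency=afterok:<jobid>` option so that each part starts only after the previous one has finished successfully. The function name `submit_chain` is hypothetical, and the sketch assumes `sbatch --parsable` prints only the job ID, which is the case on a single-cluster setup.

```shell
# Submit the same job script <count> times as a chain of dependent jobs.
# Usage: submit_chain <jobscript> <count>
submit_chain() {
    local script="$1" count="$2" prev="" jobid i
    for ((i = 1; i <= count; i++)); do
        if [ -z "$prev" ]; then
            # First part: no dependency.
            jobid=$(sbatch --parsable "$script")
        else
            # Later parts wait for the previous part to succeed.
            jobid=$(sbatch --parsable --dependency=afterok:"$prev" "$script")
        fi
        echo "submitted part $i as job $jobid"
        prev="$jobid"
    done
}
```

This only makes sense if the job script resumes from where the previous part stopped, for example by reading a checkpoint file. Queued parts may also count toward your job limits, so keep the chain short.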
Please skip this section if you do not want to use GPUs.
You can request a single GPU using the --gres=gpu option, as in the sample job script below.
Compile your code with both compute capability 6.1 (Pascal) and 7.5 (Turing), or just one of the two if you only intend to work on a single architecture. Your code must have been linked with CUDA 8.0 or above. Some programs such as Caffe need the GPU number. On NCC, the GPUs allocated to your job are always numbered from 0. If you asked for one GPU, it will always be GPU 0. If you asked for two GPUs, they will always be 0 and 1, and so on.
NCC provides CUDA libraries using the module system. You can see a list of available modules with the command
module available. To load a specific version of CUDA:
module load cuda/8.0
Note that some libraries such as PyTorch are bundled with their own CUDA libraries and you do not need to load any module.
Here is a sample job script allocating one GPU on one node based on CUDA 8.0:
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu
#SBATCH -p gpu-small
#SBATCH --qos=debug
#SBATCH --job-name=HiThere
source /etc/profile
module load cuda/8.0
./hello_world
Specific GPUs can be requested by type. The types available are currently pascal, turing, ampere and 1g.10gb and you can specify these in your batch file as follows:
|pascal||Titan X or Titan XP||2 x Titan X, 14 x Titan XP|
|turing||2080 Ti or Titan RTX or Quadro RTX 8000||24 x 2080 Ti, 8 x Titan RTX, 1 x Quadro RTX 8000|
|ampere||RTX A6000 or 80GB A100 (PCIe)||3 x RTX A6000, 12 x 80GB A100 (PCIe)|
|1g.10gb||Virtual 10GB GPU||126 x Virtual 10GB GPU (each equivalent to 1/7 compute performance of an 80GB A100)|
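The type names in the table can be combined with the generic SLURM `--gres=gpu:<type>:<count>` syntax. The exact directive used on NCC is not shown here, so treat the following as a sketch under that assumption: in a batch file the request would look like `#SBATCH --gres=gpu:turing:1`. The hypothetical helper `gres_for` below builds the same string and rejects type names that are not in the table.

```shell
# Build a --gres value for one of the GPU types listed above.
# Usage: gres_for <type> [count]
gres_for() {
    local type="$1" count="${2:-1}"
    case "$type" in
        pascal|turing|ampere|1g.10gb)
            echo "gpu:${type}:${count}"
            ;;
        *)
            echo "unknown GPU type: $type" >&2
            return 1
            ;;
    esac
}
```

Because sbatch accepts on the command line the same flags that can be used inside a script, a job could then be submitted with `sbatch --gres="$(gres_for turing)" jobscript`.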
Other Optional Definitions
There are a few other batch instructions a user can add to make better use of the resources and to streamline their work.
#SBATCH -e stderr-filename
#SBATCH -o stdout-filename
The options above allow a user to specify a file name for the standard error (-e) and for the standard output (-o). These can be set to a more descriptive name, but this is not recommended because each run will overwrite the previous logs unless the file name is changed. By default, unless -e is specified, SLURM will write stderr to the same file as stdout.
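One way to keep descriptive log names without overwriting earlier runs is to include the job ID in the file name. SLURM replaces `%j` in these file names with the job ID. The `myjob` prefix below is a hypothetical name chosen for the example.

```shell
#!/bin/bash
# %j expands to the job ID, so every run gets its own pair of log files.
#SBATCH -o myjob-%j.out
#SBATCH -e myjob-%j.err
```

With these two lines, a job with ID 12345 would write to `myjob-12345.out` and `myjob-12345.err`.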
#SBATCH --job-name=jobnamehere
Assign a name to the job. This helps locate it in the queue or in email notifications. The name cannot start with a number or contain spaces. Once the headers are complete, normal bash commands can be used to define the steps the job needs to take (see the example script at the bottom of this page).
#SBATCH --mail-type=[BEGIN, END, FAIL, REQUEUE, or ALL]
Defines on which events the server should send an email notification.
#SBATCH --mail-user [email protected]
Assign an email address to send updates to.
Tools: Starting and controlling your jobs
Once the job script is prepared it has to be passed on to the job scheduler. There are various command line tools, which allow the user to submit, delete and check the status of jobs and queues.
[username@ncc ~]$ sbatch jobscript
The sbatch command submits a batch file to the SLURM job queue.
sbatch takes as arguments the job script as well as any flags that can be used inside a script. (Please see Interactive Jobs for more information).
[username@ncc ~]$ squeue
The squeue command shows the status of all the queues running on the system.
[username@ncc ~]$ scancel jobid
To delete a job from the system the scancel command can be used. Note only jobs submitted by you can be deleted using the scancel command. If you feel a job is stuck and not responding to the scancel command please contact an administrator.
[username@ncc ~]$ sinfo
The sinfo command can be used on the head nodes of the cluster to see which nodes are down, available or busy with jobs.
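The tools above can be combined in small scripts. As a sketch, the hypothetical function `wait_for_job` below polls squeue until a given job is no longer listed. It relies on the standard squeue options `-h` (no header) and `-j` (select a job by ID). The default polling interval of 60 seconds is an arbitrary choice intended to keep the load on the head node low.

```shell
# Block until a job has left the queue, polling squeue at a fixed interval.
# Usage: wait_for_job <jobid> [seconds_between_checks]
wait_for_job() {
    local jobid="$1" interval="${2:-60}"
    # -h drops the header line and -j restricts the listing to one job,
    # so the output is empty once the job is no longer queued or running.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $jobid has left the queue"
}
```

This does not say whether the job succeeded; check the job output file or the accounting tools for that.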
Sample Job Script
The job script below requests one core of one node from SLURM. The job is called HiThere. It runs in the cpu partition with the debug QOS. The standard error and output are combined in one file. The script then runs the hello_world program from the directory the job was submitted from.
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH -p cpu
#SBATCH --qos=debug
#SBATCH --job-name=HiThere
./hello_world
Checking your Jobs
Whether you want to use one or two GPUs, it pays to double-check that your code is actually using the resources. To do that, do the following:
- Find which machine your job is running on using
nvidia-smi on the head node. The right column lists all processes and the associated user. Check that you can see your process there. If it is not there, then you might have misconfigured your job.
CPU and RAM Usage
You can query for statistics about a currently running job using:
sstat <jobid>.batch|less -S
You can query for statistics about all finished jobs using:
sacct -l |less -S
The peak RAM usage is shown in the column MaxRSS and CPU usage is available in the column AveCPU.
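The full `sacct -l` listing is very wide. As a sketch, the hypothetical helper `job_usage` below asks sacct for only a handful of columns for a single job, using the standard `-j` and `--format` options. The chosen field list is an assumption about what is most useful, not an NCC convention.

```shell
# Show a compact usage summary for one finished job.
# Usage: job_usage <jobid>
job_usage() {
    local jobid="$1"
    # MaxRSS is the peak RAM usage and AveCPU the CPU time, as described above.
    sacct -j "$jobid" --format=JobID,JobName,Elapsed,MaxRSS,AveCPU,State
}
```

Comparing MaxRSS with the memory you requested is a quick way to check whether the next submission can ask for less.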
Some general graphs are available on our Ganglia webserver.
NCC has a module system to load additional software. A full list of the modules available on NCC can be shown using:
[username@ncc ~]$ module avail
By default, SLURM will write the output of your job to a file called
slurm-<yourjobid>.out. As the job output is written to a file rather than to an interactive terminal, most programs (e.g. Python, C/C++ standard library) will buffer the standard output and write it in blocks of 4 KB. This can be very inconvenient if you are trying to debug your program by regularly inspecting the output file.
To avoid this problem, you can force your program to use line-buffering using stdbuf (modify your batch script as follows):
stdbuf -oL ./yourprogram
To use Matlab, in your job script, load the desired version using either
module load matlab/2015b or
module load matlab/2017a.
We provide both
python3. If you require a specific package that is not installed on NCC and is available through the standard Ubuntu repository, speak to an NCC admin to install it.
If you would like to use a package that is not available in the Ubuntu repository, or you would like to use a more recent version of a package, you can create your own Python virtual environment using virtualenv (see this very good tutorial). You can also use Anaconda, but beware that Anaconda tends to have a restricted and out-of-date set of packages and does not integrate well with programs installed outside of its environment.
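As a rough sketch of the virtualenv route, the hypothetical function `make_env` below creates an environment for python3, activates it in the current shell, and installs any packages passed as extra arguments. The directory and package names in the usage note are placeholders.

```shell
# Create and activate a Python 3 virtual environment, then install packages.
# Usage: make_env <env_dir> [package ...]
make_env() {
    local env_dir="$1"
    shift
    # -p selects the interpreter the environment is built around.
    virtualenv -p python3 "$env_dir" || return 1
    # Activating changes PATH so that python and pip point into env_dir.
    source "$env_dir/bin/activate"
    if [ "$#" -gt 0 ]; then
        pip install "$@"
    fi
}
```

For example, `make_env ~/envs/myproject numpy` would create the environment and install numpy into it. In a job script, only the `source .../bin/activate` line needs to be repeated before running your program.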
OpenCV 3.1 is available as a module which you can load using
module load opencv/3.1. This comes with Python 2.7 and CUDA 7.5 support.
Do not use this version of OpenCV with CUDA 8.0, as this will cause your program to crash with mysterious errors because this version of OpenCV was built specifically against CUDA 7.5. In particular, even common OpenCV operations might use CUDA internally, resulting in various incompatibilities if your program uses CUDA 8.0.
As an alternative, you can also use the version of OpenCV packaged in Anaconda. CUDA support was not enabled in the Anaconda version therefore it should work fine regardless of the CUDA version that you might be using somewhere else in your program.
SLURM State Codes and Pending Reasons
Your job typically passes through several states throughout its execution. We outline below the most common status codes visible in the output of
- PD: Pending. Your job is awaiting resource allocation. The reason is displayed in the column on the right. Check the list of common reasons below.
- R: Running. Be patient!
- CG: Completing. Your job just finished.
- CA: Cancelled. The job has been cancelled.
Common reasons for pending jobs are:
- Resources: There are no available resources for your job yet. Just wait and your job should start as soon as resources are available.
- Priority: Another job with a higher priority than this job is also waiting. Your job will start once resources are available and all higher priority jobs have started. The job priority is used to assign resources as fairly as possible based on past cluster usage.
- AssocMaxGRESPerJob: You are trying to use more than one GPU. Please get in touch with an administrator if you need multi-GPU.
- AssocGrpGRES: You have reached the limit on how many GPUs you can use at once. Just wait and your job should start once one of your running jobs finishes.
- QOSMaxMemoryPerJob: You are trying to use more memory than allowed. Try the gpu-large partition; if you need even more memory, speak to an admin.
- QOSMaxCpuPerJobLimit: You are trying to use more CPUs than allowed. Try the gpu-large partition; if you need even more CPUs, speak to an admin.
- QOSMaxWallDurationPerJobLimit: You are trying to run a job for longer than allowed. Try the QOS long-high-prio; if 7 days is not enough, speak to an admin.
- QOSNotAllowed: You are not allowed to use this QOS, or you are trying to use the long-cpu QOS (reserved for the cpu partition) on one of the GPU partitions.
- ReqNodeNotAvail: You have explicitly asked for a specific node using --nodelist, but this node is currently down (usually for maintenance).
- QOSMinGRES: You have submitted a job to a GPU partition but haven’t requested any GPU. Please look at the above job script examples.
- InvalidAccount: There is an issue with your account, speak to an admin.
A longer list of possible reasons is available here.
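To see the pending reason for your own jobs at a glance, squeue can be restricted to one user and one state. The hypothetical function `pending_jobs` below is a sketch using the standard squeue options `-u`, `-t` and `-o`; the column widths are arbitrary.

```shell
# List a user's pending jobs together with the reason they are waiting.
# Usage: pending_jobs [username]   (defaults to the current user)
pending_jobs() {
    local user="${1:-$USER}"
    # %i = job ID, %j = job name, %P = partition, %r = pending reason
    squeue -u "$user" -t PENDING -o "%.10i %.20j %.15P %r"
}
```

The last column can then be looked up in the list of reasons above.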
How can I change my shell to ZSH?
Use the command
chsh.ldap youruserid -s /bin/zsh then relogin to NCC after 15 minutes.
How can multiple users share/work together on the same data?
We can set up a new user group and directory under /projects with the right permissions for you. Please contact an admin for that.
Can you install PyTorch on NCC?
No, like TensorFlow, PyTorch moves too quickly to be installed globally in an HPC environment. Install your own version using virtualenv or Anaconda.
TensorFlow uses a lot of threads, should I book an entire node?
No, despite using a lot of threads, the actual CPU usage of TensorFlow is usually much lower. Try to empirically estimate the average CPU usage of your code, round it up and request that number.
My job has been queued for a day and still not started!
NCC might be unusually busy. Double-check that you have not requested more resources than NCC can provide, because SLURM does not always warn you if your request is impossible to fulfil. In particular, absurd requests for a million hours of walltime, 100 CPUs or GPUs, or 1000GB of memory will be silently accepted by SLURM but never actually scheduled. If you requested a large number of CPU or GPU resources, consider reducing the number. SLURM takes into account the cost of running each job and will prioritize small jobs over larger ones, so it is highly unlikely that 16 CPUs will be scheduled any time soon if there is a steady queue of small jobs. If you really need that many resources and NCC is busy, please contact an admin.
Can I interactively debug my program on NCC (e.g. using gdb)?
Yes; however, please consider doing it on your own development machine first. If you don’t have the resources to do so except on NCC then you can start an interactive job using:
srun -N X -c Y --gres=gpu:Z --pty /bin/bash
This will allocate the desired resources and start an interactive shell inside a job. Then you can proceed with debugging as usual. Please close this shell (by typing exit or pressing Ctrl+D) as soon as you no longer need it, to avoid wasting resources.
Can you install program X?
Yes, if the program is available through the Ubuntu package manager (see the list here). For Python packages only available through pip, please use virtualenv or Anaconda. For other software, we might consider installing it if it is of general use (e.g. Matlab) or too difficult to install in your home folder.
My code using OpenCV crashes on NCC!
If you are using CUDA, make sure that you have loaded the OpenCV module matching the version of CUDA that you would like to use. Speak to an admin if you can’t find it.
I use x (where x is linux-based) and it’s too difficult to port my code to NCC environment! What can I do?
We have an experimental deployment of Docker. You can have your own Linux distribution with root permissions inside a container on NCC. Speak to an NCC admin if you want to try it out.