PyTorch

These pages are currently a work in progress. If you would like to contribute information, please get in touch by email ([email protected]).

Installing PyTorch

We recommend installing PyTorch in a separate virtual environment for each project. For example:

cd myproject
virtualenv3 env
source env/bin/activate
pip install torchvision

PyTorch is bundled with its own CUDA libraries, which considerably simplifies the setup. However, if you are also using other CUDA-based libraries, the two might conflict.
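After installation, a quick sanity check can confirm which CUDA version the bundled libraries target and whether a GPU is visible. The sketch below is a minimal example rather than part of the official setup; the helper name torch_install_summary is hypothetical, and it returns None when PyTorch cannot be imported (for instance, outside the virtual environment).

```python
import importlib


def torch_install_summary():
    """Summarise the PyTorch install, or return None if torch is not importable.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    return {
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "bundled_cuda": torch.version.cuda,
        # False when no usable GPU is visible to this process.
        "gpu_visible": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(torch_install_summary())
```

If another CUDA-based library in the same environment expects a different CUDA version than the one reported under bundled_cuda, that is a likely place to look when the two conflict.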

Configuring PyTorch

Since PyTorch is bundled with its own CUDA libraries, the batch script is fairly simple.

#!/bin/bash
#SBATCH -N 1
#SBATCH -c required-amount-of-core
#SBATCH --gres=gpu
#SBATCH -p please-pick-a-partition
#SBATCH --qos=please-pick-a-qos
#SBATCH -t please-pick-a-walltime
#SBATCH --job-name=please-pick-a-jobname
#SBATCH --mem=required-amount-of-ramG

source /etc/profile
source env/bin/activate

python your-job-script.py ...

Please replace the following placeholders with the relevant details: required-amount-of-core, please-pick-a-partition, please-pick-a-qos, please-pick-a-walltime, please-pick-a-jobname and required-amount-of-ram. The number of cores should reflect the number of PyTorch data loading workers. A good rule of thumb is to have twice as many workers as cores. However, PyTorch workers tend to use more memory than necessary, and this can become a constraint. Read the section on “Excessive PyTorch DataLoader memory usage” to learn how to reduce the memory cost of PyTorch workers.
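The rule of thumb above can be sketched as follows. This is a minimal example under stated assumptions: it reads SLURM_CPUS_PER_TASK, which SLURM sets when the job requests cores with -c, and the helper name suggested_num_workers is hypothetical. The result would be passed as num_workers to torch.utils.data.DataLoader.

```python
import os


def suggested_num_workers(environ=os.environ, workers_per_core=2, fallback_cores=1):
    """Derive a DataLoader worker count from the SLURM core allocation.

    Hypothetical helper: applies the "twice as many workers as cores" rule
    of thumb, falling back to a single core when the script is not run
    under SLURM (SLURM_CPUS_PER_TASK is then absent).
    """
    cores = int(environ.get("SLURM_CPUS_PER_TASK", fallback_cores))
    return max(1, cores * workers_per_core)


if __name__ == "__main__":
    num_workers = suggested_num_workers()
    print("num_workers:", num_workers)
    # In a training script this value would be used as:
    #     loader = torch.utils.data.DataLoader(dataset, num_workers=num_workers)
```

Deriving the value from the allocation keeps the script and the batch file in step: changing -c in the batch script changes the worker count without editing the Python code.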

Visualisation

Most PyTorch-based scripts use Visdom to visualise the training. Since NCC is composed of multiple machines and many users, the scripts often need small adjustments to work on NCC.

On NCC, you should start Visdom on the head node. We advise using tmux to keep the server running after you close your SSH connection. Once Visdom is running on the head node, you can configure the Visdom client in your script to connect to that server.

Since there are many users, pick a port that is not currently used by someone else. The default port (8097) is extremely likely to be taken already, so do not use it.

Starting the server on the head node:

python -m visdom.server -port pick-a-port-that-is-not-already-used
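One way to find a candidate port is to ask the operating system, as in the sketch below. The helper names are hypothetical, and the check is only a snapshot: another user could still take the port between the check and the moment Visdom starts, so run it on the head node just before launching the server.

```python
import socket


def port_is_free(port):
    """Return True if a TCP socket can currently be bound to `port`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("", port))
        except OSError:
            # Typically "address already in use" (or a privileged port).
            return False
        return True


def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    print("candidate port:", find_free_port())
```

The printed number can then be used both for the -port argument of the server and for the port argument of the client.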

Configuring the Visdom client in your Python script:

import visdom
viz = visdom.Visdom(server="http://ncc1.clients.dur.ac.uk", port=your-chosen-port)

Excessive PyTorch DataLoader memory usage

The default settings in PyTorch can waste a lot of memory if you are using multiple workers for data loading (and you should be).

By default, PyTorch creates new workers by “forking” the current process, so each new worker becomes an identical copy of your running process. Unfortunately, this means that all the memory, including the weights of your network, gets duplicated inside every worker even if the worker does not use it. You cannot work around this by starting the workers before loading your network: even if you managed to do it, the workers are respawned at each epoch.

The simple solution in Python 3 is to change the worker creation method to avoid forking. Two start methods circumvent the problem: forkserver and spawn. For instance, to use forkserver, add the following:

import torch

if __name__ == '__main__':
    # Note: You must put all your training code into one function rather than in the global scope
    #       (this is good practice anyway).
    #       Subsequently you must call the set_start_method and your main function from inside this
    #       if-statement. If you don't do that, each worker will attempt to run all of your training
    #       code and everything will go very wild and very wrong.
    torch.multiprocessing.set_start_method('forkserver')
    my_main_function()

If you are using Python 2, the solution is to switch to Python 3 (sorry).
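Which start methods are available depends on the platform. The sketch below uses the standard-library multiprocessing module, whose API torch.multiprocessing mirrors, to pick the first available method that avoids forking. The helper name pick_start_method is hypothetical.

```python
import multiprocessing


def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first preferred start method this platform supports, else None."""
    available = multiprocessing.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    return None


if __name__ == "__main__":
    method = pick_start_method()
    print("start method:", method)
    # In a training script you would then call, inside this same guard:
    #     torch.multiprocessing.set_start_method(method)
    #     my_main_function()
```

Listing forkserver first matches the example above; spawn is the fallback on platforms where forkserver is not offered.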

Building PyTorch’s accimage

Accimage is not available through pip. If you would like to use it, you need to build it using the following steps:

source myenv/bin/activate
git clone https://github.com/pytorch/accimage
cd accimage
python setup.py build_ext --library-dirs=/opt/intel/compilers_and_libraries_2019.3.199/linux/ipp/lib/intel64_lin
python setup.py install

To use it as the torchvision image backend instead of PIL, set this in your Python code:

import torchvision
torchvision.set_image_backend('accimage')

and add to your SLURM batch script:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/compilers_and_libraries_2019.3.199/linux/ipp/lib/intel64_lin/
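If LD_LIBRARY_PATH is not set as above, importing accimage is likely to fail because the Intel IPP libraries cannot be found. As a sketch, a script can fall back to PIL in that case. The helper name choose_image_backend is hypothetical.

```python
import importlib


def choose_image_backend():
    """Return 'accimage' if it can be imported, otherwise 'PIL'."""
    try:
        importlib.import_module("accimage")
    except ImportError:
        # Covers both "not installed" and "shared library not found".
        return "PIL"
    return "accimage"


if __name__ == "__main__":
    backend = choose_image_backend()
    print("image backend:", backend)
    # With torchvision installed, apply the choice with:
    #     torchvision.set_image_backend(backend)
```

Printing the chosen backend at the start of a job makes it easy to spot in the job log when a run has silently fallen back to PIL.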