These pages are currently a work in progress. If you would like to contribute some information, please get in touch by email ([email protected]).
We recommend installing PyTorch in a separate virtual environment for each project. For example:
cd myproject
virtualenv3 env
source env/bin/activate
pip install torchvision
PyTorch is bundled with its own CUDA libraries, which considerably simplifies the setup. However, if you are using other CUDA-based libraries, the two might conflict.
Since PyTorch is bundled with its own CUDA libraries, the batch script is fairly simple.
#!/bin/bash
#SBATCH -N 1
#SBATCH -c required-amount-of-core
#SBATCH --gres=gpu
#SBATCH -p please-pick-a-partition
#SBATCH --qos=please-pick-a-qos
#SBATCH -t please-pick-a-walltime
#SBATCH --job-name=please-pick-a-jobname
#SBATCH --mem=required-amount-of-ramG

source /etc/profile
source env/bin/activate

python your-job-script.py ...
Please replace the placeholder values of -c, -p, --qos, -t, --job-name and --mem with the relevant details.
The number of cores should reflect the number of PyTorch data loading workers. A good rule of thumb is to have twice as many workers as cores. However, PyTorch workers tend to use more memory than necessary, and this can become a constraint. Read the section on “Excessive PyTorch DataLoader memory usage” to learn how to reduce the cost of PyTorch workers.
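As a rough sketch of the rule of thumb above, the worker count can be derived from the cores SLURM actually allocated instead of being hard-coded. This assumes the job was submitted with -c, in which case SLURM exports SLURM_CPUS_PER_TASK. The helper name suggested_num_workers and the fallback to a single core are illustrative choices, not part of the NCC setup.

```python
import os


def suggested_num_workers(workers_per_core=2, environ=None):
    """Return a DataLoader worker count based on the cores SLURM allocated.

    Assumption: SLURM sets SLURM_CPUS_PER_TASK when the job requests
    cores with -c. Outside a job we fall back to a single core.
    """
    environ = os.environ if environ is None else environ
    cores = int(environ.get("SLURM_CPUS_PER_TASK", "1"))
    return max(1, cores * workers_per_core)


if __name__ == "__main__":
    # The result would be passed as num_workers when building a DataLoader.
    print(suggested_num_workers())
```

If memory becomes the constraint, lowering workers_per_core to 1 is the simplest adjustment.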
Most PyTorch-based scripts use Visdom to visualise the training. Since NCC is composed of multiple machines and many users, the scripts often need small adjustments to work on NCC.
On NCC, you should start Visdom on the head node. We advise using tmux to keep the server running even after you have closed your SSH connection. Once Visdom is running on the head node, you can configure the Visdom client to connect to the Visdom server.
Since there are many users, pick a port that is not currently used by someone else. The default port (8097) is extremely likely to be taken already, so do not use it.
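One possible way to find an unused port is to try binding to candidates on the head node before starting the server. The sketch below uses only the Python standard library. The function names and the 8100 to 8200 search range are arbitrary illustrations, and the check is only meaningful when run on the machine that will host the Visdom server.

```python
import socket


def is_port_free(port, host=""):
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": someone else has the port.
            return False
        return True


def find_free_port(start=8100, stop=8200, host=""):
    """Return the first free port in [start, stop), or None if all are taken."""
    for port in range(start, stop):
        if is_port_free(port, host):
            return port
    return None


if __name__ == "__main__":
    print(find_free_port())
```

Another user could still claim the port between the check and the server start, so if the server fails to start, simply pick again.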
Starting the server on the head node:
python -m visdom.server -port pick-a-port-that-is-not-already-used
Configuring the Visdom client in your Python script:
import visdom
viz = visdom.Visdom(server="http://ncc1.clients.dur.ac.uk", port=your-chosen-port)
Excessive PyTorch DataLoader memory usage
The default settings in PyTorch can lead to a lot of memory wastage if you are using multiple workers for data loading — and you should be.
By default, PyTorch creates new workers by “forking” the current process. Each new worker becomes an identical copy of your running process. Unfortunately, this means that all the memory, including all the weights of your network, gets duplicated inside the worker even if the worker does not use it. You cannot work around this by starting the workers before loading your network: even if you managed to do it, the workers are respawned at each epoch.
The simple solution in Python 3 is to change the worker creation method to avoid forking. There are two possible methods that will circumvent the problem: forkserver and spawn. For instance, to use forkserver, add the following:
if __name__ == '__main__':
    # Note: You must put all your training code into one function rather than in the global scope
    # (this is good practice anyway).
    # Subsequently you must call the set_start_method and your main function from inside this
    # if-statement. If you don't do that, each worker will attempt to run all of your training
    # code and everything will go very wild and very wrong.
    torch.multiprocessing.set_start_method('forkserver')
    my_main_function()
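If the same script also has to run on machines where forkserver may not exist, the start method can be chosen at runtime. The sketch below relies only on the standard multiprocessing module, on the assumption that torch.multiprocessing accepts the same start method names. The helper name pick_start_method is hypothetical.

```python
import multiprocessing


def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first non-fork start method available on this platform."""
    available = multiprocessing.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    raise RuntimeError("no non-fork start method available: %r" % (available,))


if __name__ == "__main__":
    # In a training script, the result would be passed to
    # torch.multiprocessing.set_start_method(...) before any DataLoader is created.
    print(pick_start_method())
```

The preference order puts forkserver first to match the example above, with spawn as the fallback.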
If you are using Python 2, the solution is to switch to Python 3 (sorry).
Building PyTorch’s accimage
Accimage is not available through pip. If you would like to use it, you need to build it using the following steps:
source myenv/bin/activate
git clone https://github.com/pytorch/accimage
cd accimage
python setup.py build_ext --library-dirs=/opt/intel/compilers_and_libraries_2019.3.199/linux/ipp/lib/intel64_lin
python setup.py install
To use it as the torchvision backend instead of PIL, set this in your Python code:
and add to your SLURM batch script: