TensorFlow

These pages are currently a work in progress. If you would like to contribute some information, please get in touch by email ([email protected]).

Installing TensorFlow

We recommend installing TensorFlow in a separate virtual environment for each project. For example:

cd myproject
virtualenv3 env
source env/bin/activate
pip install tensorflow-gpu

Important: To use TensorFlow on the GPU, you must load the right versions of CUDA and cuDNN. Go to https://www.tensorflow.org/install/gpu and note the versions of the CUDA toolkit and the cuDNN SDK listed in the “Software Requirements” section.
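
To read the requirements for the right release, it helps to know exactly which TensorFlow version pip installed. The following is a minimal sketch, assuming the package name tensorflow-gpu from the install command above; the helper name installed_version is hypothetical and only filters the output of pip show.

```shell
#!/bin/bash
# Hypothetical helper: read `pip show` output on stdin and print the
# value of its "Version:" field.
installed_version() {
    awk '/^Version:/ { print $2 }'
}

# Typical use, inside the activated virtual environment:
#   pip show tensorflow-gpu | installed_version
```

The printed version number is the one to look up on TensorFlow's website.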

On NCC, CUDA and cuDNN are bundled together and provided through a module system, and you must explicitly load the right module. To find the versions of CUDA and cuDNN available on NCC, run:

module available

Find the module which matches the version numbers that you found earlier on TensorFlow’s website. If you cannot find the appropriate module, please contact [email protected], as this might happen if TensorFlow was very recently upgraded to a new version of CUDA and cuDNN.
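
The module list can be long, so it may be convenient to filter it. The sketch below assumes that the CUDA modules have "cuda" and the version number in their names; the helper name cuda_modules_for is hypothetical. It reads a module listing on standard input and keeps only the entries that mention the requested CUDA version.

```shell
#!/bin/bash
# Hypothetical helper: keep only module names that mention a given CUDA version.
# Reads a module listing on stdin; $1 is the CUDA version, e.g. 10.0.
cuda_modules_for() {
    local version="$1"
    # Put one module name per line, keep the CUDA ones, then match the
    # version as a literal string (-F), so "." is not treated as a wildcard.
    tr -s '[:space:]' '\n' | grep -i 'cuda' | grep -F -- "$version"
}

# On the cluster (module systems usually write the listing to stderr,
# hence the 2>&1):
#   module available 2>&1 | cuda_modules_for 10.0
```

If nothing is printed, no installed module matches that version.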

Configuring TensorFlow

To run TensorFlow on a GPU node, submit a SLURM job script along the following lines:

#!/bin/bash
#SBATCH -N 1
#SBATCH -c required-amount-of-core
#SBATCH --gres=gpu
#SBATCH -p please-pick-a-partition
#SBATCH --qos=please-pick-a-qos
#SBATCH -t please-pick-a-walltime
#SBATCH --job-name=please-pick-a-jobname
#SBATCH --mem=required-amount-of-ramG

source /etc/profile
source env/bin/activate
module load the-cuda-module-needed-by-your-version-of-tensorflow

python your-job-script.py ...

Please replace the following placeholders with the relevant details: required-amount-of-core, please-pick-a-partition, please-pick-a-qos, please-pick-a-walltime, please-pick-a-jobname, required-amount-of-ram. The number of cores should reflect the number of TensorFlow input-loading threads. A good rule of thumb is to have twice as many workers as cores. Replace the-cuda-module-needed-by-your-version-of-tensorflow with the name of the CUDA/cuDNN module that you identified during installation.
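
The rule of thumb can be applied inside the job script itself rather than hard-coded. This is a minimal sketch: SLURM sets SLURM_CPUS_PER_TASK when cores are requested with -c, while the helper name workers_for_cores and the --workers option are hypothetical (the latter would be an option of your own script, not of TensorFlow).

```shell
#!/bin/bash
# Hypothetical helper: twice as many workers as cores.
workers_for_cores() {
    echo $(( $1 * 2 ))
}

# SLURM_CPUS_PER_TASK is set when the job requests cores with -c;
# fall back to 1 so the snippet also works outside a job.
cores="${SLURM_CPUS_PER_TASK:-1}"
workers=$(workers_for_cores "$cores")

# --workers is an assumed option of your own job script:
#   python your-job-script.py --workers "$workers"
```

Deriving the value from the allocation keeps the worker count consistent if the -c request is changed later.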

Warning: If you see error messages toward the top of your log file such as:

2019-11-26 14:41:44.207247: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/cuda-10.1-cudnn7.6/lib64::/usr/local/lib:/usr/local/lib
2019-11-26 14:41:44.207610: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/cuda-10.1-cudnn7.6/lib64::/usr/local/lib:/usr/local/lib
2019-11-26 14:41:44.208072: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/cuda-10.1-cudnn7.6/lib64::/usr/local/lib:/usr/local/lib
2019-11-26 14:41:44.208460: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/cuda-10.1-cudnn7.6/lib64::/usr/local/lib:/usr/local/lib
2019-11-26 14:41:44.208972: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/cuda-10.1-cudnn7.6/lib64::/usr/local/lib:/usr/local/lib
2019-11-26 14:41:44.209355: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/cuda-10.1-cudnn7.6/lib64::/usr/local/lib:/usr/local/lib
2019-11-26 14:41:45.176806: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-26 14:41:45.176841: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.

This means that you have loaded the wrong version of CUDA. In the example above, TensorFlow is looking for the CUDA 10.0 libraries (libcudart.so.10.0 and so on), but LD_LIBRARY_PATH shows that the CUDA 10.1 module is loaded. TensorFlow might still be able to run, but it will not use the GPU. This is the most common pitfall on NCC. If this happens, check the version of CUDA/cuDNN required by TensorFlow at https://www.tensorflow.org/install/gpu and load the appropriate module from the list returned by module available.
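
The CUDA version that TensorFlow expects can also be read directly from such a log. The sketch below assumes the warning format shown above; the helper name required_cuda_from_log is hypothetical. It extracts the version suffix of the first libcudart library name that it finds.

```shell
#!/bin/bash
# Hypothetical helper: print the CUDA version TensorFlow tried to load,
# taken from the first libcudart.so.<version> mentioned in a log on stdin.
required_cuda_from_log() {
    grep -o 'libcudart\.so\.[0-9][0-9.]*' | head -n 1 | sed 's/^libcudart\.so\.//'
}

# Typical use, assuming the default SLURM output file name:
#   required_cuda_from_log < slurm-<jobid>.out
```

The printed number is the CUDA version to look for in the module list.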