Introduction to Horovod
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.
Horovod is more than a wrapper around MPI. It manages the internal work of loss computation and gradient aggregation, and each worker can be assigned a number of processes. Suppose we have two machines with one GPU each and want to run distributed training across them. Each worker (device) must have the data, the code, and a copy of the model weights available locally. When the chief worker starts the training, each worker receives a separate batch of the dataset for gradient computation. The workers then use the ring all-reduce algorithm to synchronize and aggregate the gradients, and each worker updates its local copy of the model weights.
To set up training across multiple machines with multiple GPUs, follow the steps below.
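To make the ring all-reduce step more concrete, here is a small pure-Python simulation. It is only a sketch: the function name ring_allreduce is made up for illustration, the workers are plain lists rather than real processes, and averaging the gradients (rather than summing them) is an assumption about the aggregation. Horovod and NCCL do this work for you in optimized native code.

```python
def ring_allreduce(worker_grads):
    """Simulate ring all-reduce and return every worker's averaged gradient.

    worker_grads: one equal-length list of numbers per worker.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    data = [list(g) for g in worker_grads]  # each worker's local buffer
    # Split the gradient vector into n contiguous chunks.
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]

    def exchange(step, offset, accumulate):
        # Snapshot all outgoing chunks first: real workers send simultaneously.
        messages = []
        for i in range(n):
            chunk = (i + offset - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, data[i][lo:hi]))
        # Worker i always sends to its right-hand neighbour in the ring.
        for i, (chunk, payload) in enumerate(messages):
            dst = (i + 1) % n
            lo, _hi = bounds[chunk]
            for k, value in enumerate(payload):
                if accumulate:
                    data[dst][lo + k] += value  # scatter-reduce: add partial sums
                else:
                    data[dst][lo + k] = value   # all-gather: copy finished chunk

    # Phase 1 (scatter-reduce): after n-1 steps each worker owns one full chunk sum.
    for step in range(n - 1):
        exchange(step, 0, True)
    # Phase 2 (all-gather): circulate the finished chunks to every worker.
    for step in range(n - 1):
        exchange(step, 1, False)

    # Averaging is an assumption here; drop the division to get a plain sum.
    return [[v / n for v in g] for g in data]
```

For two workers with gradients [1.0, 2.0, 3.0, 4.0] and [3.0, 4.0, 5.0, 6.0], both end up with [2.0, 3.0, 4.0, 5.0]. Each worker only ever talks to its neighbour, which is why the amount of data a single worker sends stays roughly constant as more workers are added.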
Horovod distributed training needs full SSH access to the machines.
All machines used in distributed training must have the same OS user, the same path to the code, and the same set of environment paths.
The following are the system specifications of the project on which we ran Horovod. If yours differ, the steps may vary. Please read the documentation and issues for detailed explanations.
OS: Ubuntu 18.04
CUDA version: 10.0
cuDNN version: 7.6
Step No 1:
Use the same OS username on the master and slave machines.
Step No 2:
Set up SSH access between both machines so that machine #1 can access machine #2 without a password and vice versa.
For passwordless SSH, please refer to the guide here.
Step No 3:
Check that the code is fully operational. It should run correctly on its own on each machine before Horovod is added.
Step No 4:
Install the NCCL library that is compatible with your CUDA version.
For example, NCCL 2.4.8-1 is compatible with CUDA 10.0.
It can be downloaded from here.
Download and extract it, then run the following commands from the folder into which you extracted it.
sudo cp nccl_2.4.8-1+cuda10.0_x86_64/lib/libnccl* /usr/lib/x86_64-linux-gnu/
sudo cp nccl_2.4.8-1+cuda10.0_x86_64/include/nccl.h /usr/include/
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu' >> ~/.bashrc
Step No 5:
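Download Open MPI 4.0.0 from here.
Extract it and, from the extracted folder, build and install it with the following commands. The configure and install lines assume an install prefix of $HOME/openmpi, which matches the paths exported below.
./configure --prefix=$HOME/openmpi
make -j 8 all
make install
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib' >> ~/.bashrc
echo 'export PATH=$PATH:$HOME/openmpi/bin' >> ~/.bashrc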
Step No 6:
Install Horovod with pip or pip3 using the following command.
PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 "$HOME/venv-horovod-keras/bin/pip3" install --no-cache-dir horovod
To build Horovod for a different deep learning framework, change the corresponding HOROVOD_WITHOUT_XXXXX variable in the command above to HOROVOD_WITH_XXXXX.
If you use a Python virtual environment, point the command at its pip by changing "$HOME/venv-horovod-keras/bin/pip3".
You can check that MPI is working properly by running the following Horovod functions.
import horovod.tensorflow.keras as hvd
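A minimal sketch of such a check is given below. The helper name describe_worker is hypothetical, and the Horovod module is passed in as a parameter (an illustrative choice, not something Horovod requires) so that the function can also be tried with a stub on a machine where Horovod is not installed. It only uses hvd.init(), hvd.rank(), hvd.local_rank() and hvd.size().

```python
def describe_worker(hvd):
    """Initialize Horovod and report where this process sits in the job.

    hvd: the imported Horovod module, e.g. horovod.tensorflow.keras.
    """
    hvd.init()  # must be called before any other Horovod function
    return {
        "rank": hvd.rank(),              # global index of this process
        "local_rank": hvd.local_rank(),  # index of this process on its machine
        "size": hvd.size(),              # total number of processes in the job
    }
```

In a real script you would pass the module imported above, for example print(describe_worker(hvd)), and launch the script with horovodrun. With two processes, each one should report a different rank and a size of 2; if every process reports a size of 1, the processes are not seeing each other.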
Step No 7:
Make the following changes to the code.
I. Initialize TensorFlow as follows.
II. Apply the initialization in the main function and at the start of any function used for training or model preparation.
III. Set the images per GPU (batch size) and the steps per epoch as follows.
IV. Set the callbacks conditionally so that the model is saved only on the master node, while the slave node can still communicate to send its weight updates.
V. Run the model's fit generator within the Keras session.
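The changes listed above can be sketched roughly as follows. This is not the original project code: the function names prepare_training and steps_per_worker are hypothetical, the session setup assumes TensorFlow 1.x (tf.ConfigProto and tf.Session), and dividing the steps per epoch by the number of workers is a common convention rather than a requirement. The hvd and tf modules are passed in as parameters only so the sketch can be exercised without a GPU.

```python
def steps_per_worker(total_steps, num_workers):
    """Split the single-machine steps per epoch across the workers."""
    return max(1, total_steps // num_workers)


def prepare_training(hvd, tf, optimizer, checkpoint_path, total_steps):
    """Return (optimizer, callbacks, steps_per_epoch) for one Horovod worker.

    hvd: horovod.tensorflow.keras; tf: tensorflow (1.x assumed).
    """
    # I / II. Initialize Horovod before anything else touches the GPU.
    hvd.init()

    # Pin this process to a single GPU, chosen by its rank on the machine.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    tf.keras.backend.set_session(tf.Session(config=config))

    # Wrap the optimizer so gradients are all-reduced across workers.
    optimizer = hvd.DistributedOptimizer(optimizer)

    # IV. Every worker starts from the weights of rank 0 ...
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    # ... but only rank 0 (the master) writes checkpoints to disk.
    if hvd.rank() == 0:
        callbacks.append(tf.keras.callbacks.ModelCheckpoint(checkpoint_path))

    # III. Each worker runs only its share of the steps per epoch.
    return optimizer, callbacks, steps_per_worker(total_steps, hvd.size())
```

In a real script, hvd would be horovod.tensorflow.keras and tf would be tensorflow. The returned optimizer goes into model.compile, and the callbacks and steps per epoch go into model.fit_generator (step V).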
Step No 8:
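Run Horovod on a single node first using the following command. Here -np is the total number of processes and localhost:4 gives the number of slots on this machine; set both to match the number of GPUs you have.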
horovodrun -np 4 -H localhost:4 python train.py
Step No 9:
If the above command runs successfully, create aliases for the master and slave nodes in the SSH config file (~/.ssh/config).
Step No 10:
Find the name of the network interface to pass to the MPI command for master-slave communication.
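Interface names are usually looked up with system tools, but as a quick sketch they can also be listed from Python with the standard library. The helper name list_interfaces is made up for illustration; which of the returned names is the right one depends on which interface is attached to the network shared by both machines.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces on this machine."""
    # socket.if_nameindex() yields (index, name) pairs; keep only the names.
    return [name for _index, name in socket.if_nameindex()]


print(list_interfaces())
```

The loopback interface (typically lo) is not useful for communication between machines, so pick the interface that carries the address the other node can reach.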
Step No 11:
Replicate all the settings, data, and code to the same repository path on the slave machine.
Step No 12:
Run the following command on the master node.
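The exact command depends on your setup, so the following is only a sketch of how such a command could be assembled. It follows the general mpirun pattern from the Horovod documentation for Open MPI. The host aliases master and slave, the interface name eth0, the script name train.py, and the helper build_mpirun_command itself are all assumptions for illustration.

```python
import shlex


def build_mpirun_command(hosts, interface, script="train.py"):
    """Build an mpirun command line for a multi-node Horovod run.

    hosts: list of (hostname, slots) pairs, one slot per GPU.
    interface: network interface name used between the nodes.
    """
    total = sum(slots for _host, slots in hosts)
    host_list = ",".join("%s:%d" % (host, slots) for host, slots in hosts)
    args = [
        "mpirun", "-np", str(total), "-H", host_list,
        "-bind-to", "none", "-map-by", "slot",
        # Use TCP between nodes and restrict it to the chosen interface.
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "-mca", "btl_tcp_if_include", interface,
        # Forward environment variables to the remote processes.
        "-x", "NCCL_SOCKET_IFNAME=" + interface,
        "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "python", script,
    ]
    # Quote each argument so the result can be pasted into a shell.
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical two-machine setup with one GPU each.
print(build_mpirun_command([("master", 1), ("slave", 1)], "eth0"))
```

Printing the command instead of executing it makes it easy to inspect and adjust the flags before launching the job from the master node.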
And voilà!
Distributed training started
DL framework: Keras, TensorFlow
Two epochs running in parallel on both systems
Some further tips:
- Network communication between the master and slave nodes should be fast.
- Please read the MPI documentation, because the command may vary for your specific problem.
- Read the Horovod documentation for updates and any required code changes.