Introduction to Horovod
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.
Horovod core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather , and, broadcast. See this page for more details.
Horovod is not just a wrapper around the MPI. It provides and manages all internal working of loss computation and aggregation where each worker can be assigned no. of processes. Let’s say I have two machines with 1 GPU, and I want to do distributed training. Each worker or device must have all the data, code, and copy of the model weighs present locally. When the chief worker initiates the training, both workers will have a separate batch of datasets for the gradient computation. Then both workers will use the ring all-reduce algorithm to synchronize and aggregate the gradients. Each worker also updates its local copy of the model weights.
For further guidance and utilizing the multiple machines multiple GPUs, let’s perform the following steps.
Horovod distributed training needs the full ssh access to the machines
All the machines that are to be used in distributed training must contain the same OS user, the same path of the code, and the same set paths.
The following are the system specification for the project on which we have run Horovod. By changing them, the steps may vary. Please read documentation and issues for detailed explanations.
OS: Ubuntu 18.04
Cuda version: 10.0
Cudnn version: 7.6
Step No 1:
Rename the OS username same for master and slave machines
Step No 2:
Setup the ssh access to both machines so machine #1 can access machine #2 without a password and vice versa.
For the passwordless ssh, please refer to the guide here.
Step No 3:
Check the full operability of code. It should be fully functional independently.
Step No 4:
Install the NCCL library compatible with your Cuda version.
So, for example, NCCL 2.4.8-1 is compatible with the Cuda 10.0 library.
It can be downloaded from here.
Download, extract and execute following commands from the folder.
sudo cp nccl_2.4.8-1+cuda10.0_x86_64/lib/libnccl* /usr/lib/x86_64-linux-gnu/
sudo cp nccl_2.4.8-1+cuda10.0_x86_64/include/nccl.h /usr/include/
echo ‘export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu’ >> ~/.bashrc
Step No 5:
Download openmpi 4.0.0 from here
Extract and install using the following commands
make -j 8 all
echo ‘export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib’ >> ~/.bashrc
echo ‘export PATH=$PATH:$HOME/openmpi/bin’ >> ~/.bashrc
Step No 6:
Install Horovod using following command with pip3 or pip
PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 #HOROVOD_WITHOUT_MXNET=1 “$HOME/venv-horovod-keras/bin/pip3” install –no-cache-dir horovod
You can select the horovod of particular deep learning framework from the above command and change the variable of HOROVOD_WITHOUT_XXXXX to HOROVOD_WITH_XXXXX.
Also, give the relevant path of the virtual python environment, if any, by changing “$HOME/venv-horovod-keras/bin/pip3“
You can check whether MPI is working properly by inspecting by executing the following Horovod functions
import horovod.tensorflow.keras as hvd
Step No 7:
Make the following changes to the code
I. Initialize TensorFlow as follows.
Run the Horovod with only one node using the following command
horovodrun -np 4 -H localhost:4 python train.py
Step No 9:
If the above command successfully runs, make an alias of the master and slave nodes on the .ssh config file.
Check the network interface to give into the mpi command for the master-slave communication.
Replicate all the settings, data, and code to the same repository path on the slave machine.
Step No 12:
Run the following command on the master node.
Distributed training started
DL framework: Keras, TensorFlow
Two epochs running parallel using both systems
- Network communication between master and slave nodes should be fast.
- Please read MPI documentation because the command may vary for your specific problem
- Read Horovod documentation for the updates and code updation.