Distributed Training Using Horovod and Keras

Introduction to Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.

Horovod core principles are based on MPI concepts such as sizeranklocal rankallreduceallgather , and, broadcast. See this page for more details.

Horovod is not just a wrapper around the MPI. It provides and manages all internal working of loss computation and aggregation where each worker can be assigned no. of processes. Let’s say I have two machines with 1 GPU, and I want to do distributed training. Each worker or device must have all the data, code, and copy of the model weighs present locally. When the chief worker initiates the training, both workers will have a separate batch of datasets for the gradient computation. Then both workers will use the ring all-reduce algorithm to synchronize and aggregate the gradients. Each worker also updates its local copy of the model weights.

For further guidance and utilizing the multiple machines multiple GPUs, let’s perform the following steps.

Horovod distributed training needs the full ssh access to the machines

All the machines that are to be used in distributed training must contain the same OS user, the same path of the code, and the same set paths.

System Requirements

The following are the system specification for the project on which we have run Horovod. By changing them, the steps may vary. Please read documentation and issues for detailed explanations.

OS: Ubuntu 18.04 

Cuda version: 10.0

Cudnn version: 7.6

Tensorflow: 1.15

Step No 1:

Rename the OS username same for master and slave machines

Step No 2:

Setup the ssh access to both machines so machine #1 can access machine #2 without a password and vice versa.

For the passwordless ssh, please refer to the guide here.

Step No 3:

Check the full operability of code. It should be fully functional independently.

Step No 4:

Install the NCCL library compatible with your Cuda version.

So, for example, NCCL 2.4.8-1 is compatible with the Cuda 10.0 library.

It can be downloaded from here.

Download, extract and execute following commands from the folder.

sudo cp nccl_2.4.8-1+cuda10.0_x86_64/lib/libnccl* /usr/lib/x86_64-linux-gnu/

sudo cp nccl_2.4.8-1+cuda10.0_x86_64/include/nccl.h  /usr/include/

echo ‘export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu’ >> ~/.bashrc

Step No 5:

Download openmpi 4.0.0 from here

Extract and install using the following commands

cd “openmpi-4.0.0

./configure –prefix=$HOME/openmpi

make -j 8 all

make install

echo ‘export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib’ >> ~/.bashrc

echo ‘export PATH=$PATH:$HOME/openmpi/bin’ >> ~/.bashrc

Step No 6:

Install Horovod using following command with pip3 or pip

PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 #HOROVOD_WITHOUT_MXNET=1 “$HOME/venv-horovod-keras/bin/pip3” install –no-cache-dir horovod

You can select the horovod of particular deep learning framework from the above command and change the variable of HOROVOD_WITHOUT_XXXXX to HOROVOD_WITH_XXXXX.

Also, give the relevant path of the virtual python environment, if any, by changing “$HOME/venv-horovod-keras/bin/pip3

You can check whether MPI is working properly by inspecting by executing the following Horovod functions

import horovod.tensorflow.keras as hvd




Step No 7:

Make the following changes to the code

I. Initialize TensorFlow as follows.

Run the Horovod with only one node using the following command

horovodrun -np 4 -H localhost:4 python

Step No 9:

If the above command successfully runs, make an alias of the master and slave nodes on the .ssh config file.

Check the network interface to give into the mpi command for the master-slave communication.

Replicate all the settings, data, and code to the same repository path on the slave machine.

Step No 12:

Run the following command on the master node.

Distributed training started 

Framework: Horovod 

DL framework: Keras, TensorFlow

Two epochs running parallel using both systems

  • Network communication between master and slave nodes should be fast.
  • Please read MPI documentation because the command may vary for your specific problem
  • Read Horovod documentation for the updates and code updation.

Let's make it happen

We love fixing complex problems with innovative solutions. Get in touch to let us know what you’re looking for and our solution architect will get back to you soon.