Distributed Training Using Horovod and Keras

Introduction to Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.

Horovod's core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather, and broadcast. See this page for more details.

Horovod is not just a wrapper around MPI. It provides and manages all the internal workings of loss computation and gradient aggregation, and each worker can be assigned a number of processes. Let's say I have two machines with one GPU each, and I want to do distributed training. Each worker or device must have all the data, the code, and a copy of the model weights present locally. When the chief worker initiates training, each worker receives a separate batch of the dataset for gradient computation. The workers then use the ring all-reduce algorithm to synchronize and aggregate the gradients, and each worker updates its local copy of the model weights.
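To make the allreduce idea concrete, here is a toy illustration (a minimal sketch, not part of the original setup; the file name allreduce_demo.py is just a placeholder). Each worker contributes its own rank, and allreduce returns the average across all workers. Run it with: horovodrun -np 2 python allreduce_demo.py

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod; each process gets its own rank.
hvd.init()

# Each worker contributes its rank; allreduce averages the values across workers.
local_value = tf.constant(float(hvd.rank()))
averaged = tf.identity(hvd.allreduce(local_value))

with tf.Session() as sess:
    print('rank', hvd.rank(), '-> average =', sess.run(averaged))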

For further guidance on utilizing multiple machines with multiple GPUs, let's perform the following steps.

Horovod distributed training needs full SSH access to the machines.

All the machines to be used in distributed training must have the same OS user, the same code path, and the same environment paths.

System Requirements

The following are the system specifications of the setup on which we have run Horovod; if yours differ, the steps may vary. Please read the documentation and issue trackers for detailed explanations.

OS: Ubuntu 18.04

CUDA version: 10.0

cuDNN version: 7.6

TensorFlow: 1.15

Step No 1:

Use the same OS username on both the master and slave machines.

Step No 2:

Set up SSH access on both machines so that machine #1 can access machine #2 without a password, and vice versa.

For passwordless SSH, please refer to the guide here.
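As a quick reference, the usual approach (assuming OpenSSH with default key locations; the user and host names below are placeholders) is:

ssh-keygen -t rsa                 # on machine #1, accept the defaults
ssh-copy-id user@machine2         # copy the public key to machine #2

ssh-keygen -t rsa                 # on machine #2
ssh-copy-id user@machine1         # copy the public key to machine #1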

Step No 3:

Check the full operability of code. It should be fully functional independently.

Step No 4:

Install the NCCL library compatible with your CUDA version.

For example, NCCL 2.4.8-1 is compatible with CUDA 10.0.

It can be downloaded from here.

Download and extract it, then execute the following commands from the extracted folder.

sudo cp nccl_2.4.8-1+cuda10.0_x86_64/lib/libnccl* /usr/lib/x86_64-linux-gnu/

sudo cp nccl_2.4.8-1+cuda10.0_x86_64/include/nccl.h  /usr/include/

echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu' >> ~/.bashrc
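Optionally, you can refresh the shared-library cache so the linker picks up the copied NCCL files:

sudo ldconfig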

Step No 5:

Download Open MPI 4.0.0 from here.

Extract it and install it using the following commands:

cd openmpi-4.0.0

./configure --prefix=$HOME/openmpi

make -j 8 all

make install

echo ‘export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib’ >> ~/.bashrc

echo ‘export PATH=$PATH:$HOME/openmpi/bin’ >> ~/.bashrc
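To confirm that Open MPI is installed and on your PATH, open a new shell (or source ~/.bashrc) and run:

mpirun --version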

Step No 6:

Install Horovod using the following command with pip3 or pip:

PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 "$HOME/venv-horovod-keras/bin/pip3" install --no-cache-dir horovod

You can select which deep learning framework(s) to build Horovod for by changing the corresponding HOROVOD_WITHOUT_XXXXX variable to HOROVOD_WITH_XXXXX in the command above.

Also, give the relevant path of your Python virtual environment, if any, by changing "$HOME/venv-horovod-keras/bin/pip3".

You can check whether MPI is working properly by executing the following Horovod functions:

import horovod.tensorflow.keras as hvd

# Horovod must be initialized before querying the size or ranks.
hvd.init()

print(hvd.size())
print(hvd.rank())
print(hvd.local_rank())
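Note that these report a size of 1 when the script is run directly; to exercise MPI across processes, launch it through Horovod, for example (check_hvd.py is just a placeholder file name):

horovodrun -np 2 python check_hvd.py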

Step No 7:

Make the following changes to the code

I. Initialize Horovod and the TensorFlow session so that each process is pinned to its own GPU.

II. Apply the Horovod changes in the main function and wherever training is initialized and the model is prepared (optimizer wrapping and learning-rate scaling).
III. Set the images per GPU / batch size accordingly, and divide the steps per epoch by the number of workers.
IV. Set callbacks with conditions so that the model is saved only on the master node, while the slave nodes communicate their weight updates.
V. Run the model fit generator with the Keras session.
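Below is a minimal sketch combining points I–V, assuming TensorFlow 1.15 with horovod.tensorflow.keras. The model, data generator, and hyperparameters are placeholders used only to keep the example self-contained; substitute your own pipeline.

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# I. Initialize Horovod and pin each process to one GPU via its local rank.
hvd.init()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
tf.keras.backend.set_session(tf.Session(config=config))

# Placeholder data and model -- replace with your own.
features = np.random.rand(1024, 32).astype('float32')
labels = np.random.randint(0, 10, size=(1024,))
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

def batch_generator(batch_size):
    # Yields random batches indefinitely; stands in for your real data pipeline.
    while True:
        idx = np.random.randint(0, len(features), size=batch_size)
        yield features[idx], labels[idx]

# II. Scale the learning rate by the number of workers and wrap the optimizer.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

# III. Each worker runs only its share of the steps per epoch.
batch_size = 32
steps_per_epoch = len(features) // batch_size // hvd.size()

# IV. Broadcast the initial weights from rank 0; save checkpoints only on the master.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('checkpoint-{epoch}.h5'))

# V. Train with fit_generator using the Keras session configured above.
model.fit_generator(batch_generator(batch_size),
                    steps_per_epoch=steps_per_epoch,
                    epochs=2,
                    callbacks=callbacks,
                    verbose=1 if hvd.rank() == 0 else 0)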
Step No 8:

Run Horovod on a single node first, using the following command (adjust -np and the localhost:4 slot count to the number of processes/GPUs on that machine):

horovodrun -np 4 -H localhost:4 python train.py

Step No 9:

If the above command runs successfully, create aliases for the master and slave nodes in the ~/.ssh/config file.
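A minimal sketch of such aliases (the host names, IP addresses, and user below are placeholders):

Host master
    HostName 192.168.1.10
    User ubuntu

Host slave
    HostName 192.168.1.11
    User ubuntu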

Step No 10:

Check which network interface to pass into the MPI command for master-slave communication.
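For reference, you can list the interfaces and their addresses with:

ip addr

The interface name you pick (for example eth0, used as a placeholder in the later steps) is what gets passed to MPI and NCCL.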

Step No 11:

Replicate all the settings, data, and code to the same repository path on the slave machine.

Step No 12:

Run the following command on the master node.
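The exact command is not reproduced here; a typical invocation for two machines with one GPU each, assuming the master and slave aliases from Step 9 and eth0 as the network interface from Step 10 (both placeholders), looks something like this:

horovodrun -np 2 -H master:1,slave:1 python train.py

or, with mpirun directly, passing the network interface explicitly:

mpirun -np 2 -H master:1,slave:1 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python train.py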

And voilà!

Distributed training started 

Framework: Horovod 

DL framework: Keras, TensorFlow

Two epochs running in parallel across both systems

Some further tips:
  • Network communication between master and slave nodes should be fast.
  • Please read the MPI documentation, because the command may vary for your specific problem.
  • Read the Horovod documentation for updates and code changes.
