Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster

https://doi.org/10.1016/j.procs.2017.05.074
Under a Creative Commons license
open access

Abstract

Deep learning algorithms owe their success to high-capacity models with millions of parameters that are tuned in a data-driven fashion. Because these models are trained on millions of examples, the development of more accurate algorithms is usually limited by the throughput of the computing devices on which they are trained. In this work, we explore how the training of a state-of-the-art neural network for computer vision can be parallelized on a distributed GPU cluster. We examine the effect of distributing the training process from two points of view. First, we analyze the scalability of the task and its performance in the distributed setting. Second, we study the impact of distributed training methods on the final accuracy of the models.

Keywords

distributed computing
parallel systems
deep learning
Convolutional Neural Networks
