
1 Introduction

Deep learning trains deep neural networks on huge volumes of data. The training process is compute-intensive and can take weeks or months on a single modern GPU, so many researchers employ distributed training on a server cluster to accelerate it [1].

Model parallelism and data parallelism are the two commonly adopted paradigms for distributed training. Model parallelism splits the model into parts and allocates each part to one GPU [2]. Although model parallelism can speed up training through parallel computing, it has two drawbacks that limit its application. The first is scalability: it is hard to build a generic model-parallelism solution that splits an arbitrary model into balanced parts, assigns them to an appropriate number of GPUs, and still scales well. The second is that model parallelism has a high communication-to-computation ratio, so the communication overhead may cancel out the performance gain. Data parallelism is more widely adopted because of its simplicity and generality: the training dataset is usually large and easy to split into sub-datasets, and each GPU hosts a replica of the model and trains it on its own sub-dataset concurrently.

Various architectures have been proposed for data parallelism, e.g., the parameter server (PS) [3], peer-to-peer, and ring-based structures [4]. The PS architecture has proved effective and is widely adopted [5, 6]. It defines two entities: parameter servers and workers. Parameter servers collect gradient updates from workers and calculate new model parameters from the received gradients. Workers pull the latest parameters from the parameter servers, train their model replicas on their sub-datasets, calculate gradients, and push the gradients to the parameter servers. The gradient update methods between parameter servers and workers can be roughly classified into synchronous and asynchronous methods. In the synchronous method, all workers push gradients to the parameter servers in every training iteration. This method is robust and fast, and has been proved equivalent to standard stochastic gradient descent (SGD) on a single GPU. But it has two issues. One is the traffic burst that occurs when all workers push gradients at roughly the same time. The other is that, if the workers are not homogeneous, the slowest one slows down the overall training process. Asynchronous methods have been proposed to overcome these issues [7]. However, asynchronous methods may suffer from slower convergence or even divergence due to stale gradients [8].
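To make the two roles concrete, the following is a minimal sketch of one synchronous data-parallel training loop in plain Python/NumPy; the toy least-squares objective, worker count, and learning rate are illustrative assumptions, not the setup used later in this paper.

import numpy as np

# Toy synchronous PS loop: each "worker" owns a private data shard (illustrative values).
rng = np.random.default_rng(0)
dim, n_workers, lr = 8, 4, 0.1
shards = [(rng.normal(size=(16, dim)), rng.normal(size=16)) for _ in range(n_workers)]
x = np.zeros(dim)  # model parameters held by the parameter server

def local_gradient(params, shard):
    # Worker-side computation: gradient of 0.5 * ||A p - b||^2 on the local shard.
    A, b = shard
    return A.T @ (A @ params - b) / len(b)

for step in range(100):
    # Every worker pulls the latest parameters and pushes its gradient.
    grads = [local_gradient(x, shard) for shard in shards]
    # Synchronous update: the PS waits for all gradients and averages them.
    x -= lr * np.mean(grads, axis=0)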

In this paper, we propose a method to delay the gradient updates between parameter servers and workers dynamically. Experimental results show that our method increases the distributed training throughput, reduces the network bandwidth requirement, and achieves almost the same accuracy as the synchronous method.

2 Related Works

Many previous works aim at reducing the communication overhead in distributed training. Chen et al. [9] propose a double buffering technique which shows that delayed updates work well. Seide et al. [10] and Strom et al. [11] use a 1-bit SGD method which adds delay to gradient updates. Lin et al. [12] propose a gradient threshold algorithm, which throttles small gradient updates and accumulates them locally. These gradient sparsification techniques can reduce the communication volume and have been validated experimentally, but the convergence of these implicitly delayed methods has not been proved in theory. Agarwal et al. [7] propose explicitly delayed gradient update methods to reduce the communication frequency. For convex optimization problems, it has been proved theoretically that the effect of the delayed gradient update is asymptotically negligible and that the convergence rate scales as \(\mathcal {O} \left( 1/\sqrt{nT} \right) \) for an n-node cluster after T iterations. However, this cyclic delayed method suffers when the computing power of the workers is unbalanced. In the next section, we introduce a dynamic delay based algorithm to overcome these problems and improve the performance.

3 Dynamic Delay Based Cyclic Gradient Update Method

We propose a dynamic delay based cyclic gradient update method, which extends the previous cyclic delayed method. A dynamic delay is applied to the gradient update of each worker. The delay is calculated from the real-time global gradient updating status. This method decouples the cyclic period from the actual delays of the workers.

The conventional cyclic delayed architecture computes the stochastic gradients in parallel and updates the model parameters in sequence. Worker i computes the gradient \(g_{i} \left( t - \tau \right) = \nabla F \left[ x \left( t - \tau \right) \right] \) from the stale parameters \(x \left( t - \tau \right) \) pulled \(\tau \) updates earlier. The central parameter server obtains \(g_{i} \left( t - \tau \right) \) from worker i, computes the updated model \(x \left( t + 1 \right) \) and pushes it back only to worker i. Meanwhile, the other workers do their computations on stale parameters rather than the latest \(x \left( t + 1 \right) \). The delay \(\tau \) comes from the sequential updating of the parameters among the workers, where \(\tau = n - 1\) for an n-worker cluster in the simplest case. The error introduced by \(\tau \) is a second-order effect, which makes the penalty of the delay asymptotically negligible [7].
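As a hedged illustration of this schedule (not the authors' implementation), the toy loop below applies one worker's gradient per step in round-robin order on a shared least-squares objective; each gradient is computed from a parameter copy that is \(\tau = n - 1\) updates old.

import numpy as np

rng = np.random.default_rng(1)
dim, n_workers, lr = 8, 4, 0.05               # tau = n_workers - 1 in the simplest case
A, b = rng.normal(size=(64, dim)), rng.normal(size=64)
grad = lambda p: A.T @ (A @ p - b) / len(b)   # shared toy objective

x = np.zeros(dim)                             # parameters on the PS
local = [x.copy() for _ in range(n_workers)]  # stale copies held by the workers

for t in range(200):
    i = t % n_workers          # the worker whose turn it is at global step t
    g = grad(local[i])         # gradient computed from parameters pulled tau steps ago
    x = x - lr * g             # the PS applies the (stale) gradients sequentially
    local[i] = x.copy()        # only worker i receives the freshly updated parameters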

Fig. 1. Runtime illustration with two workers (Color figure online)

The behavior of each worker can be roughly classified into two phases: a communication phase for gradient synchronization, and a computation phase for gradient calculation and accumulation. Although it is possible to overlap part of the backward computation phase with the communication phase for a single worker, the cyclic method focuses more on overlapping the computation phase of a worker with the communication phases of other workers.

In our proposed method, an additional delay is introduced to improve throughput. Each worker maintains an independent local training pool. While a worker is in its computation phase, it keeps doing local training and gradient accumulation, continuously fetching mini-batches of the dataset to train the local model replica. The following communication phase is dynamically postponed until the worker obtains the token. The delay is adaptive, which helps to maintain a good load balance: a powerful worker does more training (and hence processes more training examples) in its computation phase, and a weak worker does less.

Figure 1 illustrates the runtime behavior of the different gradient update methods. Blue blocks denote communication phases, and green blocks denote training operations. The width denotes the duration of each operation at runtime, which varies because of workload imbalance. In the synchronous method (sync), a strong worker has to wait for a weak worker in every communication phase. The delayed synchronous method (dsync) postpones the synchronizations by a fixed amount of local computation. This additional delay alleviates the load imbalance and reduces the communication volume. In the cyclic method, the round-robin communication phases prevent network traffic bursts. However, the computation phases are not fully utilized if the workers are heterogeneous. Our dynamic delay cyclic method (dcyclic) introduces additional computations on the strong workers, whose computation phases are prolonged to overlap the communication phases of the other workers. The dynamic delay makes full use of every worker's computation power while minimizing the network traffic.

Algorithm 1.1. The dynamic delay based cyclic gradient update method

The dynamic delay cyclic method is described in Algorithm 1.1, where the local variables on the workers are decorated with a tilde. N denotes the number of workers, T denotes the maximum global step, and D denotes the maximum number of accumulations, i.e. the limit on the delay of the communication phase in every training iteration. x denotes the weights, g denotes the gradients, and G denotes the accumulated gradients. The global step t serves as the token for the synchronization communication and is maintained by the PS. The subscript \(-1\) of x is introduced for notational convenience and is unnecessary in the implementation. Each worker implements two operations: the communication operation (remote push to / pull from the PS) and the local computation operation (computing and accumulating gradients).

The communication operation is based on the cyclic delayed method [7]. All workers cooperate in a round-robin order. The worker holding the communication token performs the communication operation, i.e. pushes its gradients and pulls the updated weights. The global step t is then incremented by one, which relays the token to the next worker.
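A minimal sketch of the PS-side bookkeeping as we read it from this description (not the released code): the gradient pushed by the token-holding worker is applied, and incrementing the global step t is what relays the token to the next worker.

import numpy as np

class ParameterServer:
    # Toy PS for the cyclic schedule: worker (t mod N) holds the token at step t.
    def __init__(self, dim, n_workers, lr):
        self.x = np.zeros(dim)     # global model weights
        self.t = 0                 # global step; doubles as the communication token
        self.n = n_workers
        self.lr = lr

    def token_holder(self):
        return self.t % self.n

    def push_pull(self, worker_id, accumulated_grad):
        # Only the worker holding the token may synchronize.
        assert worker_id == self.token_holder()
        self.x -= self.lr * accumulated_grad   # apply the pushed (accumulated) gradient
        self.t += 1                            # relays the token to the next worker
        return self.x.copy(), self.t           # the worker pulls the updated weights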

The dynamic delay occurs in the computation phase of the worker. Compared to the single gradient computation in the conventional cyclic architecture, our dynamic cyclic method enables additional gradient computations and accumulations before the worker obtains the token. The number of accumulations is adaptive at runtime and is bounded by the predefined maximum delay D. When \(D = 1\), the method falls back to the conventional cyclic delayed method [7]. When \(D > 1\), local updates and gradient accumulations are activated. During the computation phase, while processing new mini-batches, the worker keeps monitoring the global step t. As soon as it obtains the communication token, the worker aborts the remnant local operations so that the communication operation can start as early as possible.
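The worker-side computation phase can then be sketched as below, reusing the toy ParameterServer above and assuming the PS and the other workers run concurrently in other processes; grad_fn and next_batch are hypothetical stand-ins for the model's gradient computation and data pipeline, and breaking after a whole mini-batch is a simplification of aborting the remnant local operations.

import numpy as np

def computation_phase(worker_id, ps, local_x, grad_fn, next_batch, D):
    # Accumulate between 1 and D local gradients, stopping early once the token arrives.
    G = np.zeros_like(local_x)                 # gradient accumulator pushed to the PS
    for _ in range(D):                         # D bounds the dynamic delay
        g = grad_fn(local_x, next_batch())     # one more local mini-batch
        local_x = local_x - ps.lr * g          # local update of the model replica
        G += g                                 # accumulate for the eventual push
        if ps.token_holder() == worker_id:     # token obtained: abort remaining work
            break
    while ps.token_holder() != worker_id:      # communication phase: wait for the token
        pass                                   # (in practice, poll or block on the PS)
    local_x, _ = ps.push_pull(worker_id, G)    # push accumulated gradients, pull weights
    return local_x

With D = 1 this reduces to the conventional cyclic method; with D > 1 a fast worker simply runs more iterations of the local loop before its token arrives.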

The dynamic delay cyclic method brings two benefits. One is improved throughput (e.g. in examples/second) due to the gradient accumulations. By doing as much training as possible in the computation phase, device utilization is improved, so the total processing time for the same number of examples is decreased. The other benefit is convergence preservation. Being able to abort the computation helps to bound the actual delay and the staleness of the gradients, even when the predefined D is large, which helps the training reach convergence.

4 Experimental Results

In this section, the dynamic delay cyclic method is evaluated with two large-scale datasets.

4.1 Datasets and Experiment Setup

Two datasets are selected for the evaluations. One is the ILSVRC2012 dataset [13], which focuses on image object classification. The training set contains 1.2 million images and the validation set contains 150 thousand images, both labeled with the presence or absence of 1000 object categories. The ResNet-V2-50 model [14] is adopted for the classification task. The other dataset is from WMT'15 [15]: the \(10^{9}\) Word Parallel Corpus is used for training and the updated News Crawl development set for validation. The training corpus consists of over 22 million sentences, and the validation corpus consists of 3 thousand sentences. Both focus on the machine translation task for the French–English pair. The Seq2Seq model [16] is adopted for the translation task.

Fig. 2. Experiment setup. Workers are bound to different GPUs inside one node. All workers connect directly to the PS on the other node. The traffic goes over the network in the same manner as in a distributed cluster

Two computing nodes are utilized for all experiments. Both nodes are equipped with dual Intel Xeon E5-2600v4 CPUs, 512 GB of memory and a Mellanox 40 Gbit/s network adapter. One node has 4 NVIDIA Tesla P100 GPUs, and the other node has no GPU. The distributed computing environment is simulated with these two nodes by making use of GPU affinity, as illustrated in Fig. 2. The worker processes are bound to different GPUs, while the PS process is launched on the other node. The PS and the workers communicate through the network adapter, in the same way as in a real distributed cluster.
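For completeness, one hedged way to realize this binding is to restrict each worker process to a single device through CUDA_VISIBLE_DEVICES; the script name and flag below are hypothetical.

import os
import subprocess

# Hypothetical launcher on the GPU node: pin each worker process to one GPU.
# The PS process is started separately on the CPU-only node.
for gpu_id in range(4):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    subprocess.Popen(["python", "train_worker.py",        # hypothetical worker script
                      "--task_index", str(gpu_id)], env=env)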

4.2 Algorithm and Implementation

We compare our method with the cyclic and the delayed synchronous methods, and take the vanilla synchronous method as the baseline. Workers in the cyclic method update the parameters in a round-robin order [7]. In the vanilla synchronous method, the weights on the PS are updated with the gradients received from all workers at around the same time. In the delayed synchronous method, the gradients are first accumulated and applied to the local model replicas; the gradient update to the PS then works in the same way as in the vanilla synchronous method.

We implement these four methods with the PS architecture [17], where the server maintains the parameters and the workers do the computations. Data movement is managed automatically by TensorFlow [6] through the implicit insertion of nodes into the computation graph.
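As an illustration only (the host names and ports are hypothetical, and the paper does not publish its configuration), a TensorFlow 1.x-style PS/worker cluster can be wired roughly as follows.

import tensorflow as tf  # TensorFlow 1.x style API

# One PS process on the CPU node, four worker processes on the GPU node (hypothetical).
cluster = tf.train.ClusterSpec({
    "ps":     ["ps-node:2222"],
    "worker": ["gpu-node:2230", "gpu-node:2231", "gpu-node:2232", "gpu-node:2233"],
})
job_name, task_index = "worker", 0      # set per process via flags or environment
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                       # the PS process only serves the variables
else:
    # Variables are placed on the PS and ops on the local GPU by the device setter.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d/gpu:0" % task_index,
            cluster=cluster)):
        pass  # build the model graph here; the chosen gradient update method is applied on top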

The ResNet-V2-50 model is trained with the Nesterov accelerated gradient (NAG) method [18, 19] for 80 epochs with a batch size of 32, a momentum of 0.9 and an initial learning rate of 0.005. The learning rate is exponentially decayed by a factor of 0.1 every 20 epochs. Learning rate warmup [20] is applied in the synchronous methods in order to accelerate convergence. Vanilla SGD is used to train the Seq2Seq model for one epoch with a batch size of 64. The learning rate starts at 0.02 and decays by a factor of 0.99 every 0.01 epoch. Learning rate warmup is not used for the Seq2Seq model.
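For reference, the two decay schedules described above can be written as small helpers (our own illustration, assuming a staircase-style decay; not the training script).

def resnet_lr(epoch, base_lr=0.005, decay=0.1, step_epochs=20):
    # Decay the ResNet learning rate by a factor of 0.1 every 20 epochs.
    return base_lr * decay ** (epoch // step_epochs)

def seq2seq_lr(epoch_fraction, base_lr=0.02, decay=0.99, step=0.01):
    # Decay the Seq2Seq learning rate by a factor of 0.99 every 0.01 epoch.
    return base_lr * decay ** int(epoch_fraction / step)

print(resnet_lr(45))      # 0.005 * 0.1**2 = 5e-05
print(seq2seq_lr(0.5))    # 0.02 * 0.99**50, roughly 0.0121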

4.3 Results

We first investigate the convergence behavior of the different methods. The convergence curves for training (dashed) and validation (solid) are plotted in Fig. 3. The columns from left to right show the synchronous (blue), delayed synchronous (green), cyclic (red) and dynamic delay cyclic (cyan) methods. The top-5 error of the ResNet model is on the top, and the perplexity of the Seq2Seq model is at the bottom. The actual number of gradient accumulations is tuned to be the same during the training of each model.

Fig. 3. The convergence after a fixed number of epochs. The top row presents the top-5 error of the ResNet model, and the bottom row shows the perplexity of the Seq2Seq model. The columns indicate, from left to right, the synchronous (blue), delayed synchronous (green), cyclic (red) and dynamic delay cyclic (cyan) methods. The dashed lines denote training and the solid lines denote validation. (Color figure online)

Fig. 4. The wall-clock time of the different methods. The left panel shows the top-5 error of ResNet, and the right panel shows the perplexity of Seq2Seq. The vanilla synchronous SGD (blue) is taken as the baseline. The cyclic method (red) finishes after a long time due to low device utilization. The delayed synchronous SGD (green) converges slowly within the limited number of epochs. The dynamic delay cyclic method (cyan) converges in less wall-clock time because of its high throughput. (Color figure online)

On the ResNet model, the cyclic methods achieve the same performance as the synchronous method. The rate of convergence is not impacted by the inherent gradient staleness of the round-robin order. The additional gradient accumulations limit the rate of convergence of the delayed synchronous method, yet they have little effect on the dynamic delay cyclic method.

On the Seq2Seq model, the cyclic methods perform better than the synchronous methods: the perplexity converges quickly to a lower value within the limited number of epochs. The additional accumulations impact the convergence rate negatively in the delayed synchronous method. These results show that the dynamic delay cyclic method is more robust to the staleness of the gradient information than the delayed synchronous method.

The gradient accumulations improve the throughput significantly. In the delayed methods, the synchronizations are postponed by the local operations on the workers. This delay reduces the communication-to-computation ratio and increases the utilization of the computing devices, which leads to a higher throughput, as illustrated in Fig. 4. Large datasets are thus trained to the same state of convergence in less time with the dynamic delay cyclic method.

Fig. 5. The actual network bandwidth consumption under the different update methods. To achieve the same state of convergence, the dynamic delay cyclic method requires less network traffic than the vanilla synchronous and cyclic methods. (Color figure online)

The network traffic is reduced by the dynamic delay cyclic method. The delay reduces both the communication frequency and the total communication volume. In the cyclic methods, the PS responds to only one worker at a time, and the rolling of the communication token prevents the traffic bursts seen in the synchronous methods, which reduces the network requirements. In our experiments, the delayed and cyclic methods significantly reduce the network traffic, as shown in Fig. 5. The dynamic delay cyclic method preserves the convergence and requires less network traffic than the synchronous and the cyclic methods.

5 Conclusions and Discussions

We propose a dynamic delay based cyclic gradient update method, which benefits from the cyclic gradient update architecture and local gradient accumulations. The network traffic burst is relieved by the round-robin update order, and the communication volume and frequency are reduced by the explicit delay of the gradient updates. The method preserves the rate of convergence by restricting the duration between synchronizations, and improves throughput by dynamically extending the actual delay. The wall-clock time of training on large datasets is reduced.

The cyclic methods make full use of the gradients computed from every mini-batch of examples. The gradients are not only employed to update the local model replicas, but are also accumulated to update the global model on the PS. A fixed learning rate (perhaps with decay) is more suitable for these aggressive methods.

The actual delay is bounded to prevent the convergence problems arising from gradient staleness. The PS cycles the refresh of the local replicas among all workers in the cluster, so the duration of the computation phase scales linearly with the number of workers, which follows from the round-robin nature of the cyclic methods. An oversized delay may limit the convergence rate because of gradient staleness, so the delay bound should be chosen to accelerate training and preserve convergence simultaneously.