Abstract
Driven by big data, neural networks are becoming increasingly complex, and the computing capacity of a single machine often cannot meet the demand. Distributed deep learning has shown great performance advantages in handling this problem. However, a serious issue in this field is the existence of stragglers, which significantly restricts the performance of the whole system. Fully exploiting the computing capacity of a system built on the parameter server architecture, especially in a heterogeneous environment, remains an enormous challenge. Motivated by this, we design a method named EP4DDL that minimizes the impact of the straggler problem through load balancing. From a statistical view, the approach introduces a novel metric named performance variance to give a comprehensive inspection of stragglers and applies flexible parallelism to each node. We verify the algorithm on standard benchmarks and demonstrate that it reduces training time by 57.46%, 24.8%, and 11.5% compared with FlexRR, Con-SGD, and Falcon, respectively, without accuracy loss.
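As a concrete illustration of the load-balancing idea described above, the following minimal Python sketch shows one way a parameter server could monitor per-worker iteration times and redistribute a fixed global batch in proportion to measured speed and stability. The scoring formula and the helper names (performance_scores, rebalance_batches) are illustrative assumptions, not the paper's actual definition of performance variance or its flexible-parallelism mechanism.

```python
import numpy as np

def performance_scores(iter_times):
    """Hypothetical per-worker score from recent iteration times (seconds).

    `iter_times` is a (workers x iterations) array of measured step times.
    Workers that are both fast (low mean) and stable (low deviation) score
    higher; this is only a stand-in for the paper's performance-variance metric.
    """
    mean_t = iter_times.mean(axis=1)
    std_t = iter_times.std(axis=1)
    return 1.0 / (mean_t + std_t)

def rebalance_batches(iter_times, global_batch):
    """Split a fixed global batch across workers in proportion to their scores."""
    scores = performance_scores(iter_times)
    shares = scores / scores.sum()
    sizes = np.floor(shares * global_batch).astype(int)
    sizes[np.argmax(shares)] += global_batch - sizes.sum()  # absorb rounding drift
    return sizes

if __name__ == "__main__":
    # Three workers, five recorded iterations; worker 2 is a straggler.
    times = np.array([[0.9, 1.0, 1.1, 1.0, 1.0],
                      [1.0, 1.1, 0.9, 1.0, 1.0],
                      [2.8, 3.1, 2.9, 3.0, 3.2]])
    print(rebalance_batches(times, global_batch=512))  # e.g. [220 218  74]
```

In a real deployment, the server would refresh these statistics over a sliding window of recent iterations and push the new batch sizes to workers before the next round; EP4DDL's flexible parallelism adjusts more than batch sizes, so this sketch only conveys the general load-balancing principle.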
References
Zhong Y, Oh S, Moon HC (2021) Service transformation under industry 4.0: investigating acceptance of facial recognition payment through an extended technology acceptance model. Technol Soc 64:101515
Stewart R, Velupillai S (2021) Applied natural language processing in mental health big data. Neuropsychopharmacology 46(1):252
Lanctot M, Lockhart E, Lespiau JB et al (2019) OpenSpiel: a framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453
Peng Y, Bao Y, Chen Y et al (2021) DL2: a deep learning-driven scheduler for deep learning clusters. IEEE Trans Parallel Distrib Syst 32(8):1947–1960
Jiang J, Cui B, Zhang C et al (2017) Heterogeneity-aware distributed parameter servers. In: Proceedings of the ACM International Conference on Management of Data, pp 463–478
Ho Q, Cipar J, Cui H et al (2013) More effective distributed ML via a stale synchronous parallel parameter server. Adv Neural Inf Process Syst 26:1223
Zhou Q, Guo S, Lu H et al (2020) Falcon: addressing stragglers in heterogeneous parameter server via multiple parallelism. IEEE Trans Comput 70(1):139–155
Gill SS, Ouyang X, Garraghan P (2020) Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres. J Supercomput 76(12):10050–10089
Harlap A, Cui H, Dai W et al (2016) Addressing the straggler problem for iterative convergent parallel ML. In: Proceedings of the Seventh ACM Symposium on Cloud Computing, pp 98–111
Kishor A, Chakraborty C, Jeberson W (2021) A novel fog computing approach for minimization of latency in healthcare using machine learning. Int J Interact Multimed Artif Intell 6(Special Issue on Current Trends in Intelligent Multimedia Processing Systems):7–17
Benalla M (2016) A distributed intelligent system for emergency convoy. Int J Interact Multimed Artif Intell 4:1
Aktas MF, Peng P, Soljanin E (2017) Effective straggler mitigation: which clones should attack and when? ACM SIGMETRICS Perform Eval Rev 45(2):12–14
Zhang J, Simeone O (2020) LAGC: lazily aggregated gradient coding for straggler-tolerant and communication-efficient distributed learning. IEEE Trans Neural Netw Learn Syst 32(3):962–974
Bitar R, Wootters M, El Rouayheb S (2020) Stochastic gradient coding for straggler mitigation in distributed learning. IEEE J Sel Areas Inf Theor 1(1):277–291
Guo Y, Rao J, Jiang C et al (2016) Moving hadoop into the cloud with flexible slot management and speculative execution. IEEE Trans Parallel Distrib Syst 28(3):798–812
Huang Y, Jin T, Wu Y et al (2018) FlexPS: flexible parallelism control in parameter server architecture. Proc VLDB Endow 11(5):566–579
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
LeCun Y, Cortes C, Burges CJC. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
Krizhevsky A, Nair V, Hinton G. The CIFAR-10 dataset. cs.toronto.edu/~kriz/cifar.html
Huang Y, Cheng Y, Bapna A et al (2018) GPipe: efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965
Dean J, Corrado GS, Monga R et al (2012) Large scale distributed deep networks. Adv Neural Inf Process Syst 25
Wu X, Xu H, Li B et al (2020) Stanza: layer separation for distributed training in deep learning. IEEE Trans Serv Comput
Geng J, Li D, Wang S (2020) Fela: incorporating flexible parallelism and elastic tuning to accelerate large-scale DML. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, pp 1393–1404
Chen J, Pan X, Monga R et al (2016) Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981
Zheng S, Meng Q, Wang T et al (2017) Asynchronous stochastic gradient descent with delay compensation. In: International Conference on Machine Learning. PMLR, pp 4120–4129
Costantini S, De Gasperis G, De Lauretis L (2021) An application of declarative languages in distributed architectures: ASP and DALI microservices. Int J Interact Multimed Artif Intell 6(Special Issue on Artificial Intelligence, Paving the Way to the Future):66–78
Niu F, Recht B, Re C et al (2011) HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent. Adv Neural Inf Process Syst 24:693–701
Zhang W, Gupta S, Lian X et al (2016) Staleness-aware async-SGD for distributed deep learning. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp 2350–2356
Chen M, Mao B, Ma T (2021) FedSA: a staleness-aware asynchronous federated learning algorithm with non-IID data. Future Gener Comput Syst 120:1–12
Khaleghzadeh H, Manumachu RR, Lastovetsky A (2018) A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms. IEEE Trans Parallel Distrib Syst 29(10):2176–2190
Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Chen C et al (2018) Fast distributed deep learning via worker-adaptive batch sizing. In: Proceedings of the ACM Symposium on Cloud Computing
Acknowledgements
We would like to thank the anonymous reviewers, whose insightful comments greatly improved the quality of this paper. The work described in this paper was supported in part by the Key Basic Research Program of the China Basic Strengthening Program (2019-JCJQ-ZD-041) and the National Key Research and Development Program of China (2016YFB0200902).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ji, Z., Zhang, X., Li, J. et al. EP4DDL: addressing straggler problem in heterogeneous distributed deep learning. J Supercomput 78, 15663–15680 (2022). https://doi.org/10.1007/s11227-022-04466-8