Abstract
Driven by big data, neural networks are becoming increasingly complex, and the computing capacity of a single machine often cannot meet the demand. Distributed deep learning has shown great performance advantages in handling this problem. However, a serious issue in this field is the existence of stragglers, which significantly restricts the performance of the whole system. Fully exploiting the computing capacity of a system built on the parameter server architecture, especially in a heterogeneous environment, remains an enormous challenge. Motivated by this, we design a method named EP4DDL that minimizes the impact of the straggler problem through load balancing. From a statistical view, the approach introduces a novel metric named performance variance to give a comprehensive inspection of stragglers and applies flexible parallelism to each node. We verify the algorithm on standard benchmarks and demonstrate that it reduces training time by 57.46%, 24.8%, and 11.5% compared with FlexRR, Con-SGD, and Falcon, respectively, without accuracy loss.
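As a concrete illustration of the load-balancing idea described above, the following minimal Python sketch shows one way a parameter server could monitor per-worker iteration times and redistribute a fixed global batch in proportion to measured speed and stability. The scoring formula and the helper names (performance_scores, rebalance_batches) are illustrative assumptions, not the paper's actual definition of performance variance or its flexible-parallelism mechanism.

```python
import numpy as np

def performance_scores(iter_times):
    """Hypothetical per-worker score from recent iteration times (seconds).

    `iter_times` is a (workers x iterations) array of measured step times.
    Workers that are both fast (low mean) and stable (low deviation) score
    higher; this is only a stand-in for the paper's performance-variance metric.
    """
    mean_t = iter_times.mean(axis=1)
    std_t = iter_times.std(axis=1)
    return 1.0 / (mean_t + std_t)

def rebalance_batches(iter_times, global_batch):
    """Split a fixed global batch across workers in proportion to their scores."""
    scores = performance_scores(iter_times)
    shares = scores / scores.sum()
    sizes = np.floor(shares * global_batch).astype(int)
    sizes[np.argmax(shares)] += global_batch - sizes.sum()  # absorb rounding drift
    return sizes

if __name__ == "__main__":
    # Three workers, five recorded iterations; worker 2 is a straggler.
    times = np.array([[0.9, 1.0, 1.1, 1.0, 1.0],
                      [1.0, 1.1, 0.9, 1.0, 1.0],
                      [2.8, 3.1, 2.9, 3.0, 3.2]])
    print(rebalance_batches(times, global_batch=512))  # e.g. [220 218  74]
```

In a real deployment, the server would refresh these statistics over a sliding window of recent iterations and push the new batch sizes to workers before the next round; EP4DDL's flexible parallelism adjusts more than batch sizes, so this sketch only conveys the general load-balancing principle.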
References
Zhong Y, Oh S, Moon HC (2021) Service transformation under industry 4.0: investigating acceptance of facial recognition payment through an extended technology acceptance model. Technol Soc 64:101515
Stewart R, Velupillai S (2021) Applied natural language processing in mental health big data. Neuropsychopharmacology 46(1):252
Lanctot M, Lockhart E, Lespiau JB et al (2019) OpenSpiel: a framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453
Peng Y, Bao Y, Chen Y et al (2021) DL2: a deep learning-driven scheduler for deep learning clusters. IEEE Trans Parallel Distrib Syst 32(8):1947–1960
Jiang J, Cui B, Zhang C et al (2017) Heterogeneity-aware distributed parameter servers. In: Proceedings of the ACM International Conference on Management of Data, pp 463–478
Ho Q, Cipar J, Cui H et al (2013) More effective distributed ML via a stale synchronous parallel parameter server. Adv Neural Inf Process Syst 26:1223
Zhou Q, Guo S, Lu H et al (2020) Falcon: addressing stragglers in heterogeneous parameter server via multiple parallelism. IEEE Trans Comput 70(1):139–155
Gill SS, Ouyang X, Garraghan P (2020) Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres. J Supercomput 76(12):10050–10089
Harlap A, Cui H, Dai W et al (2016) Addressing the straggler problem for iterative convergent parallel ML. In: Proceedings of the Seventh ACM Symposium on Cloud Computing, pp 98–111
Kishor A, Chakraborty C, Jeberson W (2021) A novel fog computing approach for minimization of latency in healthcare using machine learning. Int J Interact Multimed Artif Intell 6(Special Issue on Current Trends in Intelligent Multimedia Processing Systems):7–17
Benalla M (2016) A distributed intelligent system for emergency convoy. Int J Interact Multimed Artif Intell 4:1
Aktas MF, Peng P, Soljanin E (2017) Effective straggler mitigation: which clones should attack and when? ACM SIGMETRICS Perform Eval Rev 45(2):12–14
Zhang J, Simeone O (2020) LAGC: lazily aggregated gradient coding for straggler-tolerant and communication-efficient distributed learning. IEEE Trans Neural Netw Learn Syst 32(3):962–974
Bitar R, Wootters M, El Rouayheb S (2020) Stochastic gradient coding for straggler mitigation in distributed learning. IEEE J Sel Areas Inf Theor 1(1):277–291
Guo Y, Rao J, Jiang C et al (2016) Moving hadoop into the cloud with flexible slot management and speculative execution. IEEE Trans Parallel Distrib Syst 28(3):798–812
Huang Y, Jin T, Wu Y et al (2018) FlexPS: flexible parallelism control in parameter server architecture. Proc VLDB Endow 11(5):566–579
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
LeCun Y, Cortes C, Burges CJC. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
Krizhevsky A, Nair V, Hinton G. The CIFAR-10 dataset. cs.toronto.edu/~kriz/cifar.html
Huang Y, Cheng Y, Bapna A et al (2018) GPipe: efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965
Dean J, Corrado GS, Monga R et al (2012) Large scale distributed deep networks. Adv Neural Inf Process Syst 25
Wu X, Xu H, Li B et al (2020) Stanza: layer separation for distributed training in deep learning. IEEE Trans Serv Comput
Geng J, Li D, Wang S (2020) Fela: incorporating flexible parallelism and elastic tuning to accelerate large-scale DML. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, pp 1393–1404
Chen J, Pan X, Monga R et al (2016) Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981
Zheng S, Meng Q, Wang T et al (2017) Asynchronous stochastic gradient descent with delay compensation. In: International Conference on Machine Learning. PMLR, pp 4120–4129
Costantini S, De Gasperis G, De Lauretis L (2021) An application of declarative languages in distributed architectures: ASP and DALI microservices. Int J Interact Multimed Artif Intell 6(Special Issue on Artificial Intelligence, Paving the Way to the Future):66–78
Niu F, Recht B, Re C et al (2011) HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent. Adv Neural Inf Process Syst 24:693–701
Zhang W, Gupta S, Lian X et al (2016) Staleness-aware async-SGD for distributed deep learning. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp 2350–2356
Chen M, Mao B, Ma T (2021) FedSA: a staleness-aware asynchronous federated learning algorithm with non-IID data. Future Gener Comput Syst 120:1–12
Khaleghzadeh H, Manumachu RR, Lastovetsky A (2018) A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms. IEEE Trans Parallel Distrib Syst 29(10):2176–2190
Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Chen C et al (2018) Fast distributed deep learning via worker-adaptive batch sizing. In: Proceedings of the ACM Symposium on Cloud Computing
Acknowledgements
We would like to thank the anonymous reviewers, whose insightful comments greatly improved the quality of this paper. The work described in this paper was supported in part by the Key Basic Research Program of the China Basic Strengthening Program (2019-JCJQ-ZD-041) and the National Key Research and Development Program of China (2016YFB0200902).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ji, Z., Zhang, X., Li, J. et al. EP4DDL: addressing straggler problem in heterogeneous distributed deep learning. J Supercomput 78, 15663–15680 (2022). https://doi.org/10.1007/s11227-022-04466-8