
MP-DPS: adaptive distributed training for deep learning based on node merging and path prediction

  • Regular Paper
  • Published in: CCF Transactions on High Performance Computing

Abstract

With the increasing scale of datasets and neural network models, distributed training of deep neural networks has become a trend. Mainstream distributed parallelization relies on expert experience; it is inefficient and hard to optimize because it requires extensive domain knowledge. Some researchers have proposed auto-parallel techniques for distributed model training, but these focus on specific models and particular parallel optimization factors, and they suffer from single-factor performance optimization, high complexity, and low efficiency. In this paper, we propose an adaptive distributed parallel training method (MP-DPS), based on heterogeneous-computing-power-aware node merging and path prediction, that automatically searches for the optimal parallel strategy in large-scale networks. First, we construct a multidimensional performance cost model to guide the design and implementation of the distributed parallel strategy. Second, we propose a node merging method with heterogeneous computing power awareness to reduce the search space and improve search efficiency. Finally, we propose a graph search algorithm based on path prediction, which finds the optimal distributed parallel strategy by optimizing the execution time of the critical path and predicting the optimal placement of the critical operator nodes on that path. Experiments show that deep learning models (such as ResNet and NasNet) can be trained effectively on 4 and 8 P100 GPUs with the distributed parallel strategy found by MP-DPS, and that the search time for the optimal distributed parallel strategy is reduced substantially compared with the FastT method.
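
To make the critical-path idea in the abstract concrete, below is a minimal sketch; it is not the authors' MP-DPS implementation. It assumes a hypothetical operator DAG with made-up per-operator compute costs, finds the most expensive (critical) path by dynamic programming over a topological order, and applies a toy placement rule that keeps critical-path operators on one device while spreading the rest round-robin. All operator names, cost values, and the placement heuristic are illustrative assumptions; MP-DPS additionally models communication and other cost dimensions, merges nodes according to heterogeneous computing power, and predicts the optimal placement of critical operators, none of which is shown here.

```python
# Illustrative sketch only: critical-path extraction over a toy operator DAG
# and a naive placement rule. Not the MP-DPS algorithm; all costs are assumed.
from collections import defaultdict

# Hypothetical computation graph: operator -> list of successor operators.
graph = {
    "conv1": ["bn1"], "bn1": ["relu1"], "relu1": ["conv2", "pool1"],
    "conv2": ["add"], "pool1": ["add"], "add": [],
}
# Assumed per-operator compute times (ms); a real cost model would also
# include communication and memory terms.
compute_cost = {"conv1": 4.0, "bn1": 1.0, "relu1": 0.5,
                "conv2": 4.0, "pool1": 0.8, "add": 0.3}
devices = ["gpu0", "gpu1"]

def topo_order(g):
    """Kahn-style topological order of the DAG."""
    indeg = defaultdict(int)
    for u in g:
        for v in g[u]:
            indeg[v] += 1
    ready = [u for u in g if indeg[u] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in g[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order

def critical_path(g, cost):
    """Most expensive path through the DAG (longest path by compute cost)."""
    dist, pred = defaultdict(float), {}
    for u in topo_order(g):
        dist[u] = max(dist[u], cost[u])          # best path ending at u
        for v in g[u]:
            if dist[u] + cost[v] > dist[v]:
                dist[v], pred[v] = dist[u] + cost[v], u
    end = max(dist, key=dist.get)
    path = [end]
    while path[-1] in pred:
        path.append(pred[path[-1]])
    return path[::-1], dist[end]

path, length = critical_path(graph, compute_cost)

# Toy placement: keep critical-path operators together to avoid transfers on
# the longest path; distribute the remaining operators round-robin.
placement = {op: devices[0] for op in path}
remaining = [op for op in graph if op not in placement]
for i, op in enumerate(remaining):
    placement[op] = devices[i % len(devices)]

print("critical path:", path, "estimated length (ms):", length)
print("placement:", placement)
```

A strategy search of the kind the paper describes would evaluate candidates like this inside its cost model and re-predict placements for critical operators until the estimated critical-path execution time stops improving; the sketch above performs only a single pass.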


References

  • Addanki, R.: Learning generalizable device placement algorithms for distributed machine learning. Massachusetts Institute of Technology (2019)

  • Addanki, R., Bojja Venkatakrishnan, S., Gupta, S., Mao, H., Alizadeh, M.: Placeto: Learning generalizable device placement algorithms for distributed machine learning. arXiv preprint arXiv:1906.08879 (2019)

  • Alixandre, B., Dorn, M.: D-BRKGA: a distributed biased random-key genetic algorithm. In: 2017 IEEE congress on evolutionary computation (CEC). IEEE, (2017)

  • Arabnejad, H., Barbosa, J.G.: List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel Distrib. Syst. 25(3), 682–694 (2014)

  • Bai, Y., Wang, J., Wang, X., et al.: The summary of deep learning in the field of weather forecast research. J. Phys. Conf. Ser. 1646(1), 12035 (2020)

  • Ballard, G., Buluc, A., Demmel, J., et al.: Communication optimal parallel multiplication of sparse random matrices. In: Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, pp 222–231 (2013)

  • Barnard, S.T., Simon, H.D.: Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. Concurr Pract Exp 6, 101–117 (1994)

  • Bolton, T., Zanna, L.: Applications of deep learning to ocean data inference and subgrid parameterization. J Adv Model Earth Syst 11(1), 376–399 (2019)

  • Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).

  • Cai, Z., Ma, K., Yan, X., Wu, Y., Huang, Y., Cheng, J., Su, T., Yu, F.: TensorOpt: exploring the tradeoffs in distributed DNN training with auto-parallelism. arXiv preprint arXiv:2004.10856 (2020)

  • Carion, N., Massa, F., Synnaeve, G., et al.: End-to-end object detection with transformers. In: European Conference on Computer Vision. Springer, Cham, pp 213–229 (2020)

  • Dean, J.: A hierarchical model for device placement. In: International Conference on Learning Representations (2018)

  • Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  • Frazier, P.I.: A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811 (2018).

  • Gao, Y., Chen, L., Li, B.: Post: Device placement with cross-entropy minimization and proximal policy optimization. In: Advances in Neural Information Processing Systems, pp 9971–9980 (2018)

  • Jia, Z., Lin, S., Qi, C.R., et al.: Exploring hidden dimensions in parallelizing convolutional neural networks. arXiv preprint arXiv:1802.04924, (2018)

  • Jia, Z., Zaharia, M., Aiken, A.: Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358 (2018)

  • Kim, S., Xing, E.P.: Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann. Appl. Stat. 2012, 1095–1117 (2012)

  • Liang, X., Shen, X., Feng, J., et al.: Semantic object parsing with graph LSTM. In: European Conference on Computer Vision, pp 125–143. Springer, Cham (2016)

  • Mirhoseini, A., Pham, H., Le, Q.V., et al.: Device placement optimization with reinforcement learning. In: International Conference on Machine Learning. PMLR, pp 2430–2439 (2017)

  • Mirhoseini, A., Goldie, A., Pham, H., Steiner, B., Le, Q.V., Dean, J.: A hierarchical model for device placement. In: International Conference on Learning Representations (2018)

  • Neubig, G., Dyer, C., Goldberg, Y., et al.: Dynet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980, (2017)

  • Pellegrini, F.: Distillating knowledge about Scotch. In: Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2009)

  • Pellegrini, F., Roman, J.: Experimental analysis of the dual recursive bipartitioning algorithm for static mapping. In TR 1038–96, LaBRI, URA CNRS 1304, Univ. Bordeaux I. Citeseer (1996)

  • Peng, Y., Bao, Y., Chen, Y., Wu, C., Guo, C.: Optimus: an efficient dynamic resource scheduler for deep learning clusters. In: Proceedings of the Thirteenth EuroSys Conference. 1–14 (2018)

  • Peters, M.E., Ammar, W., Bhagavatula, C., Power R.: Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108 (2017)

  • Paliwal, A., Gimeno, F., Nair, V., Li, Y., Lubin, M., Kohli, P., Vinyals, O.: Reinforced genetic algorithm learning for optimizing computation graphs. arXiv preprint arXiv:1905.02494 (2019)

  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)

  • Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  • Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  • Sun, S., Chen, W., Bian, J., Liu, X., Liu, T.-Y.: Slim-DP: a multi-agent system for communication-efficient distributed deep learning. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp 721–729 (2018)

  • Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215 (2014)

  • Tensorflow slim (2016) https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim.

  • Wang, F., Casalino, L.P., Khullar, D.: Deep learning in medicine—promise, progress, and challenges. JAMA Intern. Med. 179(3), 293–294 (2019a)

  • Wang, M., Huang, C., Li, J.: Supporting very large models using automatic dataflow graph partitioning. In: Proceedings of the Fourteenth EuroSys Conference, pp 1–17 (2019b)

  • Wang, L., Guo, Z.H., Cao, F., et al.: Automatic generation method of model splitting strategy for model parallel training. Comput Eng Sci (2020)

  • Wu, Y., Schuster, M., Chen, Z., Le, Q.L., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)

  • Yi, X., Luo, Z., Meng, C., et al.: Fast training of deep learning models over multiple GPUs. In: Proceedings of the 21st International Middleware Conference, pp 105–118 (2020)

  • Yu, C., Gao, C., Wang, J., et al.: Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. arXiv preprint arXiv:2004.02147, (2020)

  • Zhang, S., Tay, Y., Yao, L., et al.: Deeprec: An open-source toolkit for deep learning based recommendation. arXiv preprint arXiv:1905.10536, (2019)

  • Zhang, H., Li, Y., Deng, Z., et al.: AutoSync: learning to synchronize for data-parallel distributed deep learning. Adv Neural Inf Process Syst 33, 906–917 (2020)

  • Zhou, Q.: Research on task scheduling method of heterogeneous multi-processor in distributed environment. South China Univ. Technol. (2017)

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant No. 62072146; the Key Research and Development Program of Zhejiang Province under Grants 2019C01059, 2019C03135, and 2019C03134; the National Natural Science Foundation of China under Grant No. 61972358; the Science Foundation of Beijing under Grant No. L182053; the CAS Interdisciplinary Innovation Team of Efficient Space Weather Forecast Models; and the State Key Laboratory of Computer Architecture (ICT, CAS) under Grant No. CARCHB202120.

Author information

Corresponding authors

Correspondence to Jilin Zhang or Yongjian Ren.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zeng, Y., Ding, Y., Ou, D. et al. MP-DPS: adaptive distributed training for deep learning based on node merging and path prediction. CCF Trans. HPC 5, 429–441 (2023). https://doi.org/10.1007/s42514-022-00098-9
