
MP-DPS: adaptive distributed training for deep learning based on node merging and path prediction

  • Regular Paper
  • Published in: CCF Transactions on High Performance Computing

Abstract

With the increasing scale of datasets and neural network models, distributed training of deep neural networks has become a trend. Mainstream distributed parallelization relies on expert experience; it is inefficient and hard to optimize because it requires extensive domain knowledge. Some researchers have proposed auto-parallel techniques for distributed model training, but these focus on specific models and particular parallel optimization factors, and they suffer from single-factor performance optimization, high complexity, and low efficiency. In this paper, we propose an adaptive distributed parallel training method (MP-DPS), based on heterogeneous-computing-power-aware node merging and path prediction, that automatically searches for the optimal parallel strategy in large-scale networks. First, we construct a multidimensional performance cost model to guide the design and implementation of the distributed parallel strategy. Second, we propose a node merging method with heterogeneous computing power awareness to reduce the search space and improve search efficiency. Finally, we propose a graph search algorithm based on path prediction, which finds the optimal distributed parallel strategy by optimizing the execution time of the critical path and predicting the optimal placement of the critical operator nodes on that path. Experiments show that deep learning models (such as ResNet and NasNet) can be trained effectively on 4 and 8 P100 GPUs with the distributed parallel strategy found by MP-DPS, and that the search time for the optimal distributed parallel strategy is reduced substantially compared with the FastT method.
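
To make the critical-path idea in the abstract concrete, below is a minimal sketch; it is not the authors' MP-DPS implementation. It assumes a hypothetical operator DAG with made-up per-operator compute costs, finds the most expensive (critical) path by dynamic programming over a topological order, and applies a toy placement rule that keeps critical-path operators on one device while spreading the rest round-robin. All operator names, cost values, and the placement heuristic are illustrative assumptions; MP-DPS additionally models communication and other cost dimensions, merges nodes according to heterogeneous computing power, and predicts the optimal placement of critical operators, none of which is shown here.

```python
# Illustrative sketch only: critical-path extraction over a toy operator DAG
# and a naive placement rule. Not the MP-DPS algorithm; all costs are assumed.
from collections import defaultdict

# Hypothetical computation graph: operator -> list of successor operators.
graph = {
    "conv1": ["bn1"], "bn1": ["relu1"], "relu1": ["conv2", "pool1"],
    "conv2": ["add"], "pool1": ["add"], "add": [],
}
# Assumed per-operator compute times (ms); a real cost model would also
# include communication and memory terms.
compute_cost = {"conv1": 4.0, "bn1": 1.0, "relu1": 0.5,
                "conv2": 4.0, "pool1": 0.8, "add": 0.3}
devices = ["gpu0", "gpu1"]

def topo_order(g):
    """Kahn-style topological order of the DAG."""
    indeg = defaultdict(int)
    for u in g:
        for v in g[u]:
            indeg[v] += 1
    ready = [u for u in g if indeg[u] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in g[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order

def critical_path(g, cost):
    """Most expensive path through the DAG (longest path by compute cost)."""
    dist, pred = defaultdict(float), {}
    for u in topo_order(g):
        dist[u] = max(dist[u], cost[u])          # best path ending at u
        for v in g[u]:
            if dist[u] + cost[v] > dist[v]:
                dist[v], pred[v] = dist[u] + cost[v], u
    end = max(dist, key=dist.get)
    path = [end]
    while path[-1] in pred:
        path.append(pred[path[-1]])
    return path[::-1], dist[end]

path, length = critical_path(graph, compute_cost)

# Toy placement: keep critical-path operators together to avoid transfers on
# the longest path; distribute the remaining operators round-robin.
placement = {op: devices[0] for op in path}
remaining = [op for op in graph if op not in placement]
for i, op in enumerate(remaining):
    placement[op] = devices[i % len(devices)]

print("critical path:", path, "estimated length (ms):", length)
print("placement:", placement)
```

A strategy search of the kind the paper describes would evaluate candidates like this inside its cost model and re-predict placements for critical operators until the estimated critical-path execution time stops improving; the sketch above performs only a single pass.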


References

  • Addanki, R.: Learning generalizable device placement algorithms for distributed machine learning. Massachusetts Institute of Technology (2019)

  • Addanki, R., Bojja Venkatakrishnan, S., Gupta, S., Mao, H., Alizadeh, M.: Placeto: Learning generalizable device placement algorithms for distributed machine learning. arXiv preprint arXiv:1906.08879 (2019)

  • Alixandre, B., Dorn, M.: D-BRKGA: a distributed biased random-key genetic algorithm. In: 2017 IEEE congress on evolutionary computation (CEC). IEEE, (2017)

  • Arabnejad, H., Barbosa, J.G.: List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel Distrib. Syst. 25(3), 682–694 (2014)

  • Bai, Y., Wang, J., Wang, X., et al.: The summary of deep learning in the field of weather forecast research. J. Phys. Conf. Ser. 1646(1), 12035 (2020)

  • Ballard, G., Buluc, A., Demmel, J., et al.: Communication optimal parallel multiplication of sparse random matrices. In: Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, pp 222–231 (2013)

  • Barnard, S.T., Simon, H.D.: Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. Concurr Pract Exp 6, 101–117 (1994)

  • Bolton, T., Zanna, L.: Applications of deep learning to ocean data inference and subgrid parameterization. J Adv Model Earth Syst 11(1), 376–399 (2019)

  • Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).

  • Cai, Z., Ma, K., Yan, X., Wu, Y., Huang, Y., Cheng, J., Su, T., Yu, F.: TensorOpt: exploring the tradeoffs in distributed DNN training with auto-parallelism. arXiv preprint arXiv:2004.10856 (2020)

  • Carion, N., Massa, F., Synnaeve, G., et al.: End-to-end object detection with transformers. In: European Conference on Computer Vision. Springer, Cham, pp 213–229 (2020)

  • Dean, J.: A hierarchical model for device placement. In: International Conference on Learning Representations (2018)

  • Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  • Frazier, P.I.: A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811 (2018).

  • Gao, Y., Chen, L., Li, B.: Post: Device placement with cross-entropy minimization and proximal policy optimization. In: Advances in Neural Information Processing Systems, pp 9971–9980 (2018)

  • Jia, Z., Lin, S., Qi, C.R., et al.: Exploring hidden dimensions in parallelizing convolutional neural networks. arXiv preprint arXiv:1802.04924, (2018)

  • Jia, Z., Zaharia, M., Aiken, A.: Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358 (2018)

  • Kim, S., Xing, E.P.: Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann. Appl. Stat. 2012, 1095–1117 (2012)

  • Liang, X., Shen, X., Feng, J., et al.: Semantic object parsing with graph LSTM. In: European Conference on Computer Vision, pp 125–143. Springer, Cham (2016)

  • Mirhoseini, A., Pham, H., Le, Q.V., et al.: Device placement optimization with reinforcement learning. In: International Conference on Machine Learning. PMLR, pp 2430–2439 (2017)

  • Mirhoseini, A., Goldie, A., Pham, H., Steiner, B., Le, Q.V., Dean, J.: A hierarchical model for device placement. In: International Conference on Learning Representations (2018)

  • Neubig, G., Dyer, C., Goldberg, Y., et al.: Dynet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980, (2017)

  • Pellegrini, F.: Distillating knowledge about Scotch. In: Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2009)

  • Pellegrini, F., Roman, J.: Experimental analysis of the dual recursive bipartitioning algorithm for static mapping. In TR 1038–96, LaBRI, URA CNRS 1304, Univ. Bordeaux I. Citeseer (1996)

  • Peng, Y., Bao, Y., Chen, Y., Wu, C., Guo, C.: Optimus: an efficient dynamic resource scheduler for deep learning clusters. In: Proceedings of the Thirteenth EuroSys Conference. 1–14 (2018)

  • Peters, M.E., Ammar, W., Bhagavatula, C., Power R.: Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108 (2017)

  • Paliwal, A., Gimeno, F., Nair, V., Li, Y., Lubin, M., Kohli, P., Vinyals, O.: Reinforced genetic algorithm learning for optimizing computation graphs. arXiv preprint arXiv:1905.02494 (2019)

  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)

  • Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  • Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  • Sun, S., Chen, W., Bian, J., Liu, X., Liu, T.-Y.: Slim-DP: a multi-agent system for communication-efficient distributed deep learning. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp 721–729 (2018)

  • Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215 (2014)

  • Tensorflow slim (2016) https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim.

  • Wang, F., Casalino, L.P., Khullar, D.: Deep learning in medicine—promise, progress, and challenges. JAMA Intern. Med. 179(3), 293–294 (2019a)

  • Wang, M., Huang, C., Li, J.: Supporting very large models using automatic dataflow graph partitioning. In: Proceedings of the Fourteenth EuroSys Conference, pp 1–17 (2019b)

  • Wang, L., Guo, Z.H., Cao, F., et al.: Automatic generation method of model splitting strategy for model parallel training. Comput Eng Sci (2020)

  • Wu, Y., Schuster, M., Chen, Z., Le, Q.L., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)

  • Yi, X., Luo, Z., Meng, C., et al.: Fast training of deep learning models over multiple GPUs. In: Proceedings of the 21st International Middleware Conference, pp 105–118 (2020)

  • Yu, C., Gao, C., Wang, J., et al.: Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. arXiv preprint arXiv:2004.02147, (2020)

  • Zhang, S., Tay, Y., Yao, L., et al.: Deeprec: An open-source toolkit for deep learning based recommendation. arXiv preprint arXiv:1905.10536, (2019)

  • Zhang, H., Li, Y., Deng, Z., et al.: AutoSync: learning to synchronize for data-parallel distributed deep learning. Adv Neural Inf Process Syst 33, 906–917 (2020)

  • Zhou, Q.: Research on task scheduling method of heterogeneous multi-processor in distributed environment. South China Univ. Technol. (2017)

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant No. 62072146; the Key Research and Development Program of Zhejiang Province under Grants 2019C01059, 2019C03135, and 2019C03134; the National Natural Science Foundation of China under Grant No. 61972358; the Science Foundation of Beijing under Grant No. L182053; the CAS Interdisciplinary Innovation Team of Efficient Space Weather Forecast Models; and the State Key Laboratory of Computer Architecture (ICT, CAS) under Grant No. CARCHB202120.

Author information

Corresponding authors

Correspondence to Jilin Zhang or Yongjian Ren.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zeng, Y., Ding, Y., Ou, D. et al. MP-DPS: adaptive distributed training for deep learning based on node merging and path prediction. CCF Trans. HPC 5, 429–441 (2023). https://doi.org/10.1007/s42514-022-00098-9
