Optimizing the Parallelism of Communication and Computation in Distributed Training Platform

  • Conference paper
  • In: Algorithms and Architectures for Parallel Processing (ICA3PP 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14487)

Abstract

With the development of deep learning, DNN models have become more complex. Large numbers of model parameters raise the level of AI by improving the accuracy of DNN models. However, they also present severe challenges to the hardware training platform, because training a large model requires a large amount of computing and memory resources, which can easily exceed the capacity of a single accelerator. In addition, with the increasing demand for DNN model accuracy in academia and industry, the number of training iterations is also skyrocketing. Against this background, more accelerators are integrated into a hierarchical platform to conduct distributed training. In distributed training platforms, the computation of the DNN model and the communication of the intermediate parameters are handled by different hardware modules, so their degree of parallelism profoundly affects the training speed. In this work, based on the widely used hierarchical Torus-Ring training platform and the Ring All-Reduce collective communication algorithm, we improve the speed of distributed training by optimizing the parallelism of communication and computation. Specifically, based on an analysis of the distributed training process, we schedule the computation and communication so that they execute simultaneously as much as possible. As a result, for data parallelism and model parallelism, we reduce the communication exposure time and the computation exposure time, respectively. Compared with the previous work, the training speed (measured over 5 training iterations) of the ResNet50 model and the Transformer model is increased by 23.77%–25.64% and 11.66%–12.83%, respectively.
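
To make the exposure-time idea concrete, the sketch below gives a minimal toy timing model of one data-parallel backward pass. It is not the paper's analytical model or implementation; the function names and per-layer times are illustrative assumptions. Each layer's gradient all-reduce is launched as soon as that gradient is ready, so it overlaps with the backward computation of the earlier layers, and only the communication that outlasts the remaining compute is exposed.

```python
# Toy timing model (illustrative only, not the paper's analysis) of overlapping
# per-layer gradient all-reduce with the remaining backward computation.
# All names and per-layer times are assumed values for illustration.

def step_time_serial(compute_ms, comm_ms):
    """No overlap: every layer's backward compute and its all-reduce are serialized."""
    return sum(compute_ms) + sum(comm_ms)

def step_time_overlapped(compute_ms, comm_ms):
    """Backward runs from the last layer to the first; a layer's all-reduce is
    launched as soon as its gradient is ready and overlaps with the backward
    compute of the earlier layers. Only communication that outlasts the
    remaining compute is exposed."""
    t_grad_ready = 0.0   # time at which the current layer's gradient is available
    t_link_free = 0.0    # time at which the communication link becomes free
    for c, m in zip(reversed(compute_ms), reversed(comm_ms)):
        t_grad_ready += c                                   # backward compute of this layer
        t_link_free = max(t_link_free, t_grad_ready) + m    # its all-reduce on the link
    return max(t_grad_ready, t_link_free)

if __name__ == "__main__":
    compute = [4.0, 6.0, 8.0, 10.0]   # per-layer backward time (ms), assumed
    comm = [3.0, 3.0, 5.0, 7.0]       # per-layer all-reduce time (ms), assumed
    serial = step_time_serial(compute, comm)
    overlapped = step_time_overlapped(compute, comm)
    print(f"serial: {serial:.1f} ms, overlapped: {overlapped:.1f} ms, "
          f"exposed communication: {overlapped - sum(compute):.1f} ms")
```

With these assumed numbers, the serial schedule takes 46 ms while the overlapped schedule takes 31 ms, exposing only 3 ms of the 18 ms of communication. The paper pursues this kind of scheduling on the hierarchical Torus-Ring platform; the model-parallel counterpart reduces the computation exposure time instead.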

This work is supported in part by the National Key R&D Project No. 2021YFB0300300, the NSFC (62172430), the NSF of Hunan Province (2021JJ10052), the STIP of Hunan Province (2022RC3065), and the Key Laboratory of Advanced Microprocessor Chips and Systems.

Author information

Correspondence to Xiang Hou or Yuan Yuan.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Hou, X. et al. (2024). Optimizing the Parallelism of Communication and Computation in Distributed Training Platform. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14487. Springer, Singapore. https://doi.org/10.1007/978-981-97-0834-5_20

  • DOI: https://doi.org/10.1007/978-981-97-0834-5_20

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0833-8

  • Online ISBN: 978-981-97-0834-5

  • eBook Packages: Computer Science, Computer Science (R0)
