A Memory Optimization Method for Distributed Training

  • Conference paper
  • Neural Information Processing (ICONIP 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1961)

Abstract

In recent years, with the continuous development of artificial intelligence, both the complexity of deep learning models and the scale of model training have kept increasing, and distributed training has become an effective way to train large-scale models. A series of pipeline-parallel training methods has emerged to improve training speed and accuracy, yet existing approaches still suffer from unbalanced stage partitioning, high memory consumption, and accuracy degradation caused by proliferating weight versions. To address these problems, we propose an efficient pipeline-parallel training optimization method that processes small batches of data in parallel across multiple compute nodes in a pipelined manner. We propose a prefix sum partition algorithm to realize a balanced partition and reduce the memory footprint of compute resources, and we design a clock optimization strategy that limits the number of weight versions generated in order to preserve model accuracy. Compared with well-known pipeline-parallel frameworks, our method achieves about 2 times training acceleration and saves about 30% of memory consumption, and it improves model accuracy by about 10% compared with PipeDream.
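As a rough illustration of the balanced-partition idea described in the abstract, the sketch below splits consecutive layers into pipeline stages by binary-searching cut points on a prefix-sum array of per-layer costs. It is a minimal sketch under assumed inputs (a per-layer cost vector and a stage count); the function name, the cost model, and the target-based cut rule are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch: prefix-sum-based balanced partition of layers into pipeline
# stages. Assumes per-layer costs (e.g., activation memory or compute time)
# are known in advance; names and the cut rule are illustrative, not the
# paper's exact algorithm.
from bisect import bisect_left
from itertools import accumulate
from typing import List, Tuple


def partition_layers(costs: List[float], num_stages: int) -> List[Tuple[int, int]]:
    """Split consecutive layers into `num_stages` contiguous groups with
    approximately equal total cost, locating each cut point by binary
    search over the prefix-sum array."""
    prefix = list(accumulate(costs))                  # prefix[i] = total cost of layers 0..i
    total = prefix[-1]
    stages, start = [], 0
    for s in range(1, num_stages):
        target = total * s / num_stages               # ideal cumulative cost after stage s
        cut = bisect_left(prefix, target, lo=start)   # first layer whose prefix reaches the target
        cut = min(cut, len(costs) - (num_stages - s)) # leave at least one layer per remaining stage
        cut = max(cut, start)                         # keep the current stage non-empty
        stages.append((start, cut))                   # stage covers layers start..cut (inclusive)
        start = cut + 1
    stages.append((start, len(costs) - 1))            # last stage takes the remaining layers
    return stages


if __name__ == "__main__":
    # Example: 8 layers with uneven costs split over 4 pipeline stages.
    layer_costs = [4.0, 1.0, 1.0, 2.0, 6.0, 1.0, 3.0, 2.0]
    print(partition_layers(layer_costs, num_stages=4))
```

Using prefix sums keeps each cut an O(log n) binary search instead of a rescan of the cost vector, and the same routine applies whether the per-layer cost measures memory or computation time.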


References

  1. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  2. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  3. Cai, H., Zhu, L., Han, S.: ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018)

  4. Huang, Y., et al.: GPipe: efficient training of giant neural networks using pipeline parallelism. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

  5. Alvarez, J.M., Salzmann, M.: Learning the number of neurons in deep networks. In: Advances in Neural Information Processing Systems, vol. 29 (2016)

  6. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, vol. 28 (2015)

  7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)

  8. Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)

  9. Shallue, C.J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., Dahl, G.E.: Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600 (2018)

  10. Sabet, M.J., Dufter, P., Yvon, F., Schütze, H.: SimAlign: high quality word alignments without parallel training data using static and contextualized embeddings. arXiv preprint arXiv:2004.08728 (2020)

  11. Fan, S., et al.: DAPPLE: a pipelined data parallel approach for training large models. In: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 431–445 (2021)

  12. Zhang, M., Zhou, Y., Zhao, L., Li, H.: Transfer learning from speech synthesis to voice conversion with non-parallel training data. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1290–1302 (2021)

  13. Dean, J., et al.: Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)

  14. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

  15. Jia, Z., Zaharia, M., Aiken, A.: Beyond data and model parallelism for deep neural networks. Proc. Mach. Learn. Syst. 1, 1–13 (2019)

  16. Narayanan, D., Phanishayee, A., Shi, K., Chen, X., Zaharia, M.: Memory-efficient pipeline-parallel DNN training. In: International Conference on Machine Learning, pp. 7937–7947. PMLR (2021)

  17. Moritz, P., et al.: Ray: a distributed framework for emerging AI applications. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577 (2018)

  18. Narayanan, D., et al.: PipeDream: generalized pipeline parallelism for DNN training. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1–15 (2019)

  19. Zhang, Q.: A novel ResNet101 model based on dense dilated convolution for image classification. SN Appl. Sci. 4, 1–13 (2022)

  20. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 4780–4789 (2019)

  21. Jiang, J., Cui, B., Zhang, C., Yu, L.: Heterogeneity-aware distributed parameter servers. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 463–478 (2017)

  22. Jia, Z., Lin, S., Qi, C.R., Aiken, A.: Exploring hidden dimensions in accelerating convolutional neural networks. In: International Conference on Machine Learning, pp. 2274–2283. PMLR (2018)

  23. Jiang, W., et al.: A novel stochastic gradient descent algorithm based on grouping over heterogeneous cluster systems for distributed deep learning. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 391–398. IEEE (2019)

  24. Kim, J.K., et al.: STRADS: a distributed framework for scheduled model parallel machine learning. In: Proceedings of the Eleventh European Conference on Computer Systems, pp. 1–16 (2016)

  25. Darriba, D., Taboada, G.L., Doallo, R., Posada, D.: jModelTest 2: more models, new heuristics and parallel computing. Nat. Methods 9(8), 772–772 (2012)

  26. Prosky, L., Asp, N.-G., Schweizer, T.F., Devries, J.W., Furda, I.: Determination of insoluble, soluble, and total dietary fiber in foods and food products: interlaboratory study. J. Assoc. Off. Anal. Chem. 71(5), 1017–1023 (1988)

  27. Shen, L., Mao, Y., Wang, Z., Nie, H., Huang, J.: DNN training optimization with pipelined parallel based on feature maps encoding. In: 2022 Tenth International Conference on Advanced Cloud and Big Data (CBD), pp. 36–41. IEEE (2022)

  28. Chen, C.-C., Yang, C.-L., Cheng, H.-Y.: Efficient and robust parallel DNN training through model parallelism on multi-GPU platform. arXiv preprint arXiv:1809.02839 (2018)

  29. Yang, B., Zhang, J., Li, J., Ré, C., Aberger, C., De Sa, C.: PipeMare: asynchronous pipeline parallel DNN training. In: Proceedings of Machine Learning and Systems, vol. 3, pp. 269–296 (2021)

  30. Harris, M., Sengupta, S., Owens, J.D.: Parallel prefix sum (scan) with CUDA. GPU Gems 3(39), 851–876 (2007)

  31. Sengupta, S., Lefohn, A., Owens, J.D.: A work-efficient step-efficient prefix sum algorithm (2006)

  32. Safari, M., Oortwijn, W., Joosten, S., Huisman, M.: Formal verification of parallel prefix sum. In: Lee, R., Jha, S., Mavridou, A., Giannakopoulou, D. (eds.) NFM 2020. LNCS, vol. 12229, pp. 170–186. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-55754-6_10

Acknowledgment

This work is supported by the R&D and application of key technologies of independent and controllable computing power network (grant No. 2022JBZ01-01) and the Joint Fund of Shandong Natural Science Foundation (No. ZR2022LZH010).

Author information

Corresponding author

Correspondence to Lu Wu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Lv, T., Wu, L., Zhao, Z., Wang, C., Li, C. (2024). A Memory Optimization Method for Distributed Training. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1961. Springer, Singapore. https://doi.org/10.1007/978-981-99-8126-7_30

  • DOI: https://doi.org/10.1007/978-981-99-8126-7_30

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8125-0

  • Online ISBN: 978-981-99-8126-7

  • eBook Packages: Computer Science, Computer Science (R0)
