A Memory Optimization Method for Distributed Training

  • Conference paper
  • Neural Information Processing (ICONIP 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1961)

Abstract

In recent years, with the continuous development of artificial intelligence, both the complexity of deep learning models and the scale of model training have kept increasing, and distributed training has become an effective way to train large-scale models. A series of pipeline-parallel training methods has emerged to improve training speed and accuracy, yet existing approaches still suffer from unbalanced stage partitioning, high memory consumption, and accuracy degradation caused by proliferating weight versions. To address these problems, we propose an efficient pipeline-parallel training optimization method that processes small batches of data in parallel across multiple compute nodes in a pipelined manner. We propose a prefix sum partition algorithm to realize a balanced partition and reduce the memory footprint of compute resources, and we design a clock optimization strategy that limits the number of weight versions generated in order to preserve model accuracy. Compared with well-known pipeline-parallel frameworks, our method achieves about 2 times training acceleration and saves about 30% of memory consumption, and it improves model accuracy by about 10% compared with PipeDream.
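As a rough illustration of the balanced-partition idea described in the abstract, the sketch below splits consecutive layers into pipeline stages by binary-searching cut points on a prefix-sum array of per-layer costs. It is a minimal sketch under assumed inputs (a per-layer cost vector and a stage count); the function name, the cost model, and the target-based cut rule are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch: prefix-sum-based balanced partition of layers into pipeline
# stages. Assumes per-layer costs (e.g., activation memory or compute time)
# are known in advance; names and the cut rule are illustrative, not the
# paper's exact algorithm.
from bisect import bisect_left
from itertools import accumulate
from typing import List, Tuple


def partition_layers(costs: List[float], num_stages: int) -> List[Tuple[int, int]]:
    """Split consecutive layers into `num_stages` contiguous groups with
    approximately equal total cost, locating each cut point by binary
    search over the prefix-sum array."""
    prefix = list(accumulate(costs))                  # prefix[i] = total cost of layers 0..i
    total = prefix[-1]
    stages, start = [], 0
    for s in range(1, num_stages):
        target = total * s / num_stages               # ideal cumulative cost after stage s
        cut = bisect_left(prefix, target, lo=start)   # first layer whose prefix reaches the target
        cut = min(cut, len(costs) - (num_stages - s)) # leave at least one layer per remaining stage
        cut = max(cut, start)                         # keep the current stage non-empty
        stages.append((start, cut))                   # stage covers layers start..cut (inclusive)
        start = cut + 1
    stages.append((start, len(costs) - 1))            # last stage takes the remaining layers
    return stages


if __name__ == "__main__":
    # Example: 8 layers with uneven costs split over 4 pipeline stages.
    layer_costs = [4.0, 1.0, 1.0, 2.0, 6.0, 1.0, 3.0, 2.0]
    print(partition_layers(layer_costs, num_stages=4))
```

Using prefix sums keeps each cut an O(log n) binary search instead of a rescan of the cost vector, and the same routine applies whether the per-layer cost measures memory or computation time.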


References

  1. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  2. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  3. Cai, H., Zhu, L., Han, S.: ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018)

  4. Huang, Y., et al.: GPipe: efficient training of giant neural networks using pipeline parallelism. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

  5. Alvarez, J.M., Salzmann, M.: Learning the number of neurons in deep networks. In: Advances in Neural Information Processing Systems, vol. 29 (2016)

  6. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, vol. 28 (2015)

  7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)

  8. Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)

  9. Shallue, C.J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., Dahl, G.E.: Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600 (2018)

  10. Sabet, M.J., Dufter, P., Yvon, F., Schütze, H.: SimAlign: high quality word alignments without parallel training data using static and contextualized embeddings. arXiv preprint arXiv:2004.08728 (2020)

  11. Fan, S., et al.: DAPPLE: a pipelined data parallel approach for training large models. In: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 431–445 (2021)

  12. Zhang, M., Zhou, Y., Zhao, L., Li, H.: Transfer learning from speech synthesis to voice conversion with non-parallel training data. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1290–1302 (2021)

  13. Dean, J., et al.: Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)

  14. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

  15. Jia, Z., Zaharia, M., Aiken, A.: Beyond data and model parallelism for deep neural networks. Proc. Mach. Learn. Syst. 1, 1–13 (2019)

  16. Narayanan, D., Phanishayee, A., Shi, K., Chen, X., Zaharia, M.: Memory-efficient pipeline-parallel DNN training. In: International Conference on Machine Learning, pp. 7937–7947. PMLR (2021)

  17. Moritz, P., et al.: Ray: a distributed framework for emerging AI applications. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577 (2018)

  18. Narayanan, D., et al.: PipeDream: generalized pipeline parallelism for DNN training. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1–15 (2019)

  19. Zhang, Q.: A novel ResNet101 model based on dense dilated convolution for image classification. SN Appl. Sci. 4, 1–13 (2022)

  20. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 4780–4789 (2019)

  21. Jiang, J., Cui, B., Zhang, C., Yu, L.: Heterogeneity-aware distributed parameter servers. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 463–478 (2017)

  22. Jia, Z., Lin, S., Qi, C.R., Aiken, A.: Exploring hidden dimensions in accelerating convolutional neural networks. In: International Conference on Machine Learning, pp. 2274–2283. PMLR (2018)

  23. Jiang, W., et al.: A novel stochastic gradient descent algorithm based on grouping over heterogeneous cluster systems for distributed deep learning. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 391–398. IEEE (2019)

  24. Kim, J.K., et al.: STRADS: a distributed framework for scheduled model parallel machine learning. In: Proceedings of the Eleventh European Conference on Computer Systems, pp. 1–16 (2016)

  25. Darriba, D., Taboada, G.L., Doallo, R., Posada, D.: jModelTest 2: more models, new heuristics and parallel computing. Nat. Methods 9(8), 772–772 (2012)

  26. Prosky, L., Asp, N.-G., Schweizer, T.F., Devries, J.W., Furda, I.: Determination of insoluble, soluble, and total dietary fiber in foods and food products: interlaboratory study. J. Assoc. Off. Anal. Chem. 71(5), 1017–1023 (1988)

  27. Shen, L., Mao, Y., Wang, Z., Nie, H., Huang, J.: DNN training optimization with pipelined parallel based on feature maps encoding. In: 2022 Tenth International Conference on Advanced Cloud and Big Data (CBD), pp. 36–41. IEEE (2022)

  28. Chen, C.-C., Yang, C.-L., Cheng, H.-Y.: Efficient and robust parallel DNN training through model parallelism on multi-GPU platform. arXiv preprint arXiv:1809.02839 (2018)

  29. Yang, B., Zhang, J., Li, J., Ré, C., Aberger, C., De Sa, C.: PipeMare: asynchronous pipeline parallel DNN training. In: Proceedings of Machine Learning and Systems, vol. 3, pp. 269–296 (2021)

  30. Harris, M., Sengupta, S., Owens, J.D.: Parallel prefix sum (scan) with CUDA. GPU Gems 3(39), 851–876 (2007)

  31. Sengupta, S., Lefohn, A., Owens, J.D.: A work-efficient step-efficient prefix sum algorithm (2006)

  32. Safari, M., Oortwijn, W., Joosten, S., Huisman, M.: Formal verification of parallel prefix sum. In: Lee, R., Jha, S., Mavridou, A., Giannakopoulou, D. (eds.) NFM 2020. LNCS, vol. 12229, pp. 170–186. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-55754-6_10

Acknowledgment

This work is supported by the R&D and application of key technologies of independent and controllable computing power network (grant No. 2022JBZ01-01) and the Joint Fund of Shandong Natural Science Foundation (No. ZR2022LZH010).

Author information

Corresponding author

Correspondence to Lu Wu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Lv, T., Wu, L., Zhao, Z., Wang, C., Li, C. (2024). A Memory Optimization Method for Distributed Training. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1961. Springer, Singapore. https://doi.org/10.1007/978-981-99-8126-7_30

  • DOI: https://doi.org/10.1007/978-981-99-8126-7_30

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8125-0

  • Online ISBN: 978-981-99-8126-7

  • eBook Packages: Computer Science, Computer Science (R0)
