Abstract
In recent years, with the continuous development of artificial intelligence, the complexity of deep learning algorithms and the scale of model training have kept increasing, and distributed training has become an effective way to train large-scale models. A series of pipeline-parallel training methods have emerged to improve training speed, but existing approaches still struggle to balance throughput, memory consumption, and model accuracy. To address this problem, we propose an efficient pipeline-parallel training optimization method. Our approach processes mini-batches of data in parallel across multiple compute nodes in a pipelined manner. We propose a prefix-sum partition algorithm that realizes a balanced partition and saves the memory of the compute nodes, and we design a clock optimization strategy that limits the number of generated weight versions to ensure model accuracy. Compared with well-known pipeline-parallel frameworks, our method achieves about 2 times training acceleration, saves about 30% of memory consumption, and improves model accuracy by about 10% compared with PipeDream.
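To make the partition idea concrete, the following is a minimal illustrative sketch in Python of a prefix-sum based balanced partition. It is not the paper's actual algorithm; the function name balanced_partition, the per-layer cost values, and the tie-breaking rule are assumptions made here for illustration. The sketch splits consecutive layers into pipeline stages whose cumulative costs are as even as possible by cutting where the prefix sum is closest to each ideal fraction of the total cost.

from bisect import bisect_left
from itertools import accumulate

def balanced_partition(layer_costs, num_stages):
    """Split consecutive layers into pipeline stages with roughly equal
    total cost, using a prefix sum over the per-layer costs."""
    prefix = [0] + list(accumulate(layer_costs))   # prefix[i] = cost of layers[0:i]
    total, n = prefix[-1], len(layer_costs)
    stages, start = [], 0
    for s in range(1, num_stages):
        target = total * s / num_stages            # ideal cumulative cost after stage s
        lo, hi = start + 1, n - (num_stages - s)   # keep every stage non-empty
        cut = min(max(bisect_left(prefix, target, lo, hi), lo), hi)
        # Step back one layer if that puts the cut closer to the target.
        if cut > lo and target - prefix[cut - 1] < prefix[cut] - target:
            cut -= 1
        stages.append((start, cut))
        start = cut
    stages.append((start, n))
    return stages

# Hypothetical per-layer costs (e.g., activation memory or compute time).
costs = [4, 1, 3, 2, 6, 2, 1, 5]
print(balanced_partition(costs, num_stages=3))     # -> [(0, 3), (3, 5), (5, 8)]

For the hypothetical costs above, the three stages end up with summed costs of 8, 8, and 8, which is the kind of balance a prefix-sum partition step aims for.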
References
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Cai, H., Zhu, L., Han, S.: ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018)
Huang, Y., et al.: GPipe: efficient training of giant neural networks using pipeline parallelism. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Alvarez, J.M., Salzmann, M.: Learning the number of neurons in deep networks. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
Shallue, C.J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig R., Dahl, G.E.: Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600 (2018)
Sabet, M.J., Dufter, P., Yvon, F., Schütze, H.: SimAlign: high quality word alignments without parallel training data using static and contextualized embeddings. arXiv preprint arXiv:2004.08728 (2020)
Fan, S., et al.: DAPPLE: a pipelined data parallel approach for training large models. In: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 431–445 (2021)
Zhang, M., Zhou, Y., Zhao, L., Li, H.: Transfer learning from speech synthesis to voice conversion with non-parallel training data. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1290–1302 (2021)
Dean, J., et al.: Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)
Jia, Z., Zaharia, M., Aiken, A.: Beyond data and model parallelism for deep neural networks. Proc. Mach. Learn. Syst. 1, 1–13 (2019)
Narayanan, D., Phanishayee, A., Shi, K., Chen, X., Zaharia, M.: Memory-efficient pipeline-parallel DNN training. In: International Conference on Machine Learning, pp. 7937–7947. PMLR (2021)
Moritz, P., et al.: Ray: a distributed framework for emerging AI applications. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577 (2018)
Narayanan, D., et al.: PipeDream: generalized pipeline parallelism for DNN training. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1–15 (2019)
Zhang, Q.: A novel ResNet101 model based on dense dilated convolution for image classification. SN Appl. Sci. 4, 1–13 (2022)
Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 4780–4789 (2019)
Jiang, J., Cui, B., Zhang, C., Yu, L.: Heterogeneity-aware distributed parameter servers. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 463–478 (2017)
Jia, Z., Lin, S., Qi, C.R., Aiken, A.: Exploring hidden dimensions in accelerating convolutional neural networks. In: International Conference on Machine Learning, pp. 2274–2283. PMLR (2018)
Jiang, W., et al.: A novel stochastic gradient descent algorithm based on grouping over heterogeneous cluster systems for distributed deep learning. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 391–398. IEEE (2019)
Kim, J.K., et al.: STRADS: a distributed framework for scheduled model parallel machine learning. In: Proceedings of the Eleventh European Conference on Computer Systems, pp. 1–16 (2016)
Darriba, D., Taboada, G.L., Doallo, R., Posada, D.: jModelTest 2: more models, new heuristics and parallel computing. Nat. Methods 9(8), 772 (2012)
Prosky, L., Asp, N.-G., Schweizer, T.F., Devries, J.W., Furda, I.: Determination of insoluble, soluble, and total dietary fiber in foods and food products: interlaboratory study. J. Assoc. Off. Anal. Chem. 71(5), 1017–1023 (1988)
Shen, L., Mao, Y., Wang, Z., Nie, H., Huang, J.: DNN training optimization with pipelined parallel based on feature maps encoding. In: 2022 Tenth International Conference on Advanced Cloud and Big Data (CBD), pp. 36–41. IEEE (2022)
Chen, C.-C., Yang, C.-L., Cheng, H.-Y.: Efficient and robust parallel DNN training through model parallelism on multi-GPU platform. arXiv preprint arXiv:1809.02839 (2018)
Yang, B., Zhang, J., Li, J., Ré, C., Aberger, C., De Sa, C.: PipeMare: asynchronous pipeline parallel DNN training. In: Proceedings of Machine Learning and Systems, vol. 3, pp. 269–296 (2021)
Harris, M., Sengupta, S., Owens, J.D.: Parallel prefix sum (scan) with CUDA. GPU Gems 3(39), 851–876 (2007)
Sengupta, S., Lefohn, A., Owens, J.D.: A work-efficient step-efficient prefix sum algorithm (2006)
Safari, M., Oortwijn, W., Joosten, S., Huisman, M.: Formal verification of parallel prefix sum. In: Lee, R., Jha, S., Mavridou, A., Giannakopoulou, D. (eds.) NFM 2020. LNCS, vol. 12229, pp. 170–186. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-55754-6_10
Acknowledgment
This work is supported by the project "R&D and Application of Key Technologies of an Independent and Controllable Computing Power Network" (Grant No. 2022JBZ01-01) and the Joint Fund of the Shandong Natural Science Foundation (No. ZR2022LZH010).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lv, T., Wu, L., Zhao, Z., Wang, C., Li, C. (2024). A Memory Optimization Method for Distributed Training. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1961. Springer, Singapore. https://doi.org/10.1007/978-981-99-8126-7_30
DOI: https://doi.org/10.1007/978-981-99-8126-7_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8125-0
Online ISBN: 978-981-99-8126-7