Abstract
Parallel training of large-scale networks has attracted attention from both the artificial intelligence and high-performance distributed systems communities. One efficient form of parallelism is the micro-batch-based pipeline, e.g., GPipe. Building on GPipe, we derive a time-cost model from the basic time function of layers, which accounts for computing time and communication time simultaneously and treats both as nonlinear in batch size. Focusing on optimal network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove that IMD attains appreciable theoretical optimality. Extensive experiments on both CNN- and Transformer-based networks further demonstrate that CSIMD obtains optimal network division and data partition schemes under GPipe parallelism: CSIMD achieves training speeds \(2.0\times\) and \(2.5\times\) faster than GPipe-R and GPipe-E, respectively, on CNNs, and \(1.5\times\) and \(1.6\times\) faster on Transformers.
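To make the idea of a nonlinear time-cost model and a dichotomy-based search concrete, the sketch below pairs a toy per-layer time function (compute plus communication, both depending on micro-batch size) with a one-dimensional dichotomy that balances contiguous pipeline stages by binary-searching the bottleneck stage cost. All coefficients, function names, and the cost form are illustrative assumptions; this is not the authors' CSIMD/IMD implementation, which searches jointly over multiple dimensions (network division and data partition).

```python
# Illustrative sketch only: a hypothetical nonlinear per-layer time model and a
# one-dimensional dichotomy (binary search on the bottleneck stage cost) for
# dividing layers into contiguous pipeline stages. Not the authors' CSIMD/IMD.

def layer_time(layer, micro_batch):
    """Per-layer cost = fixed overhead + nonlinear compute term + comm term.
    The coefficients are made-up placeholder values."""
    a, c, alpha, comm = layer  # overhead, compute coeff, nonlinearity, comm coeff
    return a + c * (micro_batch ** alpha) + comm * micro_batch


def feasible(costs, num_stages, limit):
    """Greedy check: can the layers be cut into <= num_stages contiguous
    stages so that every stage's total cost stays within `limit`?"""
    stages, current = 1, 0.0
    for t in costs:
        if t > limit:
            return False
        if current + t > limit:
            stages += 1
            current = t
        else:
            current += t
    return stages <= num_stages


def divide_by_dichotomy(layers, num_stages, micro_batch, eps=1e-3):
    """Binary-search the smallest achievable bottleneck stage cost."""
    costs = [layer_time(l, micro_batch) for l in layers]
    lo, hi = max(costs), sum(costs)
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(costs, num_stages, mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Four hypothetical layers: (overhead, compute coeff, exponent, comm coeff)
    layers = [(0.1, 0.02, 1.2, 0.01), (0.2, 0.05, 1.1, 0.02),
              (0.1, 0.03, 1.3, 0.01), (0.3, 0.04, 1.0, 0.02)]
    print(divide_by_dichotomy(layers, num_stages=2, micro_batch=8))
```

The same dichotomy could, in principle, be nested over micro-batch size as well, which is the flavor of cross-searching the abstract describes; the single-dimension version above is only meant to illustrate the mechanism.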
Acknowledgments
This research is supported by the National Key Research and Development Program of China (Grant No. 2018AAA0103203), the Key Research and Development Program of Sichuan Province (Grant No. 2021YFG0325), and a Technical Cooperation Project of Huawei (Grant No. H04W220751). We would like to thank the Huawei MindSpore team for their assistance and support.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, G., Lan, H., Xie, Y., Tian, W., Qian, J., Su, T. (2024). CSIMD: Cross-Search Algorithm with Improved Multi-dimensional Dichotomy for Micro-Batch-Based Pipeline Parallel Training in DNN. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14802. Springer, Cham. https://doi.org/10.1007/978-3-031-69766-1_20
DOI: https://doi.org/10.1007/978-3-031-69766-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69765-4
Online ISBN: 978-3-031-69766-1