Abstract
Parallel training of large-scale networks has attracted attention from both the artificial intelligence and high-performance distributed systems communities. One efficient form of parallelism is the micro-batch-based pipeline, e.g., GPipe. Building on GPipe, we derive a time-cost model from the basic time function of layers, which accounts for computing time and communication time simultaneously and treats both as nonlinear in batch size. Focusing on optimal network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove that IMD attains appreciable theoretical optimality. Extensive experiments on both CNN- and Transformer-based networks further demonstrate that CSIMD obtains optimal network division and data partition schemes under GPipe parallelism: CSIMD achieves training speeds \(2.0\times\) and \(2.5\times\) faster than GPipe-R and GPipe-E, respectively, on CNNs, and \(1.5\times\) and \(1.6\times\) faster on Transformers.
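To make the idea of a nonlinear time-cost model and a dichotomy-based search concrete, the sketch below pairs a toy per-layer time function (compute plus communication, both depending on micro-batch size) with a one-dimensional dichotomy that balances contiguous pipeline stages by binary-searching the bottleneck stage cost. All coefficients, function names, and the cost form are illustrative assumptions; this is not the authors' CSIMD/IMD implementation, which searches jointly over multiple dimensions (network division and data partition).

```python
# Illustrative sketch only: a hypothetical nonlinear per-layer time model and a
# one-dimensional dichotomy (binary search on the bottleneck stage cost) for
# dividing layers into contiguous pipeline stages. Not the authors' CSIMD/IMD.

def layer_time(layer, micro_batch):
    """Per-layer cost = fixed overhead + nonlinear compute term + comm term.
    The coefficients are made-up placeholder values."""
    a, c, alpha, comm = layer  # overhead, compute coeff, nonlinearity, comm coeff
    return a + c * (micro_batch ** alpha) + comm * micro_batch


def feasible(costs, num_stages, limit):
    """Greedy check: can the layers be cut into <= num_stages contiguous
    stages so that every stage's total cost stays within `limit`?"""
    stages, current = 1, 0.0
    for t in costs:
        if t > limit:
            return False
        if current + t > limit:
            stages += 1
            current = t
        else:
            current += t
    return stages <= num_stages


def divide_by_dichotomy(layers, num_stages, micro_batch, eps=1e-3):
    """Binary-search the smallest achievable bottleneck stage cost."""
    costs = [layer_time(l, micro_batch) for l in layers]
    lo, hi = max(costs), sum(costs)
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(costs, num_stages, mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Four hypothetical layers: (overhead, compute coeff, exponent, comm coeff)
    layers = [(0.1, 0.02, 1.2, 0.01), (0.2, 0.05, 1.1, 0.02),
              (0.1, 0.03, 1.3, 0.01), (0.3, 0.04, 1.0, 0.02)]
    print(divide_by_dichotomy(layers, num_stages=2, micro_batch=8))
```

The same dichotomy could, in principle, be nested over micro-batch size as well, which is the flavor of cross-searching the abstract describes; the single-dimension version above is only meant to illustrate the mechanism.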
Acknowledgments
This research is supported by the National Key Research and Development Program of China (Grant No. 2018AAA0103203), the Key Research and Development Program of Sichuan Province (Grant No. 2021YFG0325), and a Technical Cooperation Project of Huawei (Grant No. H04W220751). We would like to thank the Huawei MindSpore team for their assistance and support.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, G., Lan, H., Xie, Y., Tian, W., Qian, J., Su, T. (2024). CSIMD: Cross-Search Algorithm with Improved Multi-dimensional Dichotomy for Micro-Batch-Based Pipeline Parallel Training in DNN. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14802. Springer, Cham. https://doi.org/10.1007/978-3-031-69766-1_20
DOI: https://doi.org/10.1007/978-3-031-69766-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69765-4
Online ISBN: 978-3-031-69766-1