
CSIMD: Cross-Search Algorithm with Improved Multi-dimensional Dichotomy for Micro-Batch-Based Pipeline Parallel Training in DNN

  • Conference paper
Euro-Par 2024: Parallel Processing (Euro-Par 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14802)

Abstract

Parallel training of large-scale networks has attracted attention from both the artificial intelligence and high-performance distributed systems communities. One efficient form of parallelism is the micro-batch-based pipeline, e.g., GPipe. Building on GPipe, we derive a time-cost model from the basic time functions of layers; the model accounts for computing time and communication time simultaneously and treats both as nonlinear in the batch size. Focusing on optimal network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove that IMD has appreciable theoretical optimality. Extensive experiments on both CNN- and Transformer-based networks demonstrate that CSIMD obtains optimal network division and data partition schemes under GPipe parallelism: CSIMD achieves training speeds \(2.0\times \) and \(2.5\times \) faster than GPipe-R and GPipe-E, respectively, on CNNs, and \(1.5\times \) and \(1.6\times \) faster on Transformers.
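
The paper's actual cost model and the CSIMD algorithm are not reproduced on this page, so the sketch below is only a rough illustration of the ideas the abstract names, under stated assumptions: GPipe-style pipeline time is estimated as (number of micro-batches + number of stages - 1) slots, each bounded by the slowest stage's compute plus inter-stage communication at the micro-batch size, and the micro-batch count is then chosen by a simple one-dimensional dichotomy-style (ternary) search. The per-stage time functions, the communication model, the unimodality assumption, and all names below are hypothetical placeholders, not the paper's formulation.

```python
# Illustrative sketch only (not the paper's model or algorithm): the stage and
# communication time functions, the unimodality assumption, and all names here
# are hypothetical.
from typing import Callable, List, Tuple


def pipeline_time(stage_times: List[Callable[[int], float]],
                  comm_time: Callable[[int], float],
                  batch_size: int,
                  num_micro: int) -> float:
    """GPipe-style estimate: (M + S - 1) pipeline slots, each bounded by the
    slowest stage's compute plus inter-stage communication at size B // M."""
    micro = max(1, batch_size // num_micro)
    slot = max(t(micro) for t in stage_times) + comm_time(micro)
    return (num_micro + len(stage_times) - 1) * slot


def dichotomy_search(cost: Callable[[int], float], lo: int, hi: int) -> Tuple[int, float]:
    """One-dimensional dichotomy-style (ternary) search over the micro-batch
    count, assuming the cost is roughly unimodal in M (an assumption)."""
    while hi - lo > 2:
        m1, m2 = lo + (hi - lo) // 3, hi - (hi - lo) // 3
        if cost(m1) <= cost(m2):
            hi = m2  # the minimum cannot lie strictly to the right of m2
        else:
            lo = m1  # the minimum cannot lie strictly to the left of m1
    best = min(range(lo, hi + 1), key=cost)
    return best, cost(best)


if __name__ == "__main__":
    # Hypothetical nonlinear per-stage compute times and a linear communication time.
    stages = [lambda b, c=c: 0.5 * c * b ** 1.2 + 1.0 for c in (1.0, 1.3, 0.8, 1.1)]
    comm = lambda b: 0.05 * b + 0.2
    m, t = dichotomy_search(lambda M: pipeline_time(stages, comm, 1024, M), 1, 256)
    print(f"best micro-batch count: {m}, estimated pipeline time: {t:.2f}")
```

In the paper the search is multi-dimensional, jointly covering network division across pipeline stages and data partition into micro-batches; the single-variable version above is meant only to convey the dichotomy/cross-search flavor.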



Acknowledgments

This research is supported by the National Key Research and Development Program of China (Grant ID 2018AAA0103203), the Key Research and Development Program of Sichuan Province (Grant ID 2021YFG0325), and the Technical Cooperation Project of Huawei (Grant ID H04W220751). We would like to thank the Huawei MindSpore team for their assistance and support.

Author information


Corresponding author

Correspondence to Wenhong Tian.


Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhou, G., Lan, H., Xie, Y., Tian, W., Qian, J., Su, T. (2024). CSIMD: Cross-Search Algorithm with Improved Multi-dimensional Dichotomy for Micro-Batch-Based Pipeline Parallel Training in DNN. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14802. Springer, Cham. https://doi.org/10.1007/978-3-031-69766-1_20

  • DOI: https://doi.org/10.1007/978-3-031-69766-1_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-69765-4

  • Online ISBN: 978-3-031-69766-1

  • eBook Packages: Computer Science, Computer Science (R0)
