Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview

  • Survey
  • Journal of Computer Science and Technology

Abstract

Deep learning has become the cornerstone of artificial intelligence and plays an increasingly important role in production and everyday life. However, as the problems being tackled grow more complex, deep learning models grow ever more intricate, leading to a proliferation of large language models with an astonishing number of parameters. Pipeline model parallelism (PMP) has emerged as one of the mainstream approaches to the significant challenge of training such “big models”. This paper presents a comprehensive review of PMP. It covers the basic concepts and main challenges of PMP, comprehensively compares synchronous and asynchronous pipeline schedules for PMP approaches, and discusses the main techniques for achieving load balance in both intra-node and inter-node training. Furthermore, it presents the main techniques for optimizing computation, storage, and communication, and discusses potential research directions.
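To make the idea concrete, the sketch below illustrates GPipe-style synchronous pipeline scheduling in PyTorch. It is a minimal, hypothetical example rather than the implementation of any system surveyed here: a toy model is split into two pipeline stages, each mini-batch is split into micro-batches, and the micro-batch gradients are accumulated before a single synchronous parameter update (the pipeline “flush”). The layer sizes, tensor shapes, and hyper-parameters are arbitrary, and both stages are kept on the CPU so the sketch runs anywhere; a real deployment would place each stage on its own GPU and overlap micro-batch forward and backward passes across stages.

    # Minimal sketch of synchronous (GPipe-style) pipeline model parallelism.
    # Hypothetical example for illustration only.
    import torch
    import torch.nn as nn

    # Two pipeline stages; in practice each stage lives on a different device,
    # e.g., stage0.to("cuda:0") and stage1.to("cuda:1"), with activations sent
    # between devices. Both stay on CPU here so the example is self-contained.
    stage0 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
    stage1 = nn.Sequential(nn.Linear(64, 10))

    params = list(stage0.parameters()) + list(stage1.parameters())
    optimizer = torch.optim.SGD(params, lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(16, 32)                  # one mini-batch of 16 samples
    y = torch.randint(0, 10, (16,))
    num_micro = 4                            # split into 4 micro-batches

    optimizer.zero_grad()
    for xm, ym in zip(x.chunk(num_micro), y.chunk(num_micro)):
        act = stage0(xm)                     # stage 0 forward
        out = stage1(act)                    # stage 1 forward
        loss = loss_fn(out, ym) / num_micro  # scale so accumulated grads average
        loss.backward()                      # backward through both stages
    optimizer.step()                         # single synchronous update (flush)

Asynchronous schedules such as PipeDream replace the single flush at the end with weight updates that proceed while later micro-batches are still in flight, reducing pipeline bubbles at the cost of weight staleness.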



Acknowledgements

Lei Guan thanks Prof. Shi-Gang Li at Beijing University of Posts and Telecommunications (BUPT) for stimulating discussions about pipeline parallelism.

Author information


Corresponding author

Correspondence to Dong-Sheng Li  (李东升).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 62025208, U21A20473, U21A20513, 62076154, and 62302512, and the State Administration of Science, Technology, and Industry for National Defense of China under Grant No. WDZC20235250118.

Lei Guan received his Ph.D. degree in computer science and technology from the National University of Defense Technology (NUDT), Changsha, in 2022. He is an associate professor in the College of Science at NUDT. His research interests include deep learning, parallel computing, optimization, and AI for science.

Dong-Sheng Li received his Ph.D. degree in computer science and technology from the National University of Defense Technology (NUDT), Changsha, in 2005. He is a professor in the College of Computer at NUDT. He was awarded the Chinese National Excellent Doctoral Dissertation in 2008. His research interests include distributed systems, cloud computing, and big data processing.

Ji-Ye Liang received his Ph.D. degree in applied mathematics from Xi’an Jiaotong University, Xi’an, in 2001. He is a professor with the Key Laboratory of Computational Intelligence and Chinese Information Processing of the Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan. His research interests include artificial intelligence, granular computing, data mining, and machine learning.

Wen-Jian Wang received her Ph.D. degree in applied mathematics from Xi’an Jiaotong University, Xi’an, in 2004. Now she is a full professor and Ph.D. supervisor of the Key Laboratory of Computational Intelligence and Chinese Information Processing of the Ministry of Education, Shanxi University, Taiyuan. Her research interests include machine learning, data mining, intelligent computing, etc.

Ke-Shi Ge received his B.S. degree in computer science and technology from the Department of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, in 2015, and his M.S. and Ph.D. degrees in computer science and technology from the College of Computer, National University of Defense Technology (NUDT), Changsha, in 2017 and 2022, respectively. He is currently an assistant professor with NUDT. His research interests include high-performance computing and distributed machine learning systems.

Xi-Cheng Lu received his B.S. degree in computer science from the Harbin Military Engineering Institute, Harbin, in 1970. He is currently a professor with the College of Computer, National University of Defense Technology, Changsha. His research interests include distributed computing, computer networks, and parallel computing. He is an Academician of the Chinese Academy of Engineering.



About this article


Cite this article

Guan, L., Li, DS., Liang, JY. et al. Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview. J. Comput. Sci. Technol. 39, 567–584 (2024). https://doi.org/10.1007/s11390-024-3872-3


  • DOI: https://doi.org/10.1007/s11390-024-3872-3
