DOI: 10.1145/3492321.3519563

Out-of-order backprop: an effective scheduling technique for deep learning

Published: 28 March 2022

ABSTRACT

Neural network training requires a large amount of computation, and GPUs are therefore often used to accelerate it. While they improve performance, GPUs remain underutilized during training. This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training. By exploiting the dependencies among gradient computations, ooo backprop reorders their execution to make the most of GPU resources. We show that GPU utilization in both single- and multi-GPU training can be improved by applying ooo backprop and prioritizing critical operations. We propose three scheduling algorithms based on ooo backprop. For single-GPU training, we schedule with multi-stream ooo computation to mask the kernel launch overhead. In data-parallel training, we reorder the gradient computations to maximize the overlap of computation and parameter communication; in pipeline-parallel training, we prioritize critical gradient computations to reduce pipeline stalls. We evaluate our optimizations with twelve neural networks and five public datasets. Compared to the respective state-of-the-art training systems, our algorithms improve training throughput by 1.03--1.58× for single-GPU training, 1.10--1.27× for data-parallel training, and 1.41--1.99× for pipeline-parallel training.
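To make the scheduling insight concrete, the sketch below illustrates the dependency structure that ooo backprop exploits, assuming toy fully connected layers; the function and variable names are illustrative and this is not the authors' implementation. In backpropagation, only the chain of output (activation) gradients lies on the critical path: each layer's weight gradient depends solely on that layer's saved input and its output gradient, so it can be computed out of order, for example deferred until it can overlap with parameter communication or fill idle GPU time.

```python
# Minimal sketch (NumPy, CPU-only) of the dependency insight behind
# out-of-order backprop. Not the paper's implementation: real systems reorder
# GPU kernels and overlap them with communication; here we only separate the
# critical output-gradient chain from the deferrable weight-gradient work.
import numpy as np

def backward_out_of_order(layers, inputs, grad_out):
    """layers[i] is the weight matrix W_i of a linear layer y = x @ W_i;
    inputs[i] is the input x fed to layer i in the forward pass."""
    deferred = []  # weight-gradient work items to run later, in any order
    for i in reversed(range(len(layers))):
        W, x = layers[i], inputs[i]
        grad_in = grad_out @ W.T           # critical path: needed by layer i-1
        deferred.append((i, x, grad_out))  # weight grad needs only (x, grad_out)
        grad_out = grad_in
    # The deferred items have no dependencies among themselves, so a scheduler
    # is free to interleave them with other work (e.g., gradient all-reduce).
    return {i: x.T @ g for i, x, g in deferred}

# Toy usage: a 3-layer linear network with batch size 4 and width 8.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) for _ in range(3)]
x, inputs = rng.standard_normal((4, 8)), []
for W in layers:
    inputs.append(x)
    x = x @ W
weight_grads = backward_out_of_order(layers, inputs, rng.standard_normal((4, 8)))
```

In the paper's data-parallel and pipeline-parallel algorithms, it is exactly this freedom that allows weight-gradient computations to be reordered behind communication or critical-path work; the sketch only illustrates the dependency argument, not the multi-stream GPU scheduling itself.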

Published in

EuroSys '22: Proceedings of the Seventeenth European Conference on Computer Systems
March 2022, 783 pages
ISBN: 9781450391627
DOI: 10.1145/3492321
Copyright © 2022 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
