ABSTRACT
Neural network training requires a large amount of computation, and GPUs are therefore commonly used for acceleration. While they improve performance, GPUs remain underutilized during training. This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training. By exploiting the dependencies of gradient computations, ooo backprop enables their execution to be reordered so as to make the most of the GPU resources. We show that GPU utilization in both single- and multi-GPU training can be improved by applying ooo backprop and prioritizing critical operations. We propose three scheduling algorithms based on ooo backprop. For single-GPU training, we schedule with multi-stream ooo computation to mask the kernel launch overhead. In data-parallel training, we reorder the gradient computations to maximize the overlap of computation and parameter communication; in pipeline-parallel training, we prioritize critical gradient computations to reduce pipeline stalls. We evaluate our optimizations with twelve neural networks and five public datasets. Compared to the respective state-of-the-art training systems, our algorithms improve the training throughput by 1.03--1.58× for single-GPU training, by 1.10--1.27× for data-parallel training, and by 1.41--1.99× for pipeline-parallel training.
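The dependency insight behind ooo backprop can be illustrated with a minimal sketch: in the backward pass, only the chain of output (activation) gradients is on the critical path, while each layer's weight-gradient computation is needed solely for the parameter update and can therefore be deferred and scheduled out of order, e.g., to overlap with gradient communication in data-parallel training. The sketch below is illustrative only; the `Linear`, `backward_data`, `backward_weights`, and `send_gradient` names are hypothetical placeholders, not the paper's actual API or implementation.

```python
import numpy as np

class Linear:
    """Toy fully connected layer; exists only to make the sketch runnable."""
    def __init__(self, in_dim, out_dim, rng):
        self.W = rng.standard_normal((in_dim, out_dim)) * 0.01
        self.x = None                         # input cached during forward

    def forward(self, x):
        self.x = x
        return x @ self.W

    def backward_data(self, grad_out):
        # Output gradient: needed by the preceding layer (critical path).
        return grad_out @ self.W.T

    def backward_weights(self, grad_out):
        # Weight gradient: needed only for the parameter update,
        # so its execution can be deferred and reordered.
        return self.x.T @ grad_out


def ooo_backprop(layers, loss_grad, send_gradient):
    """Backward pass that runs the critical output-gradient chain first and
    defers all weight-gradient computations so a scheduler may reorder them
    (e.g., to overlap them with parameter communication)."""
    deferred = []
    grad = loss_grad
    for layer in reversed(layers):
        deferred.append((layer, grad))        # postpone dW for this layer
        grad = layer.backward_data(grad)      # advance the critical path

    # Off the critical path: compute weight gradients in a scheduler-chosen
    # order; here, earliest layers first so their (hypothetical) all-reduce
    # could start as soon as possible.
    for layer, grad_out in reversed(deferred):
        send_gradient(layer, layer.backward_weights(grad_out))


# Minimal usage example.
rng = np.random.default_rng(0)
net = [Linear(8, 16, rng), Linear(16, 4, rng)]
x = rng.standard_normal((2, 8))
for layer in net:
    x = layer.forward(x)
ooo_backprop(net, np.ones_like(x), lambda l, dW: print("dW", dW.shape))
```

In an actual training system the deferred weight-gradient kernels would be issued on separate GPU streams or interleaved with communication, which is the reordering freedom the three scheduling algorithms in the abstract exploit.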