Abstract
Graphics Processing Units (GPUs) have evolved into powerful accelerators for developing Convolutional Neural Network (CNN) models. Most existing GPU-based frameworks adopt a kernel-based execution approach and focus only on optimizing individual kernels for better performance and resource utilization. Under this approach, the kernels involved are launched sequentially, which may leave GPU resources underutilized because the optimization space of a single kernel is limited. In this paper, we propose an efficient software parallelization framework, called HGP4CNN, that accelerates the training of CNN models by considering the characteristics of workloads from both the same layer and adjacent layers, as well as new GPU features such as concurrent kernel execution. At the intra-layer level, to improve the training performance of a single network layer, we design a novel model-based lightweight parallelization module that makes better use of the concurrent kernel execution feature on modern GPUs. An asynchronous resource tracker collects kernel information at runtime, and a kernel analyzer calculates the number of kernels that can be dispatched concurrently. Moreover, to avoid consuming too many CPU threads or process resources, we integrate a runtime scheduler module for kernel launch and a pool-based stream manager for GPU work-queue management. At the inter-layer level, we present a pipeline execution strategy that overlaps the processing of workloads from adjacent layers; the number of samples processed by a single pipeline stage is determined using the analysis result from the intra-layer module. Finally, we implement a prototype of the proposed framework on Caffe, a well-known deep learning framework, and conduct experiments with four off-the-shelf CNN models on three NVIDIA GPUs.
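To make the kernel-analyzer idea concrete, the following is a minimal, hedged sketch of how one might estimate the number of kernels dispatchable concurrently from runtime resource information. All names (`KernelInfo`, `GpuLimits`, `blocks_per_sm`, `concurrent_kernels`) and the occupancy-style capacity model are illustrative assumptions, not the paper's actual implementation, which tracks resources asynchronously on the device.

```python
# Illustrative sketch (not HGP4CNN's actual code): bound the number of
# kernels that can run concurrently by how many thread blocks fit on the
# device given each kernel's per-block resource footprint.
from dataclasses import dataclass

@dataclass
class KernelInfo:
    blocks: int             # grid size (number of thread blocks)
    threads_per_block: int
    regs_per_thread: int
    smem_per_block: int     # bytes of shared memory per block

@dataclass
class GpuLimits:
    num_sms: int            # streaming multiprocessors on the device
    max_threads_per_sm: int
    regs_per_sm: int
    smem_per_sm: int

def blocks_per_sm(k: KernelInfo, g: GpuLimits) -> int:
    """Occupancy-style bound: blocks of kernel k that fit on one SM."""
    by_threads = g.max_threads_per_sm // k.threads_per_block
    by_regs = g.regs_per_sm // (k.regs_per_thread * k.threads_per_block)
    by_smem = (g.smem_per_sm // k.smem_per_block
               if k.smem_per_block else by_threads)
    return max(1, min(by_threads, by_regs, by_smem))

def concurrent_kernels(kernels, g: GpuLimits) -> int:
    """Greedily count queued kernels that fit on the device at once,
    assuming resident blocks share the device-wide block capacity."""
    used = 0
    count = 0
    for k in kernels:
        cap = g.num_sms * blocks_per_sm(k, g)  # slots for this footprint
        if used + k.blocks <= cap:
            used += k.blocks
            count += 1
        else:
            break
    return count
```

Under this simplified model, a convolutional layer whose kernels each occupy a quarter of the device's block capacity would yield an analyzer result of four concurrent kernels; the real module refines such estimates with information gathered by the asynchronous resource tracker.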
Results show that HGP4CNN achieves better performance than the original implementation while preserving the convergence properties of the networks, with speedups of up to 6.X for a single convolutional layer and 2.X for multiple layers executed within a network's pipelines.
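The benefit of the inter-layer pipeline can be sketched with a small, hedged cost model: if a mini-batch is split into micro-batches so that adjacent layers overlap, total time drops from the sum over layers of whole-batch work to pipeline fill time plus steady-state time on the slowest stage. The function names and per-sample costs below are illustrative assumptions, not measurements from the paper.

```python
# Illustrative pipeline cost model (not HGP4CNN's actual scheduler):
# compare whole-batch sequential execution against micro-batch pipelining.

def sequential_time(layer_costs, batch):
    """Each layer processes the whole batch before the next layer starts."""
    return sum(c * batch for c in layer_costs)

def pipelined_time(layer_costs, batch, micro):
    """Classic pipeline bound: fill the pipeline once, then advance one
    micro-batch per tick of the slowest stage."""
    n = -(-batch // micro)                    # number of micro-batches (ceil)
    stage = [c * micro for c in layer_costs]  # per-stage micro-batch cost
    return sum(stage) + (n - 1) * max(stage)
```

For example, three layers with per-sample costs [1, 2, 1] and a batch of 8 take 32 time units sequentially but only 20 with micro-batches of 2; choosing the micro-batch (stage) size is where the intra-layer analysis result feeds in.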
Acknowledgements
This work is sponsored by the National Natural Science Foundation of China (61972277) and Tianjin Natural Science Foundation (18JCZDJC30800). Ce Yu is supported by the Joint Research Fund in Astronomy (U1731243, U1931130) under cooperative agreement between the National Natural Science Foundation of China (NSFC) and Chinese Academy of Sciences (CAS). Bingsheng He is supported by a MoE AcRF Tier 1 grant (T1 251RES1610) and an NUS startup grant in Singapore.
Cite this article
Fu, H., Tang, S., He, B. et al. HGP4CNN: an efficient parallelization framework for training convolutional neural networks on modern GPUs. J Supercomput 77, 12741–12770 (2021). https://doi.org/10.1007/s11227-021-03746-z