
HGP4CNN: an efficient parallelization framework for training convolutional neural networks on modern GPUs

The Journal of Supercomputing

Abstract

Graphics Processing Units (GPUs) have evolved into powerful accelerators for developing Convolutional Neural Network (CNN) models. Most existing GPU-based frameworks adopt a kernel-based execution approach and focus only on optimizing individual kernels for better performance and resource utilization. With this approach, the kernels involved are launched sequentially, which may underutilize GPU resources because the optimization space of a single kernel is limited. In this paper, we propose an efficient software parallelization framework, called HGP4CNN, that accelerates the training of CNN models by considering the characteristics of workloads from both the same layer and adjacent layers, as well as newer GPU features such as concurrent kernel execution. At the intra-layer level, to improve the training performance of a single network layer, we design a novel model-based lightweight parallelization module that makes better use of the concurrent kernel execution feature of modern GPUs. An asynchronous resource tracker collects kernel information at runtime, and a kernel analyzer calculates the number of kernels that can be dispatched concurrently. Moreover, to avoid consuming too many CPU thread or process resources, we integrate a runtime scheduler module for kernel launching and a pool-based stream manager for GPU work-queue management. At the inter-layer level, we present a pipeline execution strategy that overlaps the processing of workloads from adjacent layers; the analysis result from the intra-layer module is used to determine the number of samples processed by a single pipeline stage. Finally, we implement a prototype of the proposed framework on Caffe, a well-known deep learning framework, and conduct experiments with four off-the-shelf CNN models on three NVIDIA GPUs. The results show that HGP4CNN achieves better performance than the original implementation while preserving the convergence properties of the networks. We achieve a speedup of up to 6.X for a single convolutional layer and 2.X for multiple layers executed within pipelines of a network model.
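As a concrete illustration of the two parallelism levels described above, the following CUDA C++ sketch (written for this summary, not taken from the HGP4CNN sources) shows how a small pool of reusable streams can dispatch the work of independent sub-batches concurrently and let the kernels of adjacent layers overlap across sub-batches. The kernel bodies, the pool size, and the sub-batch count num_chunks are hypothetical placeholders; in HGP4CNN these quantities would instead be derived from the asynchronous resource tracker and kernel analyzer mentioned in the abstract.

#include <cuda_runtime.h>
#include <vector>

// Stand-in kernels for the GPU work of two adjacent layers; the real layers
// in a CNN would be convolution, pooling, activation, etc.
__global__ void layer_a_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];                      // placeholder "layer i" op
}
__global__ void layer_b_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] > 0.0f ? data[i] : 0.0f;  // placeholder "layer i+1" op
}

// Minimal pool-based stream manager: streams are created once and handed out
// round-robin, so dispatching work never creates new streams or CPU threads.
class StreamPool {
public:
    explicit StreamPool(int size) : streams_(size) {
        for (auto& s : streams_) cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    }
    ~StreamPool() { for (auto& s : streams_) cudaStreamDestroy(s); }
    cudaStream_t next() { return streams_[idx_++ % streams_.size()]; }
private:
    std::vector<cudaStream_t> streams_;
    size_t idx_ = 0;
};

int main() {
    const int batch_elems = 1 << 22;   // whole mini-batch, flattened for simplicity
    const int num_chunks  = 4;         // hypothetical value; HGP4CNN would compute this
    const int chunk       = batch_elems / num_chunks;

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  batch_elems * sizeof(float));
    cudaMalloc(&out, batch_elems * sizeof(float));

    StreamPool pool(num_chunks);
    const int threads = 256;
    const int blocks  = (chunk + threads - 1) / threads;

    // Each sub-batch goes to its own stream. Within a stream, layer i+1 is queued
    // right behind layer i for the same sub-batch; across streams, layer i+1 of
    // sub-batch k can overlap layer i of sub-batch k+1. The independent streams
    // give concurrent kernel execution; the per-sub-batch ordering gives the
    // pipeline-style overlap of adjacent layers.
    for (int k = 0; k < num_chunks; ++k) {
        cudaStream_t s = pool.next();
        layer_a_kernel<<<blocks, threads, 0, s>>>(in + k * chunk, out + k * chunk, chunk);
        layer_b_kernel<<<blocks, threads, 0, s>>>(out + k * chunk, chunk);
    }

    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}

The design point the sketch mirrors is that streams are pre-created and reused rather than allocated per launch, which is the motivation the abstract gives for a pool-based stream manager and a runtime scheduler instead of extra CPU threads or processes.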


Notes

  1. https://gitee.com/tju_haibo/HGP4CNN-Caffe.


Acknowledgements

This work is sponsored by the National Natural Science Foundation of China (61972277) and Tianjin Natural Science Foundation (18JCZDJC30800). Ce Yu is supported by the Joint Research Fund in Astronomy (U1731243, U1931130) under cooperative agreement between the National Natural Science Foundation of China (NSFC) and Chinese Academy of Sciences (CAS). Bingsheng He is supported by a MoE AcRF Tier 1 grant (T1 251RES1610) and an NUS startup grant in Singapore.

Author information

Corresponding authors

Correspondence to Shanjiang Tang or Ce Yu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Fu, H., Tang, S., He, B. et al. HGP4CNN: an efficient parallelization framework for training convolutional neural networks on modern GPUs. J Supercomput 77, 12741–12770 (2021). https://doi.org/10.1007/s11227-021-03746-z

