
Comprehensive techniques of multi-GPU memory optimization for deep learning acceleration


Abstract

This paper presents a comprehensive suite of techniques for optimized memory management in multi-GPU systems to accelerate deep learning applications. We employ GPU and CPU memory in a hybrid fashion in a multi-GPU environment, effectively addressing contention on the shared interconnect (e.g., PCIe, NVLink). In addition, we design and implement an intelligent prefetching algorithm (from CPU memory to GPU) that achieves the highest processing throughput while sustaining a large mini-batch size. We implemented our optimization techniques in TensorFlow and performed extensive experiments in various multi-GPU environments, covering both traditional PCIe and the latest high-bandwidth interconnect, NVLink. Evaluation results show that our proposed scheme improves computing performance by reducing the I/O bottleneck and effectively increases the mini-batch size without sacrificing overall training throughput.
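The abstract only sketches the approach, but the core idea behind CPU-to-GPU prefetching can be illustrated independently of the paper's implementation: tensors offloaded to page-locked (pinned) CPU memory are copied back to the GPU on a dedicated copy stream so that the transfer for layer l+1 overlaps the computation of layer l. The following minimal CUDA sketch demonstrates that overlap pattern; all names (layer_forward, h_offload, buffer sizes, layer count) are hypothetical and are not taken from the authors' TensorFlow implementation.

```cuda
// Minimal sketch of CPU-to-GPU prefetching overlapped with compute.
// Illustrative only; not the authors' code.
#include <cuda_runtime.h>
#include <cstdio>

#define N_LAYERS 4
#define TENSOR_ELEMS (1 << 20)

// Stand-in for a real layer's forward computation.
__global__ void layer_forward(float *acts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) acts[i] = acts[i] * 0.5f + 1.0f;
}

int main() {
    float *h_offload[N_LAYERS];  // tensors previously offloaded to pinned CPU memory
    float *d_tensor[N_LAYERS];   // GPU buffers the layers compute on
    cudaStream_t compute, prefetch;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&prefetch);

    for (int l = 0; l < N_LAYERS; ++l) {
        // Pinned (page-locked) host memory is required for truly
        // asynchronous host-to-device copies.
        cudaMallocHost((void **)&h_offload[l], TENSOR_ELEMS * sizeof(float));
        cudaMalloc((void **)&d_tensor[l], TENSOR_ELEMS * sizeof(float));
    }

    // Bring in the first layer's tensor before compute starts.
    cudaMemcpyAsync(d_tensor[0], h_offload[0], TENSOR_ELEMS * sizeof(float),
                    cudaMemcpyHostToDevice, prefetch);
    cudaStreamSynchronize(prefetch);

    for (int l = 0; l < N_LAYERS; ++l) {
        // Prefetch layer l+1 on the copy stream while layer l computes.
        if (l + 1 < N_LAYERS)
            cudaMemcpyAsync(d_tensor[l + 1], h_offload[l + 1],
                            TENSOR_ELEMS * sizeof(float),
                            cudaMemcpyHostToDevice, prefetch);

        layer_forward<<<(TENSOR_ELEMS + 255) / 256, 256, 0, compute>>>(
            d_tensor[l], TENSOR_ELEMS);

        // Layer l+1 must not start until its prefetch has landed; the
        // kernel above keeps running on the compute stream meanwhile.
        cudaStreamSynchronize(prefetch);
    }
    cudaStreamSynchronize(compute);

    for (int l = 0; l < N_LAYERS; ++l) {
        cudaFreeHost(h_offload[l]);
        cudaFree(d_tensor[l]);
    }
    cudaStreamDestroy(compute);
    cudaStreamDestroy(prefetch);
    printf("done\n");
    return 0;
}
```

The pinned buffers and the separate copy stream are the load-bearing choices here: copies from pageable memory serialize with kernel execution, defeating the overlap. A full system along the lines the abstract describes would additionally coordinate concurrent transfers from multiple GPUs so they do not contend on a shared PCIe switch or NVLink path, which this single-GPU sketch does not attempt.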





Acknowledgements

This research was supported by the Basic Science Research Program (NRF-2019R1H1A2039658) and the Next-Generation Information Computing Development Program (NRF-2015M3C4A7065646) through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT, and partly supported by the GRRC program of Gyeonggi Province (No. GRRC-KAU-2018-B01, "Study on the Video and Space Convergence Platform for 360VR Services") and the IT R&D program of MOTIE/KEIT (10076476, Scalable Machine Learning Acceleration Hardware Technology for Big-Data Servers).

Author information

Corresponding author

Correspondence to Jaehwan Lee.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kim, Y., Lee, J., Kim, JS. et al. Comprehensive techniques of multi-GPU memory optimization for deep learning acceleration. Cluster Comput 23, 2193–2204 (2020). https://doi.org/10.1007/s10586-019-02974-6


