
Comprehensive techniques of multi-GPU memory optimization for deep learning acceleration


Abstract

This paper presents a comprehensive suite of techniques for optimized memory management in multi-GPU systems to accelerate deep learning applications. We employ GPU and CPU memory in a hybrid fashion in a multi-GPU environment, effectively addressing contention on the shared interconnect (e.g., PCIe, NVLink). In addition, we design and implement an intelligent prefetching algorithm (from CPU memory to GPU) that achieves the highest processing throughput while sustaining a large mini-batch size. We implemented our optimization techniques in TensorFlow and performed extensive experiments in various multi-GPU environments, covering both traditional PCIe and the latest high-bandwidth interconnect, NVLink. Evaluation results show that our proposed scheme improves computing performance by reducing the I/O bottleneck and effectively increases the mini-batch size without sacrificing overall training throughput.
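The abstract only sketches the approach, but the core idea behind CPU-to-GPU prefetching can be illustrated independently of the paper's implementation: tensors offloaded to page-locked (pinned) CPU memory are copied back to the GPU on a dedicated copy stream so that the transfer for layer l+1 overlaps the computation of layer l. The following minimal CUDA sketch demonstrates that overlap pattern; all names (layer_forward, h_offload, buffer sizes, layer count) are hypothetical and are not taken from the authors' TensorFlow implementation.

```cuda
// Minimal sketch of CPU-to-GPU prefetching overlapped with compute.
// Illustrative only; not the authors' code.
#include <cuda_runtime.h>
#include <cstdio>

#define N_LAYERS 4
#define TENSOR_ELEMS (1 << 20)

// Stand-in for a real layer's forward computation.
__global__ void layer_forward(float *acts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) acts[i] = acts[i] * 0.5f + 1.0f;
}

int main() {
    float *h_offload[N_LAYERS];  // tensors previously offloaded to pinned CPU memory
    float *d_tensor[N_LAYERS];   // GPU buffers the layers compute on
    cudaStream_t compute, prefetch;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&prefetch);

    for (int l = 0; l < N_LAYERS; ++l) {
        // Pinned (page-locked) host memory is required for truly
        // asynchronous host-to-device copies.
        cudaMallocHost((void **)&h_offload[l], TENSOR_ELEMS * sizeof(float));
        cudaMalloc((void **)&d_tensor[l], TENSOR_ELEMS * sizeof(float));
    }

    // Bring in the first layer's tensor before compute starts.
    cudaMemcpyAsync(d_tensor[0], h_offload[0], TENSOR_ELEMS * sizeof(float),
                    cudaMemcpyHostToDevice, prefetch);
    cudaStreamSynchronize(prefetch);

    for (int l = 0; l < N_LAYERS; ++l) {
        // Prefetch layer l+1 on the copy stream while layer l computes.
        if (l + 1 < N_LAYERS)
            cudaMemcpyAsync(d_tensor[l + 1], h_offload[l + 1],
                            TENSOR_ELEMS * sizeof(float),
                            cudaMemcpyHostToDevice, prefetch);

        layer_forward<<<(TENSOR_ELEMS + 255) / 256, 256, 0, compute>>>(
            d_tensor[l], TENSOR_ELEMS);

        // Layer l+1 must not start until its prefetch has landed; the
        // kernel above keeps running on the compute stream meanwhile.
        cudaStreamSynchronize(prefetch);
    }
    cudaStreamSynchronize(compute);

    for (int l = 0; l < N_LAYERS; ++l) {
        cudaFreeHost(h_offload[l]);
        cudaFree(d_tensor[l]);
    }
    cudaStreamDestroy(compute);
    cudaStreamDestroy(prefetch);
    printf("done\n");
    return 0;
}
```

The pinned buffers and the separate copy stream are the load-bearing choices here: copies from pageable memory serialize with kernel execution, defeating the overlap. A full system along the lines the abstract describes would additionally coordinate concurrent transfers from multiple GPUs so they do not contend on a shared PCIe switch or NVLink path, which this single-GPU sketch does not attempt.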





Acknowledgements

This research was supported by the Basic Science Research Program (NRF-2019R1H1A2039658) and the Next-Generation Information Computing Development Program (NRF-2015M3C4A7065646) through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT, and partly supported by the GRRC program of Gyeonggi Province (No. GRRC-KAU-2018-B01, "Study on the Video and Space Convergence Platform for 360VR Services") and the IT R&D program of MOTIE/KEIT (10076476, Scalable Machine Learning Acceleration Hardware Technology for Big-Data Servers).

Author information

Corresponding author

Correspondence to Jaehwan Lee.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kim, Y., Lee, J., Kim, JS. et al. Comprehensive techniques of multi-GPU memory optimization for deep learning acceleration. Cluster Comput 23, 2193–2204 (2020). https://doi.org/10.1007/s10586-019-02974-6


