Abstract
This paper presents a comprehensive suite of techniques for optimized memory management in multi-GPU systems that accelerates the execution of deep learning applications. We employ hybrid utilization of GPU and CPU memory in a multi-GPU environment while effectively addressing contention on the shared interconnect (e.g., PCIe, NVLink). In addition, we design and implement an intelligent prefetching algorithm (from CPU memory to GPU memory) that achieves the highest processing throughput while sustaining a large mini-batch size. We implemented our optimization techniques in TensorFlow and performed extensive experiments in various multi-GPU environments, including traditional PCIe and the latest high-bandwidth interconnect, NVLink. The evaluation results show that the proposed scheme improves computing performance by reducing the I/O bottleneck and effectively increases the mini-batch size without sacrificing overall training throughput.
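The core idea behind such prefetching is to overlap the staging of the next mini-batch (e.g., a host-to-device copy) with the computation on the current one, so the GPU never idles waiting for data. The sketch below illustrates this pattern with a background producer thread and a bounded queue; it is a minimal conceptual model only, not the authors' TensorFlow implementation, and the function name `prefetching_loader` and its parameters are hypothetical.

```python
import queue
import threading

def prefetching_loader(batches, capacity=2):
    """Wrap an iterable of mini-batches so that upcoming batches are
    staged by a background thread while the current batch is consumed.

    This mimics CPU-to-GPU prefetching: the staging step (here, simply
    enqueueing; in practice, an asynchronous host-to-device transfer)
    overlaps with downstream compute. `capacity` bounds how far ahead
    the producer may run, limiting staging-buffer memory use.
    """
    q = queue.Queue(maxsize=capacity)
    sentinel = object()  # marks the end of the epoch

    def producer():
        for b in batches:
            q.put(b)      # blocks once `capacity` batches are staged
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        b = q.get()
        if b is sentinel:
            break
        yield b
```

The bounded queue is the key design choice: it caps how much CPU (or staging) memory prefetching may consume, which matters when the interconnect is shared and the point of the scheme is to sustain large mini-batches without overcommitting memory.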
Acknowledgements
This research was supported by Basic Science Research Program (NRF-2019R1H1A2039658), and Next-Generation Information Computing Development Program (NRF-2015M3C4A7065646) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT, and partly supported by the GRRC program of Gyeonggi province (No. GRRC-KAU-2018-B01, “Study on the Video and Space Convergence Platform for 360VR Services”) and IT R&D program of MOTIE/KEIT (10076476, Scalable Machine Learning Acceleration Hardware Technology for Big-Data Servers).
Cite this article
Kim, Y., Lee, J., Kim, JS. et al. Comprehensive techniques of multi-GPU memory optimization for deep learning acceleration. Cluster Comput 23, 2193–2204 (2020). https://doi.org/10.1007/s10586-019-02974-6