Abstract
Training a convolutional neural network requires a large number of operations and memory accesses, which easily creates a "memory wall" bottleneck and degrades computational performance and efficiency. Batch Normalization (BN) effectively speeds up the convergence of deep network training, but its complex data dependences make the "memory wall" bottleneck even more severe. To address the "memory wall" problem that arises when training convolutional neural networks with BN, we propose a training method that splits the BN layer and fuses the computation of multiple layers to reduce memory access during training. First, by reordering the "CONV+BN+RELU" (CBR) block, we trade a small amount of extra computation for fewer memory accesses, reducing the data accessed during training. Second, based on the memory access characteristics of the BN layer, the BN layer is split into two sub-layers that are fused with their adjacent layers, recombining the CBR block into "BN_B+RELU+CONV+BN_A" (BRCB); this further reduces main-memory reads and writes during training, alleviates the "memory wall" bottleneck, and improves accelerator computational efficiency. Experimental results show that when training ResNet-50, Inception V3 and DenseNet on an NVIDIA TESLA V100 GPU, the BRCB multi-layer fusion method reduces the amount of data accessed by 33%, 22% and 31%, respectively, compared with the original training method, and improves the actual computing efficiency of the V100 by 19%, 18% and 21%, respectively.
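To make the block reordering concrete, the following is a minimal sketch (not the authors' implementation) contrasting the standard CONV+BN+RELU (CBR) block with the reordered BN_B+RELU+CONV+BN_A (BRCB) block described above. It assumes that BN_A is the statistics sub-layer (a per-channel reduction of the conv output) and BN_B is the normalization sub-layer (applying the deferred normalization and affine transform using the previous block's statistics); the function and parameter names (bn_a, bn_b, gamma, beta) are illustrative only.

```python
import torch
import torch.nn.functional as F

def bn_a(x):
    """Statistics sub-layer: per-channel mean/var of the conv output (small tensors)."""
    mean = x.mean(dim=(0, 2, 3))
    var = x.var(dim=(0, 2, 3), unbiased=False)
    return mean, var

def bn_b(x, mean, var, gamma, beta, eps=1e-5):
    """Normalization sub-layer: apply the deferred normalization + affine transform."""
    return gamma[None, :, None, None] * (x - mean[None, :, None, None]) \
        / torch.sqrt(var[None, :, None, None] + eps) + beta[None, :, None, None]

def cbr_block(x, weight, gamma, beta):
    """Original ordering: CONV -> BN -> RELU (the full BN sits between conv and relu)."""
    y = F.conv2d(x, weight, padding=1)
    mean, var = bn_a(y)
    return torch.relu(bn_b(y, mean, var, gamma, beta))

def brcb_block(x, prev_stats, gamma, beta, weight):
    """Reordered block: BN_B -> RELU -> CONV -> BN_A.

    `prev_stats` are the (mean, var) produced by BN_A of the previous block,
    so normalization, ReLU and the convolution can be computed in one fused
    pass over the activation tensor instead of separate main-memory round trips.
    """
    mean, var = prev_stats
    y = F.conv2d(torch.relu(bn_b(x, mean, var, gamma, beta)), weight, padding=1)
    return y, bn_a(y)  # conv output plus the statistics consumed by the next block
```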
Cite this paper
Wang, J., Li, H. (2020). Minimizing Off-Chip Memory Access for Deep Convolutional Neural Network Training. In: Shen, H., Sang, Y. (eds) Parallel Architectures, Algorithms and Programming. PAAP 2019. Communications in Computer and Information Science, vol 1163. Springer, Singapore. https://doi.org/10.1007/978-981-15-2767-8_42