Abstract
Training a convolutional neural network requires a large number of operations and memory accesses, which easily creates a "memory wall" bottleneck and degrades computational performance and efficiency. Batch Normalization (BN) effectively speeds up the convergence of deep network training, but its complex data dependences make the "memory wall" bottleneck even more severe. To address the "memory wall" problem that arises when training convolutional neural networks with BN, we propose a training method that splits the BN layer and fuses the computation of multiple layers to reduce memory access during training. First, by reordering the "CONV+BN+RELU" (CBR) block, we trade a small amount of extra computation for fewer memory accesses, reducing the data accessed during training. Second, based on the memory access characteristics of the BN layer, the BN layer is split into two sub-layers that are fused with their adjacent layers, recombining the CBR block into "BN_B+RELU+CONV+BN_A" (BRCB); this further reduces main-memory reads and writes during training, alleviates the "memory wall" bottleneck, and improves accelerator computational efficiency. Experimental results show that when training ResNet-50, Inception V3 and DenseNet on an NVIDIA TESLA V100 GPU, the BRCB multi-layer fusion method reduces the amount of data accessed by 33%, 22% and 31%, respectively, compared with the original training method, and improves the actual computing efficiency of the V100 by 19%, 18% and 21%, respectively.
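To make the block reordering concrete, the following is a minimal sketch (not the authors' implementation) contrasting the standard CONV+BN+RELU (CBR) block with the reordered BN_B+RELU+CONV+BN_A (BRCB) block described above. It assumes that BN_A is the statistics sub-layer (a per-channel reduction of the conv output) and BN_B is the normalization sub-layer (applying the deferred normalization and affine transform using the previous block's statistics); the function and parameter names (bn_a, bn_b, gamma, beta) are illustrative only.

```python
import torch
import torch.nn.functional as F

def bn_a(x):
    """Statistics sub-layer: per-channel mean/var of the conv output (small tensors)."""
    mean = x.mean(dim=(0, 2, 3))
    var = x.var(dim=(0, 2, 3), unbiased=False)
    return mean, var

def bn_b(x, mean, var, gamma, beta, eps=1e-5):
    """Normalization sub-layer: apply the deferred normalization + affine transform."""
    return gamma[None, :, None, None] * (x - mean[None, :, None, None]) \
        / torch.sqrt(var[None, :, None, None] + eps) + beta[None, :, None, None]

def cbr_block(x, weight, gamma, beta):
    """Original ordering: CONV -> BN -> RELU (the full BN sits between conv and relu)."""
    y = F.conv2d(x, weight, padding=1)
    mean, var = bn_a(y)
    return torch.relu(bn_b(y, mean, var, gamma, beta))

def brcb_block(x, prev_stats, gamma, beta, weight):
    """Reordered block: BN_B -> RELU -> CONV -> BN_A.

    `prev_stats` are the (mean, var) produced by BN_A of the previous block,
    so normalization, ReLU and the convolution can be computed in one fused
    pass over the activation tensor instead of separate main-memory round trips.
    """
    mean, var = prev_stats
    y = F.conv2d(torch.relu(bn_b(x, mean, var, gamma, beta)), weight, padding=1)
    return y, bn_a(y)  # conv output plus the statistics consumed by the next block
```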
Cite this paper
Wang, J., Li, H. (2020). Minimizing Off-Chip Memory Access for Deep Convolutional Neural Network Training. In: Shen, H., Sang, Y. (eds) Parallel Architectures, Algorithms and Programming. PAAP 2019. Communications in Computer and Information Science, vol 1163. Springer, Singapore. https://doi.org/10.1007/978-981-15-2767-8_42