Minimizing Off-Chip Memory Access for Deep Convolutional Neural Network Training

  • Conference paper
Parallel Architectures, Algorithms and Programming (PAAP 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1163)

Abstract

Training a convolutional neural network requires a large number of operations and memory accesses, which easily leads to a “memory wall” bottleneck and degrades computational performance and efficiency. Batch Normalization (BN) effectively speeds up the convergence of deep network training, but its complex data dependences make the “memory wall” bottleneck even more severe. To address the “memory wall” problem that arises when training convolutional neural networks with the BN algorithm, a training method that splits the BN layer and fuses computation across multiple layers is proposed to reduce memory accesses during model training. First, by reordering the “CONV+BN+RELU” (CBR) block, we trade extra computation for fewer memory accesses, reducing the amount of data accessed during training. Second, according to the memory-access characteristics of the BN layer, the BN layer is split into two sub-layers that are fused with their adjacent layers, recombining the CBR block into “BN_B+RELU+CONV+BN_A” (BRCB). This further reduces main-memory reads and writes during training and alleviates the “memory wall” bottleneck, improving accelerator computational efficiency. Experimental results show that, when training the ResNet-50, Inception V3 and DenseNet models on an NVIDIA TESLA V100 GPU, the BRCB multi-layer fusion method reduces the amount of data accessed by 33%, 22% and 31% respectively compared with the original training method, and improves the achieved computational efficiency of the V100 by 19%, 18% and 21% respectively.
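
As a rough illustration of the reordering described above, the sketch below (not the authors' implementation) shows one plausible reading of the CBR-to-BRCB transformation in NumPy. It assumes that the BN_A sub-layer only accumulates per-channel statistics while the convolution output is still on chip, and that the BN_B sub-layer applies the normalization and affine transform at the start of the next block, where it can be fused with the ReLU and the following convolution; the function names (bn_a_stats, bn_b_apply, conv2d_1x1) are hypothetical.

```python
# Illustrative sketch only: splitting BatchNorm into a statistics sub-layer
# (BN_A) and a normalization/affine sub-layer (BN_B), and reordering the
# CONV+BN+RELU (CBR) block into BN_B+RELU+CONV+BN_A (BRCB).
# Function names are hypothetical; the data layout is NCHW.
import numpy as np

def conv2d_1x1(x, w):
    # Toy 1x1 convolution, enough to show the data flow; w has shape (C_out, C_in).
    return np.einsum('oc,nchw->nohw', w, x)

def bn_a_stats(y):
    # BN_A: per-channel mean and variance, computed right after the convolution.
    return y.mean(axis=(0, 2, 3)), y.var(axis=(0, 2, 3))

def bn_b_apply(y, mean, var, gamma, beta, eps=1e-5):
    # BN_B: normalization plus affine transform, using the statistics from BN_A.
    mean = mean[None, :, None, None]
    var = var[None, :, None, None]
    return gamma[None, :, None, None] * (y - mean) / np.sqrt(var + eps) \
        + beta[None, :, None, None]

def cbr_block(x, w, gamma, beta):
    # Original ordering: CONV -> BN -> RELU. The conv output is written out,
    # read back for the BN statistics, and read again for normalization + ReLU.
    y = conv2d_1x1(x, w)
    mean, var = bn_a_stats(y)
    return np.maximum(bn_b_apply(y, mean, var, gamma, beta), 0.0)

def brcb_block(x, prev_stats, prev_gamma, prev_beta, w):
    # Reordered block: BN_B -> RELU -> CONV -> BN_A. Normalizing the previous
    # conv output, the ReLU and the next convolution can run as one fused pass
    # over the feature map; only the conv output and its fresh statistics leave.
    mean, var = prev_stats
    y = np.maximum(bn_b_apply(x, mean, var, prev_gamma, prev_beta), 0.0)
    y = conv2d_1x1(y, w)
    return y, bn_a_stats(y)
```

Under these assumptions, a chain of BRCB blocks passes each convolution output forward together with its statistics and produces the same values as the corresponding CBR chain up to floating-point error, aside from how the first and last layers are handled at the chain boundaries.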

Author information

Corresponding author

Correspondence to Jijun Wang.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wang, J., Li, H. (2020). Minimizing Off-Chip Memory Access for Deep Convolutional Neural Network Training. In: Shen, H., Sang, Y. (eds) Parallel Architectures, Algorithms and Programming. PAAP 2019. Communications in Computer and Information Science, vol 1163. Springer, Singapore. https://doi.org/10.1007/978-981-15-2767-8_42

  • DOI: https://doi.org/10.1007/978-981-15-2767-8_42

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-2766-1

  • Online ISBN: 978-981-15-2767-8

  • eBook Packages: Computer Science, Computer Science (R0)
