High-efficient MPSoC-based CNNs accelerator with optimized storage and dataflow

The Journal of Supercomputing

Abstract

Convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy, but this comes at the cost of high computational complexity, which imposes enormous communication-bandwidth and storage requirements. The computation requirement can be met effectively by the highly parallel compute paradigms of current CNN accelerators, achieving high throughput. Energy consumption, however, remains high, because communication can be more expensive than computation, especially on low-power embedded platforms. To address this problem, this paper proposes a CNN accelerator based on a novel storage scheme and dataflow on a multi-processor system-on-chip (MPSoC) platform. By minimizing data access and movement and maximizing data reuse, it achieves energy-efficient CNN inference acceleration. The optimization strategy involves four aspects. First, an external-memory-sharing architecture for CPU-FPGA collaborative processing, adopting a two-dimensional array storage mode, is proposed to achieve high data throughput with a low bandwidth requirement for off-chip data transfers. Second, on-chip data access and movement are minimized through a multi-level hierarchical storage architecture. Third, a cyclic data-shifting method is proposed to maximize data reuse in both the spatial and temporal dimensions. Finally, a bit-fusion method based on 8-bit dynamic fixed-point quantization doubles the throughput and computational efficiency of a single DSP. The proposed accelerator is implemented on a Zynq UltraScale+ MPSoC ZCU102 evaluation board, and its throughput and energy efficiency are evaluated by running the benchmark networks VGG16 and Tiny-YOLO. Compared with current typical accelerators, the proposed accelerator increases system throughput by up to 41x, single-DSP throughput by up to 7.63x, and system energy efficiency by up to 6.3x.
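To make the fourth point concrete: a single DSP slice with a wide multiplier (for example, the 27x18 multiplier in a Zynq UltraScale+ DSP48E2) can produce two signed 8-bit products per cycle when two operands sharing a common multiplicand are packed 18 bits apart. The Python sketch below is a minimal behavioral model of that bit-fusion arithmetic only; the 18-bit packing offset, the operand roles, and the borrow correction are generic assumptions about the technique, not details taken from the paper's design.

    def bit_fused_multiply(w1: int, w2: int, a: int):
        """Behavioral model: two signed 8-bit multiplications that share
        the multiplicand 'a', computed with one wide multiply (one DSP)."""
        assert all(-128 <= v <= 127 for v in (w1, w2, a))
        packed = (w1 << 18) + w2   # pack both 8-bit weights 18 bits apart
        p = packed * a             # the single wide multiplication
        lo = p & 0x3FFFF           # low 18 bits carry w2*a (two's complement)
        if lo >= 1 << 17:          # |w2*a| <= 16384 < 2**17, so the sign
            lo -= 1 << 18          # bit sits safely inside the low field
        hi = p >> 18               # upper bits carry w1*a ...
        if lo < 0:
            hi += 1                # ... minus a borrow when w2*a is negative
        return hi, lo              # (w1*a, w2*a)

    # Example: both products are recovered from one multiply.
    assert bit_fused_multiply(-3, 5, 7) == (-21, 35)
    assert bit_fused_multiply(2, -3, 5) == (10, -15)

In hardware the packing is done with the DSP pre-adder and port wiring rather than a software shift, and partial sums are accumulated at wider precision; the sketch only shows why one wide multiplier can serve two 8-bit operations, which is where the claimed doubling of per-DSP throughput comes from.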



Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments. This work was partially supported by the Industry-University-Research Cooperation Fund of the Eighth Research Institute of China Aerospace Science and Technology Corporation (Grant No. SAST2020-068) and the National Natural Science Foundation of China (Grant Nos. 61872017 and 61803034).


Corresponding author

Correspondence to Hongxu Jiang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, Y., Jiang, H., Liu, X. et al. High-efficient MPSoC-based CNNs accelerator with optimized storage and dataflow. J Supercomput 78, 3205–3225 (2022). https://doi.org/10.1007/s11227-021-03909-y

