Abstract
Convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy, but at the cost of high computational complexity, which imposes enormous demands on communication bandwidth and storage resources. The computation requirement can be met effectively by the highly parallel compute paradigms of current CNN accelerators, achieving high throughput. However, energy consumption remains high because communication can be more expensive than computation, especially on low-power embedded platforms. To address this problem, this paper proposes a CNN accelerator based on a novel storage and dataflow scheme on a multi-processor system-on-chip (MPSoC) platform. By minimizing data access and movement and maximizing data reuse, it achieves energy-efficient CNN inference acceleration. The optimization strategies involve four aspects. First, an external memory-sharing architecture adopting a two-dimensional array storage mode for CPU-FPGA collaborative processing is proposed to achieve high data throughput and a low bandwidth requirement for off-chip data transfer. Second, on-chip data access and movement are minimized by a multi-level hierarchical storage architecture. Third, a cyclic data-shifting method is proposed to maximize data reuse in both the spatial and temporal dimensions. Finally, a bit fusion method based on 8-bit dynamic fixed-point quantization doubles the throughput and computational efficiency of a single DSP. The proposed accelerator is implemented on a Zynq UltraScale+ MPSoC ZCU102 evaluation board, and its throughput and energy efficiency are evaluated by running the benchmark networks VGG16 and Tiny-YOLO. Compared with current typical accelerators, the proposed accelerator increases system throughput by up to 41x, single-DSP throughput by up to 7.63x, and system energy efficiency by up to 6.3x.
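The bit fusion method mentioned above is not detailed in this abstract, but the general idea of packing two 8-bit multiplications into a single wide DSP multiplication can be sketched as follows. This is a minimal illustrative model, not the paper's exact scheme: the 18-bit packing offset and the use of unsigned operands are assumptions made here to keep the sketch simple (signed operands would require an additional sign-correction step).

```python
def fused_mac(a0: int, a1: int, w: int) -> tuple:
    """Compute a0*w and a1*w with a single wide multiplication.

    a0, a1: two unsigned 8-bit activations sharing the weight w
    w:      an unsigned 8-bit weight
    Packing a1 18 bits above a0 keeps the two partial products
    from overlapping, since a0*w <= 255*255 < 2**18.
    """
    assert 0 <= a0 < 256 and 0 <= a1 < 256 and 0 <= w < 256
    packed = (a1 << 18) | a0          # one 27-bit fused operand
    product = packed * w              # one multiplication (one DSP slice)
    p0 = product & ((1 << 18) - 1)    # low field  = a0 * w
    p1 = product >> 18                # high field = a1 * w
    return p0, p1
```

Because the two products occupy disjoint bit fields of the wide result, a single hardware multiplier delivers both, which is how one DSP can serve two 8-bit MAC lanes per cycle.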
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments. This work was partially supported by the Industry-University-Research Cooperation Fund of the Eighth Research Institute of China Aerospace Science and Technology Corporation (Grant No. SAST2020-068) and the National Natural Science Foundation of China (Grant Nos. 61872017 and 61803034).
Cite this article
Zhang, Y., Jiang, H., Liu, X. et al. High-efficient MPSoC-based CNNs accelerator with optimized storage and dataflow. J Supercomput 78, 3205–3225 (2022). https://doi.org/10.1007/s11227-021-03909-y