Abstract
Convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy, but at the cost of high computational complexity, which imposes enormous demands on communication bandwidth and storage resources. The computation requirement can be met effectively by the highly parallel compute paradigms of current CNN accelerators, achieving high throughput. However, energy consumption remains high because communication can be more expensive than computation, especially on low-power embedded platforms. To address this problem, this paper proposes a CNN accelerator based on a novel storage and dataflow scheme on a multi-processor system-on-chip (MPSoC) platform. By minimizing data access and movement and maximizing data reuse, it achieves energy-efficient CNN inference acceleration. The optimization strategies involve four aspects. First, an external memory-sharing architecture adopting a two-dimensional array storage mode for CPU-FPGA collaborative processing is proposed to achieve high data throughput and a low bandwidth requirement for off-chip data transfer. Second, on-chip data access and movement are minimized by a multi-level hierarchical storage architecture. Third, a cyclic data-shifting method is proposed to maximize data reuse in both the spatial and temporal dimensions. Finally, a bit fusion method based on 8-bit dynamic fixed-point quantization doubles the throughput and computational efficiency of a single DSP. The proposed accelerator is implemented on a Zynq UltraScale+ MPSoC ZCU102 evaluation board, and its throughput and energy efficiency are evaluated by running the benchmark networks VGG16 and Tiny-YOLO. Compared with current typical accelerators, the proposed accelerator increases system throughput by up to 41x, single-DSP throughput by up to 7.63x, and system energy efficiency by up to 6.3x.
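The bit fusion method mentioned above is not detailed in this abstract, but the general idea of packing two 8-bit multiplications into a single wide DSP multiplication can be sketched as follows. This is a minimal illustrative model, not the paper's exact scheme: the 18-bit packing offset and the use of unsigned operands are assumptions made here to keep the sketch simple (signed operands would require an additional sign-correction step).

```python
def fused_mac(a0: int, a1: int, w: int) -> tuple:
    """Compute a0*w and a1*w with a single wide multiplication.

    a0, a1: two unsigned 8-bit activations sharing the weight w
    w:      an unsigned 8-bit weight
    Packing a1 18 bits above a0 keeps the two partial products
    from overlapping, since a0*w <= 255*255 < 2**18.
    """
    assert 0 <= a0 < 256 and 0 <= a1 < 256 and 0 <= w < 256
    packed = (a1 << 18) | a0          # one 27-bit fused operand
    product = packed * w              # one multiplication (one DSP slice)
    p0 = product & ((1 << 18) - 1)    # low field  = a0 * w
    p1 = product >> 18                # high field = a1 * w
    return p0, p1
```

Because the two products occupy disjoint bit fields of the wide result, a single hardware multiplier delivers both, which is how one DSP can serve two 8-bit MAC lanes per cycle.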
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments. This work was partially supported by the Industry-University-Research Cooperation Fund of the Eighth Research Institute of China Aerospace Science and Technology Corporation (Grant No. SAST2020-068) and the National Natural Science Foundation of China (Grant Nos. 61872017 and 61803034).
Cite this article
Zhang, Y., Jiang, H., Liu, X. et al. High-efficient MPSoC-based CNNs accelerator with optimized storage and dataflow. J Supercomput 78, 3205–3225 (2022). https://doi.org/10.1007/s11227-021-03909-y