DSA-CNN: an FPGA-integrated deformable systolic array for convolutional neural network acceleration

Wan, Yi; Chen, Junfan; Yang, Xiong; Zhang, Hailong; Huang, Chao; Xie, Xianzhong

doi:10.1007/s10489-024-05898-w

Published: 02 December 2024

Volume 55, article number 65 (2025)

Applied Intelligence

Abstract

Field-Programmable Gate Arrays (FPGAs) are increasingly being explored for accelerating Convolutional Neural Networks (CNNs) because of their energy efficiency and robust performance. For low-power edge deployment, FPGA-based CNN accelerators typically adopt spatial unrolling architectures. These designs achieve high computational efficiency and low power consumption while reducing the latency between data transfer and storage access. Nonetheless, these accelerators may not perform as well on convolutional layers that have large input sizes but few channels, and the complexity of managing spatial unrolling can hinder their large-scale implementation in integrated circuits. To meet these challenges, this paper presents a new computing architecture called the Deformation Systolic Array (DSA). The design starts from configurable processing elements (PEs). The architecture uses a purpose-designed feature pumping (F-P) method as its dataflow to minimize delays. Additionally, data is broadcast across PEs through the systolic array, which enhances data reuse. The scalable design allows adaptation to varying resource capacities and computational requirements. Furthermore, a scheduling policy enables PEs to follow different parallel processing modes depending on the number of channels, size, and type of the convolutional layer. Evaluation experiments demonstrate that, compared to the NVIDIA RTX 3090 GPU and the SIYUAN370 ASIC, DSA-CNN achieves speedups of 2.10$\times$ and 1.89$\times$, respectively, when deploying the lightweight object detection network SSD-MobileNetV1-300 on the VU13P.
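
The difficulty with layers that have few channels can be illustrated with a rough utilization estimate. The sketch below is not taken from the paper: it assumes a hypothetical PE array that is unrolled purely along input and output channels, and the array size (32 x 32 lanes) and the function name `channel_unroll_utilization` are illustrative choices only. It computes the fraction of PEs that hold useful work once the layer's channels are padded up to whole tiles.

```python
from math import ceil


def channel_unroll_utilization(c_in, c_out, lanes_in, lanes_out):
    """Fraction of PEs doing useful work for one convolutional layer.

    Assumes (hypothetically) a lanes_in x lanes_out array where each PE
    handles one (input channel, output channel) pair per tile, so channel
    counts are padded up to a whole number of tiles.
    """
    tiles_in = ceil(c_in / lanes_in)
    tiles_out = ceil(c_out / lanes_out)
    useful = c_in * c_out
    occupied = tiles_in * lanes_in * tiles_out * lanes_out
    return useful / occupied


# First layer of a typical image network: 3 input channels, 32 output channels.
first = channel_unroll_utilization(3, 32, 32, 32)

# A deep layer whose channel counts are multiples of the array size.
deep = channel_unroll_utilization(512, 512, 32, 32)

print(f"first layer: {first:.5f}")  # 0.09375
print(f"deep layer:  {deep:.5f}")   # 1.00000
```

Under these assumptions, a 3-channel input layer keeps under 10% of the array busy, while a deep layer fills it completely. This is the kind of mismatch that a policy switching parallel modes per layer aims to avoid.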

Availability of data and materials

The datasets used or analysed during the current study are available from the corresponding author on reasonable request.

Code availability

The code is available from the corresponding author on reasonable request.

Acknowledgements

This work is partially supported by the Special Key Project of Technological Innovation and Application Development of Chongqing (CSTB2022TIAD-KPX0057) and the Natural Science Foundation Innovation and Development Joint Fund of Chongqing (CSTB2022NSCQ-LZX0074).

Author information

Authors and Affiliations

College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, 400065, Chongqing, China
Yi Wan & Xianzhong Xie
Chongqing Haiyunjiexun Technology Co., Ltd, Chongqing, China
Junfan Chen, Xiong Yang & Hailong Zhang
Intel FPGA China Innovation Center, Chongqing, China
Chao Huang

Contributions

Yi Wan: Writing (original draft), Writing (review and editing), Investigation and Software. Junfan Chen: Software. Xiong Yang: Software. Hailong Zhang: Software. Chao Huang: Visualization and Data curation. Xianzhong Xie: Methodology and Conceptualization.

Corresponding author

Correspondence to Xianzhong Xie.

Ethics declarations

Conflict of interest/Competing interests

Not applicable

Consent to participate

Not applicable

Consent for publication

Not applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wan, Y., Chen, J., Yang, X. et al. DSA-CNN: an FPGA-integrated deformable systolic array for convolutional neural network acceleration. Appl Intell 55, 65 (2025). https://doi.org/10.1007/s10489-024-05898-w

Accepted: 24 October 2024
Published: 02 December 2024
DOI: https://doi.org/10.1007/s10489-024-05898-w
