An Efficient FPGA-based Depthwise Separable Convolutional Neural Network Accelerator with Hardware Pruning

Published: 12 February 2024

Abstract

Convolutional neural networks (CNNs) have been widely deployed in computer vision tasks. However, their computation- and resource-intensive nature hinders their application on embedded systems. This article proposes an efficient inference accelerator on a Field Programmable Gate Array (FPGA) for CNNs with depthwise separable convolutions. To improve accelerator efficiency, we make four contributions: (1) an efficient convolution engine with multiple parallelism-exploiting strategies and a configurable adder tree, designed to support three types of convolution operations; (2) a dedicated architecture, combined with input buffers, for the bottleneck network structure, which reduces data transmission time; (3) a hardware padding scheme that eliminates invalid padding operations; and (4) a hardware-assisted pruning method that supports an online tradeoff between model accuracy and power consumption. Experimental results show that, for MobileNetV2, the accelerator achieves 10× and 6× energy-efficiency improvements over CPU and GPU implementations, respectively, and delivers 302.3 frames per second and 181.8 GOPS, the best performance among several existing single-engine FPGA accelerators. The proposed hardware-assisted pruning method reduces power consumption by 59.7% with an accuracy loss of less than 5%.
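The efficiency headroom the accelerator exploits comes from the reduced arithmetic cost of depthwise separable convolution. As a rough illustration of that reduction (a minimal sketch, not the paper's hardware implementation; the layer shape below is a hypothetical example), the following Python snippet compares the multiply-accumulate (MAC) counts of a standard convolution and its depthwise separable counterpart:

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    """MACs for a standard k x k convolution producing an h x w output."""
    return h * w * k * k * c_in * c_out


def depthwise_separable_macs(h, w, k, c_in, c_out):
    """MACs for a depthwise (k x k, per channel) plus pointwise (1 x 1) pair."""
    depthwise = h * w * k * k * c_in   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, loosely in the MobileNetV2 regime.
    h, w, k, c_in, c_out = 56, 56, 3, 64, 128
    std = standard_conv_macs(h, w, k, c_in, c_out)
    sep = depthwise_separable_macs(h, w, k, c_in, c_out)
    print(f"standard:  {std:,} MACs")
    print(f"separable: {sep:,} MACs (~{std / sep:.1f}x fewer)")
```

The ratio of separable to standard cost works out to 1/c_out + 1/k²; for a 3×3 kernel and 128 output channels this is roughly an 8.4× reduction in MACs, which is the arithmetic saving that single-engine designs like the one described here are built around.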


Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 1
March 2024, 446 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3613534
Editor: Deming Chen

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 12 February 2024
      • Online AM: 13 September 2023
      • Accepted: 4 August 2023
      • Revised: 23 May 2023
      • Received: 2 June 2022
Published in TRETS Volume 17, Issue 1
