An Efficient FPGA-based Depthwise Separable Convolutional Neural Network Accelerator with Hardware Pruning

Published: 12 February 2024

Abstract

Convolutional neural networks (CNNs) have been widely deployed in computer vision tasks. However, their computation- and resource-intensive nature hinders their application on embedded systems. This article proposes an efficient inference accelerator on a Field Programmable Gate Array (FPGA) for CNNs with depthwise separable convolutions. To improve accelerator efficiency, we make four contributions: (1) an efficient convolution engine with multiple parallelism-exploiting strategies and a configurable adder tree, designed to support three types of convolution operations; (2) a dedicated architecture, combined with input buffers, for the bottleneck network structure, which reduces data transmission time; (3) a hardware padding scheme that eliminates invalid padding operations; and (4) a hardware-assisted pruning method that supports an online tradeoff between model accuracy and power consumption. Experimental results show that, for MobileNetV2, the accelerator achieves 10× and 6× energy-efficiency improvements over CPU and GPU implementations, respectively, and delivers 302.3 frames per second and 181.8 GOPS, the best performance among several existing single-engine FPGA accelerators. The proposed hardware-assisted pruning method reduces power consumption by 59.7% with an accuracy loss of less than 5%.
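The efficiency headroom the accelerator exploits comes from the reduced arithmetic cost of depthwise separable convolution. As a rough illustration of that reduction (a minimal sketch, not the paper's hardware implementation; the layer shape below is a hypothetical example), the following Python snippet compares the multiply-accumulate (MAC) counts of a standard convolution and its depthwise separable counterpart:

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    """MACs for a standard k x k convolution producing an h x w output."""
    return h * w * k * k * c_in * c_out


def depthwise_separable_macs(h, w, k, c_in, c_out):
    """MACs for a depthwise (k x k, per channel) plus pointwise (1 x 1) pair."""
    depthwise = h * w * k * k * c_in   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, loosely in the MobileNetV2 regime.
    h, w, k, c_in, c_out = 56, 56, 3, 64, 128
    std = standard_conv_macs(h, w, k, c_in, c_out)
    sep = depthwise_separable_macs(h, w, k, c_in, c_out)
    print(f"standard:  {std:,} MACs")
    print(f"separable: {sep:,} MACs (~{std / sep:.1f}x fewer)")
```

The ratio of separable to standard cost works out to 1/c_out + 1/k²; for a 3×3 kernel and 128 output channels this is roughly an 8.4× reduction in MACs, which is the arithmetic saving that single-engine designs like the one described here are built around.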


Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 1
March 2024, 446 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3613534
Editor: Deming Chen

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 12 February 2024
      • Online AM: 13 September 2023
      • Accepted: 4 August 2023
      • Revised: 23 May 2023
      • Received: 2 June 2022
Published in TRETS Volume 17, Issue 1
