Abstract
Convolutional neural networks (CNNs) have been widely deployed in computer vision tasks. However, their computation- and resource-intensive nature hinders their application on embedded systems. This article proposes an efficient inference accelerator on Field Programmable Gate Array (FPGA) for CNNs with depthwise separable convolutions. To improve accelerator efficiency, we make four contributions: (1) an efficient convolution engine with multiple parallelism-exploiting strategies and a configurable adder tree, designed to support three types of convolution operations; (2) a dedicated architecture, combined with input buffers, designed for the bottleneck network structure to reduce data-transmission time; (3) a hardware padding scheme that eliminates invalid padding operations; and (4) a hardware-assisted pruning method that supports an online tradeoff between model accuracy and power consumption. Experimental results show that for MobileNetV2 the accelerator achieves 10× and 6× better energy efficiency than CPU and GPU implementations, respectively, and delivers 302.3 frames per second and 181.8 GOPS, the best performance among several existing single-engine FPGA accelerators. The proposed hardware-assisted pruning method reduces power consumption by 59.7% with an accuracy loss within 5%.
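The key operation targeted by the accelerator, depthwise separable convolution, factors a standard convolution into a per-channel (depthwise) filter followed by a 1×1 (pointwise) channel mix. The sketch below is a minimal NumPy illustration of the arithmetic and its multiply-count savings, not the paper's hardware design; all function names and shapes are illustrative assumptions (valid padding, stride 1).

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Depthwise separable convolution (illustrative, 'valid' padding, stride 1).
    x: (H, W, C_in); dw_kernels: (k, k, C_in); pw_kernels: (C_in, C_out)."""
    H, W, C_in = x.shape
    k = dw_kernels.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    # Depthwise stage: each input channel is filtered independently.
    dw = np.zeros((Ho, Wo, C_in))
    for c in range(C_in):
        for i in range(Ho):
            for j in range(Wo):
                dw[i, j, c] = np.sum(x[i:i+k, j:j+k, c] * dw_kernels[:, :, c])
    # Pointwise stage: a 1x1 convolution mixes channels (a matmul per pixel).
    return dw @ pw_kernels  # shape (Ho, Wo, C_out)

def mults_standard(H, W, k, C_in, C_out):
    # Multiplies in a standard k x k convolution.
    return (H - k + 1) * (W - k + 1) * k * k * C_in * C_out

def mults_separable(H, W, k, C_in, C_out):
    # Multiplies in the depthwise + pointwise factorization.
    return (H - k + 1) * (W - k + 1) * (k * k * C_in + C_in * C_out)
```

For an 8×8×4 input with a 3×3 kernel and 16 output channels, the factorization needs 3,600 multiplies versus 20,736 for the standard convolution, reflecting the usual 1/C_out + 1/k² reduction that makes MobileNet-style networks attractive on FPGAs.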
Index Terms
- An Efficient FPGA-based Depthwise Separable Convolutional Neural Network Accelerator with Hardware Pruning