Elsevier

Neurocomputing

Volume 490, 14 June 2022, Pages 1-16

Efficient depthwise separable convolution accelerator for classification and UAV object detection

https://doi.org/10.1016/j.neucom.2022.02.071

Abstract

Depthwise separable convolution (DSC) has been widely deployed in lightweight convolutional neural networks because of its high efficiency, but the acceleration performance of the Graphics Processing Unit (GPU) for DSC falls short of the theoretical value. In this paper, several approaches are proposed for accelerating DSC on a Field-Programmable Gate Array (FPGA). For the preceding layers, a spatial-to-channel (S2C) transformation is proposed to accelerate computation and improve the utilization of computational resources and bandwidth. An efficient SharePE is proposed to accelerate DSC, improving the efficiency of the computing resources. A regulable parallelism approach is proposed to compute the different pointwise convolutional layers efficiently. The D2P&P2D approach is proposed to reduce external memory access. For the entire accelerating system, a pre-load workflow is proposed to reduce the accelerator's waiting time between two images. We demonstrated our approaches on SkyNet using the Ultra96V2 development board. The proposed accelerator achieved 80.030 frames per second at 0.072 Joule per image for UAV object detection, the state-of-the-art result for SkyNet. In addition, the MobileNetV2 model was implemented on a larger XC7Z100 FPGA, where our accelerator classified each ImageNet picture in 2.69 ms. Code is available at https://github.com/AILearnerLi/DAC-SDC-2020-SEUer.

Introduction

Convolutional neural networks (CNNs) have become the mainstream approach in the computer vision field [1], [2], [3], [4] because of their high accuracy. CNNs are generally computation-intensive models with abundant parameters and floating-point operations (FLOPs). For instance, VGG-19 [5] has 16 convolutional layers and 3 fully connected layers, with 144 million parameters and 19.6 billion FLOPs for one 224×224 image. ResNet-101 [6] and DenseNet-121 [7] have 25 million and 8.1 million parameters, respectively. Such CNNs are difficult to deploy on resource- and power-limited devices while maintaining real-time performance. Lightweight CNNs [8], [9], [10] have therefore attracted the attention of researchers, and several lightweight convolution operations have been proposed to reduce parameters and FLOPs.

Group convolution is a popular lightweight convolution. ResNeXt [11] adopted group convolution and obtained higher accuracy than ResNet at a similar computational cost. Chen et al. [2] reduced the parameters of group convolution by sharing weights via Bayesian learning, obtaining higher accuracy than ResNeXt with a similar model size. Dense2Net [12] obtained higher accuracy than DenseNet [7] with fewer parameters by using group convolution. ShuffleNet [13], a well-known lightweight CNN, used group convolution to reduce the parameters and FLOPs of the 1×1 convolutional layers, achieving about 13× actual speedup over AlexNet while maintaining comparable accuracy.

Depthwise convolution (DWC) is an extreme case of group convolution in which each group contains exactly one feature-map channel. DWC is more efficient than group convolution because it has even fewer parameters and FLOPs. DWC was applied in MobileNetV1 [14] and later MobileNetV2 [15], which achieved comparable results with far fewer FLOPs and parameters, and it is now a very popular building block for lightweight CNNs. Many CNNs [16], [17] produced by network architecture search also adopt DWC to improve efficiency.
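
The savings can be made concrete with a quick parameter and FLOP count. The sketch below (plain Python, with hypothetical layer dimensions) treats standard convolution as group convolution with g = 1 and depthwise convolution as the extreme case g = C_in:

```python
def conv_params_flops(h, w, c_in, c_out, k, groups=1):
    """Parameters and multiply-accumulate count of a k x k group
    convolution over an h x w input (stride 1, 'same' padding,
    bias ignored). groups=1 is a standard convolution; groups=c_in
    with c_out=c_in is a depthwise convolution."""
    params = k * k * (c_in // groups) * c_out
    macs = params * h * w          # each weight is used once per output pixel
    return params, macs

h, w, c = 56, 56, 128              # hypothetical feature-map size
std = conv_params_flops(h, w, c, c, 3)             # standard 3x3
dwc = conv_params_flops(h, w, c, c, 3, groups=c)   # depthwise 3x3
pwc = conv_params_flops(h, w, c, c, 1)             # pointwise 1x1
print("standard :", std)
print("DWC + PWC:", tuple(a + b for a, b in zip(dwc, pwc)))
print("reduction:", (dwc[1] + pwc[1]) / std[1])    # ~ 1/c + 1/k^2
```

For a 3×3 kernel and 128 channels this gives the familiar roughly 8–9× reduction of DSC over standard convolution.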

DWC has a high memory access cost (MAC) to FLOPs ratio because of its large feature maps and few FLOPs. Consequently, the acceleration performance of the Graphics Processing Unit (GPU) for depthwise convolution cannot reach the theoretical value [16], [18]. FPGAs excel at low-precision computation, and their adaptability to new algorithms makes them well suited to supporting rapidly changing CNN architectures.
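
The bottleneck is easy to see from a roofline-style computation-to-communication (CTC) ratio, i.e. operations per byte of external memory traffic. The sketch below uses a simplified traffic model that is an assumption, not the paper's: hypothetical layer sizes, 1-byte fixed-point elements, and each input map, output map, and weight crossing the memory boundary exactly once:

```python
def ctc_ratio(h, w, c_in, c_out, k, groups=1, bytes_per_elem=1):
    """Operations per byte moved, under the idealized assumption that
    each feature-map element and each weight is transferred once."""
    weights = k * k * (c_in // groups) * c_out
    ops = 2 * weights * h * w                      # multiply + add
    traffic = (h * w * c_in + h * w * c_out + weights) * bytes_per_elem
    return ops / traffic

h, w, c = 56, 56, 128                              # hypothetical layer
print("standard 3x3 :", ctc_ratio(h, w, c, c, 3))
print("depthwise 3x3:", ctc_ratio(h, w, c, c, 3, groups=c))
```

Even under this generous model the depthwise layer has roughly two orders of magnitude fewer operations per byte than the standard layer, which is why it sits in the bandwidth-bound region of the roofline.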

Many hardware accelerators [19], [20], [21], [22] have been developed to improve the speed and power efficiency of compute-intensive CNNs. Moini et al. [23] exploited the inherent parallelism in CNNs to reduce the bandwidth, resource usage, and power consumption of the highly complex convolution operations; their accelerator used only 391 DSP48 slices and delivered 19.2 giga multiply-accumulate operations per second while consuming less than 10 watts (W). Wang et al. [24] implemented the VGG16 model on the Xilinx Virtex VC707 platform and achieved a frame rate of 33.80 frames per second (FPS) with an average performance of 1250.21 giga operations per second (GOPS). Ma et al. [25] studied convolution loop optimization before the hardware design phase and proposed a specific dataflow; their accelerator implemented NiN, VGG-16, and ResNet-50/ResNet-152, achieving 707.2 GOPS for ResNet-152. Azizimazreah et al. [26] exploited cross-layer shortcut reuse in CNN accelerators, and experimental results showed that the proposed Shortcut Mining achieves 53.3%, 58%, and 43% reductions in off-chip feature-map traffic for SqueezeNet, ResNet-34, and ResNet-152, respectively. However, most of these accelerators were designed for large CNNs and are not well suited to depthwise separable convolution (DSC) because of the special operation and low MAC/FLOPs ratio of the DWC.

With the wide application of DSC in lightweight CNNs [16], [27], an efficient hardware accelerator for DSC has become urgent. Existing DSC accelerators [17], [28] exhibit a low utilization rate of computational resources over the entire running phase. In this paper, we give a roofline model analysis of DSC, which guides the design of an efficient accelerator. We then propose spatial to channel (S2C), D2P&P2D, a processing element shared between DWC and pointwise convolution (SharePE), regulable parallelism (R-Parallel) in the computing unit, and a pre-load workflow to improve the resource utilization rate, reduce external memory access, and speed up the accelerator. Based on iSmart3 (the champion model of the IEEE/ACM Design Automation Conference System Design Contest, DAC'2019-SDC) [17], we designed an accelerator for SkyNet using a subset of these methods, which took 6th place in the DAC'2020-SDC. Applying all of the proposed approaches to Skrskr-SkyNet (2nd place in the DAC'2020-SDC) [28], we obtained state-of-the-art results for SkyNet.

Fig. 1 compares the energy score and inference accuracy (IoU) on FPGA of our accelerators against the top-3 DAC-SDC designs of the last two years. According to the DAC'2020-SDC evaluation, the energy score is ES = max{0, 1 + 0.2 × log2(Ē/E)}, where Ē is the average energy consumption across all teams and E is the energy consumption of the design; a higher energy score corresponds to lower energy consumption. Our accelerator based on Skrskr achieves 80.030 FPS with 0.731 IoU, a score that surpasses the 1st-place solution of the DAC'2020-SDC. Furthermore, we implemented the MobileNetV2 model on the Xilinx XC7Z100 FPGA platform and achieved a frame rate of 371.4 FPS and 0.19 GOPS per DSP (GOPS/DSP). The contributions of this work can be summarized as follows.

  • An efficient SharePE is proposed to compute depthwise separable convolution; it can efficiently compute both DWC and PWC with a high computing-resource utilization rate.

  • Regulable parallelism in the computing unit is proposed to compute the different PWC layers in the residual block, improving the utilization rate of the computational resources.

  • A spatial to channel (S2C) approach is proposed to accelerate the computation of the preceding DWC layers in CNNs, improving the utilization of computational resources and bandwidth.

  • D2P&P2D is proposed to reduce external memory access: D2P is adopted when the feature maps have few channels, and P2D is adopted when the number of channels is large.

  • A pre-load workflow is proposed for the entire accelerating system, reducing the waiting time between computing two images.

  • SkyNet was implemented with the proposed approaches on the resource-constrained Ultra96V2 FPGA platform for object detection and achieved state-of-the-art results.

  • MobileNetV2 was implemented on the XC7Z100 FPGA platform with the proposed improvements, obtaining a high FPS and throughput per DSP.
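
To give some intuition for the S2C idea listed above: the preceding layers of a CNN have large spatial dimensions but few channels, so a space-to-depth style rearrangement can trade spatial size for channel count and expose more channel-level parallelism to the processing elements. The paper's exact mapping is not reproduced here; the NumPy sketch below shows a generic 2×2 block rearrangement for illustration only:

```python
import numpy as np

def spatial_to_channel(x, block=2):
    """Rearrange (H, W, C) -> (H/block, W/block, C*block*block),
    moving each block x block spatial patch into the channel axis."""
    h, w, c = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # gather each patch together
    return x.reshape(h // block, w // block, c * block * block)

x = np.arange(16 * 16 * 3).reshape(16, 16, 3)      # small hypothetical input
y = spatial_to_channel(x)
print(x.shape, "->", y.shape)                      # (16, 16, 3) -> (8, 8, 12)
```

The transformation is lossless (every element survives), so a convolution over the rearranged tensor can use four times the channel parallelism on a quarter of the spatial positions.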

The remainder of this paper is organized as follows. Section 2 reviews two typical DSC CNNs, MobileNetV2 for image classification and SkyNet for object detection, and then introduces related work on hardware accelerators for depthwise separable convolution and the roofline model. In Section 3, the roofline model analysis of DSC is presented and SkyNet is analyzed with the roofline model. Section 4 describes the system architecture, including the dedicated accelerator architecture and the improvement approaches for accelerating depthwise separable convolution. The experimental results of the proposed accelerators for classification (MobileNetV2) and object detection (SkyNet) are analyzed and discussed in Section 5. Conclusions are given in Section 6.

Section snippets

MobileNetV2

MobileNetV2 [15] is a typical lightweight CNN constructed from DSC. It introduced inverted residuals and linear bottlenecks, which significantly decrease the number of operations and the memory required while retaining high accuracy. MobileNetV2 has 3.4 million (M) weights and 300 M multiply-adds and achieves 72% accuracy on the ImageNet dataset. Owing to this excellent performance, most DSC hardware accelerators [29], [30], [31], [32] choose MobileNetV2 to verify the performance
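
Following the MobileNetV2 paper, an inverted residual block expands the channels with a 1×1 PWC (expansion factor t), filters spatially with a 3×3 DWC, then projects back with a 1×1 linear bottleneck. The sketch below counts the weights of one such block for hypothetical channel sizes (batch-norm and bias terms ignored):

```python
def inverted_residual_params(c_in, c_out, t=6, k=3):
    """Weight count of one MobileNetV2-style inverted residual:
    1x1 expansion (factor t) -> k x k depthwise -> 1x1 linear
    projection. The skip connection adds no weights."""
    c_mid = t * c_in
    expand = c_in * c_mid          # 1x1 PWC
    dwc = k * k * c_mid            # depthwise: one k x k filter per channel
    project = c_mid * c_out        # 1x1 linear bottleneck
    return expand + dwc + project

# hypothetical block: 24 -> 24 channels, default expansion factor 6
print(inverted_residual_params(24, 24))
```

Note that the two 1×1 layers dominate the weight count, which is why the paper's regulable parallelism targets the different PWC layers of the residual block.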

Roofline model Analysis for Depthwise separable convolution

Depthwise separable convolution was first introduced in the Xception [27] architecture, which is wider than ResNet [6] but has a similar number of parameters thanks to the efficient depthwise separable convolution. DSC was subsequently adopted widely by lightweight CNNs [14], [18], [17]. A depthwise separable convolution performs a DWC followed by a PWC: the DWC extracts spatial features and the PWC fuses channel information, so a standard convolutional layer can be replaced by a DSC layer. Fig. 2 illustrates how the
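
The DWC-then-PWC factorization can be sketched as a naive NumPy forward pass (illustrative only, with hypothetical shapes; valid padding, stride 1, no optimization):

```python
import numpy as np

def depthwise_separable(x, dw_filters, pw_weights):
    """Naive DSC: a k x k depthwise pass per channel (valid padding,
    stride 1) followed by a 1 x 1 pointwise channel fusion."""
    h, w, c = x.shape
    k = dw_filters.shape[0]                        # dw_filters: (k, k, c)
    out = np.zeros((h - k + 1, w - k + 1, c))
    for i in range(out.shape[0]):                  # spatial filtering per channel
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]
            out[i, j, :] = np.sum(patch * dw_filters, axis=(0, 1))
    return out @ pw_weights                        # (c, c_out) fuses channels

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 4))                 # hypothetical input
y = depthwise_separable(x, rng.standard_normal((3, 3, 4)),
                        rng.standard_normal((4, 8)))
print(y.shape)                                     # (4, 4, 8)
```

The depthwise loop never mixes channels; only the final matrix product does, which is exactly the split that lets an accelerator schedule DWC and PWC on shared processing elements.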

System Architecture

In this section, the quantization scheme is first presented; the convolutional layer and batch normalization are fused before quantization. The overall architecture of the accelerating system is then introduced. The spatial to channel approach for accelerating computation and improving computing-resource utilization is described. D2P&P2D for reducing external memory access and increasing the CTC ratio is introduced. The SharePE for computing DWC and PWC, and the regulable parallelism

Results and Analysis

In this section, the roofline model analysis of the proposed accelerator based on the quantization method in Skrskr is presented. The effectiveness of the pre-load workflow, SharePE, R-Parallel, D2P&P2D, and S2C approaches is then evaluated separately on SkyNet. Finally, our accelerators are compared with other existing FPGA-based accelerators for DSC.

Conclusion

Depthwise separable convolution has fewer parameters and a lower computing cost than standard convolution and is widely used in lightweight convolutional neural networks. However, the acceleration performance of the GPU falls short of the theoretical value and its computational resource efficiency is low, which limits the achievable acceleration. In this paper, depthwise separable convolution is analyzed with the roofline model. Furthermore, several approaches are proposed to improve the efficiency of

CRediT authorship contribution statement

Guoqing Li: Conceptualization, Methodology, Writing - review & editing. Jingwei Zhang: Software, Data curation, Writing - original draft. Meng Zhang: Project administration, Supervision, Funding acquisition, Resources. Ruixia Wu: Software, Investigation, Visualization. Xinye Cao: Software, Formal analysis, Validation. Wenzhao Liu: Writing - review & editing, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research work was partly supported by National Key R&D Program of China (Project No. 2018YFB2202703) and the Key R&D Program of Guangdong Province (Project No. 2021B1101270006), and the Natural Science Foundation of Jiangsu Province (Project No. BK20201145).

Guoqing Li received the B.S. degree from Qingdao University, Qingdao, China, in 2014, the M.S. degree from South China Normal University, Guangzhou, China, in 2017. He is currently pursuing the Ph.D. degree with the National ASIC Engineering Technology Research Center, School of Electronics Science and Engineering, Southeast University, Nanjing, China. His current research interests include computer vision, convolutional neural networks, deep learning hardware accelerators.

References (49)

  • Y. Ma et al., ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler, Integration (2018)
  • T. Chen et al., An efficient sharing grouped convolution via Bayesian learning, IEEE Trans. Neural Networks Learn. Syst. (2021)
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition
  • K. He et al., Deep residual learning for image recognition
  • G. Huang et al., Densely connected convolutional networks
  • S. Xie et al., Aggregated residual transformations for deep neural networks
  • X. Zhang et al., ShuffleNet: an extremely efficient convolutional neural network for mobile devices
  • A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al., Mobilenets: Efficient convolutional neural...
  • M. Sandler et al., MobileNetV2: inverted residuals and linear bottlenecks
  • A. Howard, R. Pang, H. Adam, Q.V. Le, M. Sandler, B. Chen, W. Wang, L. Chen, M. Tan, G. Chu, V. Vasudevan, Y. Zhu,...
  • X. Zhang, H. Lu, C. Hao, J. Li, B. Cheng, Y. Li, K. Rupnow, J. Xiong, T. Huang, H. Shi, W.-M. Hwu, D. Chen, SkyNet: a...
  • N. Ma et al., ShuffleNet V2: practical guidelines for efficient CNN architecture design
  • G. Li et al., Efficient binary 3D convolutional neural network and hardware accelerator, J. Real-Time Image Process. (2021)
  • S. Moini et al., A resource-limited hardware accelerator for convolutional neural networks in embedded vision applications, IEEE Trans. Circuits Syst. II Express Briefs (2017)

    Jingwei Zhang is an Eng.D student at the National ASIC Center in School of Electronic Science & Engineering, Southeast University, China. His research interests include design space exploration for Integrated Circuits and deep learning hardware accelerators.

    Meng Zhang received the B.S. degree in electrical engineering from the China University of Mining and Technology, Xuzhou, China, in 1986, and the M.S. degree in bioelectronics and the Ph.D. degree in microelectronic engineering, as an on-the-job postgraduate student, from Southeast University, Nanjing, China, in 1993 and 2014, respectively. He is currently a Professor and a Faculty Adviser of Ph.D. graduates at the National ASIC System Research Center, School of Electronic Science and Engineering, Southeast University. He has published more than 40 refereed journal articles and international conference papers. He holds more than 90 patents, including some PCT and U.S. patents. His research interests include deep learning, machine learning, digital signal processing, digital communication systems, and digital integrated circuit design.

    Ruixia Wu is an M.S. student at the National ASIC Center in the School of Microelectronics, Southeast University, China. She received the B.S. degree from Xi’an University of Posts and Telecommunications, Xi’an, China, in 2019. Her research interests include deep learning techniques, neural architecture search, etc.

    Xinye Cao is a master student at National ASIC Engineering Technology Research Center in School of Electronics Science and Engineering, Southeast University, Nanjing, China. He also received the bachelor’s degree from Southeast University, Nanjing, China, in 2020. His research fields include computer vision, FPGA development, etc.

    Wenzhao Liu received the bachelor’s degree from the School of Electronic Science and Engineering, Southeast University, in 2017. She is currently a master’s student at the National ASIC Engineering Technology Research Center of Southeast University. Her research interests include computer vision, image processing, and network accelerators.
