Elsevier

Neurocomputing

Volume 490, 14 June 2022, Pages 1-16

Efficient depthwise separable convolution accelerator for classification and UAV object detection

https://doi.org/10.1016/j.neucom.2022.02.071

Abstract

Depthwise separable convolution (DSC) has been widely deployed in lightweight convolutional neural networks because of its high efficiency, but the acceleration performance of the Graphics Processing Unit (GPU) for DSC falls short of the theoretical value. In this paper, several approaches are proposed for accelerating DSC on a Field-Programmable Gate Array (FPGA). For the preceding layers, a spatial-to-channel (S2C) transformation is proposed to accelerate computation and improve the utilization of computational resources and bandwidth. An efficient SharePE is proposed to accelerate DSC, improving the efficiency of the computing resources. A regulable parallelism approach is proposed to compute the different pointwise convolutional layers efficiently. The D2P&P2D approach is proposed to reduce external memory access. For the entire accelerating system, a pre-load workflow is proposed to reduce the accelerator's waiting time between two images. We demonstrated our approaches on SkyNet using the Ultra96V2 development board. The proposed accelerator achieved 80.030 frames per second at 0.072 Joule per image for UAV object detection, the state-of-the-art result for SkyNet. In addition, the MobileNetV2 model was implemented on a larger XC7Z100 FPGA, where our accelerator classified each ImageNet picture in 2.69 ms. Code is available at https://github.com/AILearnerLi/DAC-SDC-2020-SEUer.

Introduction

Convolutional neural networks (CNNs) have become the mainstream approach in the computer vision field [1], [2], [3], [4] because of their high accuracy. CNNs are generally computation-intensive models with abundant parameters and floating-point operations (FLOPs). For instance, VGG-19 [5] has 16 convolutional layers and 3 fully connected layers, with 144 million parameters and 19.6 billion FLOPs for one 224×224 image. ResNet-101 [6] and DenseNet-121 [7] have 25 million and 8.1 million parameters, respectively. Such CNNs are difficult to deploy on resource- and power-limited devices while maintaining real-time performance. Lightweight CNNs [8], [9], [10] have therefore attracted the attention of researchers, and several lightweight convolution operations have been proposed to reduce parameters and FLOPs.

Group convolution is a popular lightweight convolution. ResNeXt [11] adopted group convolution and obtained higher accuracy than ResNet at a similar computational cost. Chen et al. [2] reduced the parameters of group convolution by sharing weights via Bayesian learning, obtaining higher accuracy than ResNeXt with a similar model size. Dense2Net [12] obtained higher accuracy than DenseNet [7] with fewer parameters by using group convolution. ShuffleNet [13], a well-known lightweight CNN, used group convolution to reduce the parameters and FLOPs of the 1×1 convolutional layers, achieving about 13× actual speedup over AlexNet while maintaining comparable accuracy.

Depthwise convolution (DWC) is an extreme case of group convolution in which each group contains exactly one feature-map channel. DWC is more efficient than group convolution because it has even fewer parameters and FLOPs. DWC was applied in MobileNetV1 [14] and later MobileNetV2 [15], which achieved comparable results with far fewer FLOPs and parameters, and it is now a very popular building block for lightweight CNNs. Many CNNs [16], [17] produced by network architecture search also adopt DWC to improve efficiency.
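
The savings can be made concrete with a quick parameter and FLOP count. The sketch below (plain Python, with hypothetical layer dimensions) treats standard convolution as group convolution with g = 1 and depthwise convolution as the extreme case g = C_in:

```python
def conv_params_flops(h, w, c_in, c_out, k, groups=1):
    """Parameters and multiply-accumulate count of a k x k group
    convolution over an h x w input (stride 1, 'same' padding,
    bias ignored). groups=1 is a standard convolution; groups=c_in
    with c_out=c_in is a depthwise convolution."""
    params = k * k * (c_in // groups) * c_out
    macs = params * h * w          # each weight is used once per output pixel
    return params, macs

h, w, c = 56, 56, 128              # hypothetical feature-map size
std = conv_params_flops(h, w, c, c, 3)             # standard 3x3
dwc = conv_params_flops(h, w, c, c, 3, groups=c)   # depthwise 3x3
pwc = conv_params_flops(h, w, c, c, 1)             # pointwise 1x1
print("standard :", std)
print("DWC + PWC:", tuple(a + b for a, b in zip(dwc, pwc)))
print("reduction:", (dwc[1] + pwc[1]) / std[1])    # ~ 1/c + 1/k^2
```

For a 3×3 kernel and 128 channels this gives the familiar roughly 8–9× reduction of DSC over standard convolution.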

DWC has a high memory access cost (MAC) to FLOPs ratio because of its large feature maps and few FLOPs. Consequently, the acceleration performance of the Graphics Processing Unit (GPU) for depthwise convolution cannot reach the theoretical value [16], [18]. FPGAs excel at low-precision computation, and their adaptability to new algorithms makes them well suited to supporting rapidly changing CNN architectures.
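
The bottleneck is easy to see from a roofline-style computation-to-communication (CTC) ratio, i.e. operations per byte of external memory traffic. The sketch below uses a simplified traffic model that is an assumption, not the paper's: hypothetical layer sizes, 1-byte fixed-point elements, and each input map, output map, and weight crossing the memory boundary exactly once:

```python
def ctc_ratio(h, w, c_in, c_out, k, groups=1, bytes_per_elem=1):
    """Operations per byte moved, under the idealized assumption that
    each feature-map element and each weight is transferred once."""
    weights = k * k * (c_in // groups) * c_out
    ops = 2 * weights * h * w                      # multiply + add
    traffic = (h * w * c_in + h * w * c_out + weights) * bytes_per_elem
    return ops / traffic

h, w, c = 56, 56, 128                              # hypothetical layer
print("standard 3x3 :", ctc_ratio(h, w, c, c, 3))
print("depthwise 3x3:", ctc_ratio(h, w, c, c, 3, groups=c))
```

Even under this generous model the depthwise layer has roughly two orders of magnitude fewer operations per byte than the standard layer, which is why it sits in the bandwidth-bound region of the roofline.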

Many hardware accelerators [19], [20], [21], [22] have been developed to improve the speed and power efficiency of compute-intensive CNNs. Moini et al. [23] exploited the inherent parallelism in CNNs to reduce the bandwidth, resource usage, and power consumption of the highly complex convolution operations; their accelerator used only 391 DSP48 slices and delivered 19.2 giga multiply-accumulate operations per second while consuming less than 10 watts (W). Wang et al. [24] implemented the VGG16 model on the Xilinx Virtex VC707 platform and achieved a frame rate of 33.80 frames per second (FPS) with an average performance of 1250.21 giga operations per second (GOPS). Ma et al. [25] studied convolution loop optimization before the hardware design phase and proposed a specific dataflow; their accelerator implemented NiN, VGG-16, and ResNet-50/ResNet-152, achieving 707.2 GOPS for ResNet-152. Azizimazreah et al. [26] exploited cross-layer shortcut reuse in CNN accelerators, and experimental results showed that the proposed Shortcut Mining achieves 53.3%, 58%, and 43% reductions in off-chip feature-map traffic for SqueezeNet, ResNet-34, and ResNet-152, respectively. However, most of these accelerators were designed for large CNNs and are not well suited to depthwise separable convolution (DSC) because of the special operation and low MAC/FLOPs ratio of the DWC.

With the wide application of DSC in lightweight CNNs [16], [27], an efficient hardware accelerator for DSC has become urgent. Existing DSC accelerators [17], [28] exhibit a low utilization rate of computational resources over the entire running phase. In this paper, we give a roofline model analysis of DSC, which guides the design of an efficient accelerator. We then propose spatial to channel (S2C), D2P&P2D, a processing element shared between DWC and pointwise convolution (SharePE), regulable parallelism (R-Parallel) in the computing unit, and a pre-load workflow to improve the resource utilization rate, reduce external memory access, and speed up the accelerator. Based on iSmart3 (the champion model of the IEEE/ACM Design Automation Conference System Design Contest, DAC'2019-SDC) [17], we designed an accelerator for SkyNet using a subset of these methods, which took 6th place in the DAC'2020-SDC. Applying all of the proposed approaches to Skrskr-SkyNet (2nd place in the DAC'2020-SDC) [28], we obtained state-of-the-art results for SkyNet.

Fig. 1 compares the energy score and inference accuracy (IoU) on FPGA of our accelerators against the top-3 DAC-SDC designs of the last two years. According to the DAC'2020-SDC evaluation, the energy score is ES = max{0, 1 + 0.2 × log2(Ē/E)}, where Ē is the average energy consumption across all teams and E is the energy consumption of the design; a higher energy score corresponds to lower energy consumption. Our accelerator based on Skrskr achieves 80.030 FPS with 0.731 IoU, a score that surpasses the 1st-place solution of the DAC'2020-SDC. Furthermore, we implemented the MobileNetV2 model on the Xilinx XC7Z100 FPGA platform and achieved a frame rate of 371.4 FPS and 0.19 GOPS per DSP (GOPS/DSP). The contributions of this work can be summarized as follows.

  • An efficient SharePE is proposed to compute depthwise separable convolution; it can efficiently compute both DWC and PWC with a high computing-resource utilization rate.

  • Regulable parallelism in the computing unit is proposed to compute the different PWC layers in the residual block, improving the utilization rate of the computational resources.

  • A spatial to channel (S2C) approach is proposed to accelerate the computation of the preceding DWC layers in CNNs, improving the utilization of computational resources and bandwidth.

  • D2P&P2D is proposed to reduce external memory access: D2P is adopted when the feature maps have few channels, and P2D is adopted when the number of channels is large.

  • A pre-load workflow is proposed for the entire accelerating system, reducing the waiting time between computing two images.

  • SkyNet was implemented with the proposed approaches on the resource-constrained Ultra96V2 FPGA platform for object detection and achieved state-of-the-art results.

  • MobileNetV2 was implemented on the XC7Z100 FPGA platform with the proposed improvements, obtaining a high FPS and throughput per DSP.
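
To give some intuition for the S2C idea listed above: the preceding layers of a CNN have large spatial dimensions but few channels, so a space-to-depth style rearrangement can trade spatial size for channel count and expose more channel-level parallelism to the processing elements. The paper's exact mapping is not reproduced here; the NumPy sketch below shows a generic 2×2 block rearrangement for illustration only:

```python
import numpy as np

def spatial_to_channel(x, block=2):
    """Rearrange (H, W, C) -> (H/block, W/block, C*block*block),
    moving each block x block spatial patch into the channel axis."""
    h, w, c = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # gather each patch together
    return x.reshape(h // block, w // block, c * block * block)

x = np.arange(16 * 16 * 3).reshape(16, 16, 3)      # small hypothetical input
y = spatial_to_channel(x)
print(x.shape, "->", y.shape)                      # (16, 16, 3) -> (8, 8, 12)
```

The transformation is lossless (every element survives), so a convolution over the rearranged tensor can use four times the channel parallelism on a quarter of the spatial positions.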

The remainder of this paper is organized as follows. Section 2 reviews two typical DSC CNNs, MobileNetV2 for image classification and SkyNet for object detection, and then introduces related work on hardware accelerators for depthwise separable convolution and the roofline model. In Section 3, the roofline model analysis of DSC is presented and SkyNet is analyzed with the roofline model. Section 4 describes the system architecture, including the dedicated accelerator architecture and the improvement approaches for accelerating depthwise separable convolution. The experimental results of the proposed accelerators for classification (MobileNetV2) and object detection (SkyNet) are analyzed and discussed in Section 5. Conclusions are given in Section 6.

Section snippets

MobileNetV2

MobileNetV2 [15] is a typical lightweight CNN constructed from DSC. It introduced inverted residuals and linear bottlenecks, which significantly decrease the number of operations and the memory required while retaining high accuracy. MobileNetV2 has 3.4 million (M) weights and 300 M multiply-adds and achieves 72% accuracy on the ImageNet dataset. Owing to this excellent performance, most DSC hardware accelerators [29], [30], [31], [32] choose MobileNetV2 to verify the performance
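
Following the MobileNetV2 paper, an inverted residual block expands the channels with a 1×1 PWC (expansion factor t), filters spatially with a 3×3 DWC, then projects back with a 1×1 linear bottleneck. The sketch below counts the weights of one such block for hypothetical channel sizes (batch-norm and bias terms ignored):

```python
def inverted_residual_params(c_in, c_out, t=6, k=3):
    """Weight count of one MobileNetV2-style inverted residual:
    1x1 expansion (factor t) -> k x k depthwise -> 1x1 linear
    projection. The skip connection adds no weights."""
    c_mid = t * c_in
    expand = c_in * c_mid          # 1x1 PWC
    dwc = k * k * c_mid            # depthwise: one k x k filter per channel
    project = c_mid * c_out        # 1x1 linear bottleneck
    return expand + dwc + project

# hypothetical block: 24 -> 24 channels, default expansion factor 6
print(inverted_residual_params(24, 24))
```

Note that the two 1×1 layers dominate the weight count, which is why the paper's regulable parallelism targets the different PWC layers of the residual block.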

Roofline model Analysis for Depthwise separable convolution

Depthwise separable convolution was first introduced in the Xception [27] architecture, which is wider than ResNet [6] but has a similar number of parameters thanks to the efficient depthwise separable convolution. DSC was subsequently adopted widely by lightweight CNNs [14], [18], [17]. A depthwise separable convolution performs a DWC followed by a PWC: the DWC extracts spatial features and the PWC fuses channel information, so a standard convolutional layer can be replaced by a DSC layer. Fig. 2 illustrates how the
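
The DWC-then-PWC factorization can be sketched as a naive NumPy forward pass (illustrative only, with hypothetical shapes; valid padding, stride 1, no optimization):

```python
import numpy as np

def depthwise_separable(x, dw_filters, pw_weights):
    """Naive DSC: a k x k depthwise pass per channel (valid padding,
    stride 1) followed by a 1 x 1 pointwise channel fusion."""
    h, w, c = x.shape
    k = dw_filters.shape[0]                        # dw_filters: (k, k, c)
    out = np.zeros((h - k + 1, w - k + 1, c))
    for i in range(out.shape[0]):                  # spatial filtering per channel
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]
            out[i, j, :] = np.sum(patch * dw_filters, axis=(0, 1))
    return out @ pw_weights                        # (c, c_out) fuses channels

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 4))                 # hypothetical input
y = depthwise_separable(x, rng.standard_normal((3, 3, 4)),
                        rng.standard_normal((4, 8)))
print(y.shape)                                     # (4, 4, 8)
```

The depthwise loop never mixes channels; only the final matrix product does, which is exactly the split that lets an accelerator schedule DWC and PWC on shared processing elements.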

System Architecture

In this section, the quantization scheme is first presented; the convolutional layer and batch normalization are fused before quantization. The overall architecture of the accelerating system is then introduced. The spatial to channel approach for accelerating computation and improving computing-resource utilization is described. D2P&P2D for reducing external memory access and increasing the CTC ratio is introduced. The SharePE for computing DWC and PWC, and the regulable parallelism

Results and Analysis

In this section, the roofline model analysis of the proposed accelerator based on the quantization method in Skrskr is presented. The effectiveness of the pre-load workflow, SharePE, R-Parallel, D2P&P2D, and S2C approaches is then evaluated separately on SkyNet. Finally, our accelerators are compared with other existing FPGA-based accelerators for DSC.

Conclusion

Depthwise separable convolution has fewer parameters and a lower computing cost than standard convolution and is widely used in lightweight convolutional neural networks. However, the acceleration performance of the GPU falls short of the theoretical value and its computational resource efficiency is low, which limits the achievable acceleration. In this paper, depthwise separable convolution is analyzed with the roofline model. Furthermore, several approaches are proposed to improve the efficiency of

CRediT authorship contribution statement

Guoqing Li: Conceptualization, Methodology, Writing - review & editing. Jingwei Zhang: Software, Data curation, Writing - original draft. Meng Zhang: Project administration, Supervision, Funding acquisition, Resources. Ruixia Wu: Software, Investigation, Visualization. Xinye Cao: Software, Formal analysis, Validation. Wenzhao Liu: Writing - review & editing, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research work was partly supported by National Key R&D Program of China (Project No. 2018YFB2202703) and the Key R&D Program of Guangdong Province (Project No. 2021B1101270006), and the Natural Science Foundation of Jiangsu Province (Project No. BK20201145).

Guoqing Li received the B.S. degree from Qingdao University, Qingdao, China, in 2014, the M.S. degree from South China Normal University, Guangzhou, China, in 2017. He is currently pursuing the Ph.D. degree with the National ASIC Engineering Technology Research Center, School of Electronics Science and Engineering, Southeast University, Nanjing, China. His current research interests include computer vision, convolutional neural networks, deep learning hardware accelerators.

References (49)

  • Y. Ma et al., ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler, Integration (2018)
  • T. Chen et al., An efficient sharing grouped convolution via Bayesian learning, IEEE Trans. Neural Networks Learn. Syst. (2021)
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition
  • K. He et al., Deep residual learning for image recognition
  • G. Huang et al., Densely connected convolutional networks
  • S. Xie et al., Aggregated residual transformations for deep neural networks
  • X. Zhang et al., ShuffleNet: an extremely efficient convolutional neural network for mobile devices
  • A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al., Mobilenets: Efficient convolutional neural...
  • M. Sandler et al., MobileNetV2: inverted residuals and linear bottlenecks
  • A. Howard, R. Pang, H. Adam, Q.V. Le, M. Sandler, B. Chen, W. Wang, L. Chen, M. Tan, G. Chu, V. Vasudevan, Y. Zhu,...
  • X. Zhang, H. Lu, C. Hao, J. Li, B. Cheng, Y. Li, K. Rupnow, J. Xiong, T. Huang, H. Shi, W.-M. Hwu, D. Chen, SkyNet: a...
  • N. Ma et al., ShuffleNet V2: practical guidelines for efficient CNN architecture design
  • G. Li et al., Efficient binary 3D convolutional neural network and hardware accelerator, J. Real-Time Image Process. (2021)
  • S. Moini et al., A resource-limited hardware accelerator for convolutional neural networks in embedded vision applications, IEEE Trans. Circuits Syst. II Express Briefs (2017)

    Jingwei Zhang is an Eng.D student at the National ASIC Center in School of Electronic Science & Engineering, Southeast University, China. His research interests include design space exploration for Integrated Circuits and deep learning hardware accelerators.

    Meng Zhang received the B.S. degree in electrical engineering from the China University of Mining and Technology, Xuzhou, China, in 1986, and the M.S. degree in bioelectronics and the Ph.D. degree in microelectronic engineering, as an on-the-job postgraduate student, from Southeast University, Nanjing, China, in 1993 and 2014, respectively. He is currently a Professor and a Faculty Adviser of Ph.D. graduates at the National ASIC System Research Center, School of Electronic Science and Engineering, Southeast University. He has published more than 40 refereed journal articles and international conference papers. He holds more than 90 patents, including some PCT and U.S. patents. His research interests include deep learning, machine learning, digital signal processing, digital communication systems, and digital integrated circuit design.

    Ruixia Wu is an M.S. student at the National ASIC Center in the School of Microelectronics, Southeast University, China. She received the B.S. degree from Xi’an University of Posts and Telecommunications, Xi’an, China, in 2019. Her research interests include deep learning techniques, neural architecture search, etc.

    Xinye Cao is a master student at National ASIC Engineering Technology Research Center in School of Electronics Science and Engineering, Southeast University, Nanjing, China. He also received the bachelor’s degree from Southeast University, Nanjing, China, in 2020. His research fields include computer vision, FPGA development, etc.

    Wenzhao Liu received the bachelor’s degree from the School of Electronic Science and Engineering, Southeast University, in 2017. She is currently a master’s student at the National ASIC Engineering Technology Research Center of Southeast University. Her research interests include computer vision, image processing, and network accelerators.
