Real-time semantic segmentation with weighted factorized-depthwise convolution☆
Introduction
As one of the most challenging tasks in computer vision, semantic segmentation aims to assign a label to each pixel of an image. Some state-of-the-art semantic segmentation networks [[1], [2], [3], [4], [5]] have achieved high accuracy on public datasets. For instance, MDCCNet [3] explores multi-scale context information and uses densely connected CRFs to enhance output features, achieving an impressive 75.5% mIoU on PASCAL VOC 2012. However, these methods are unsuitable for autonomous driving and robot sensing tasks because they require high memory consumption and long inference times. To reduce overall computation, CCNet [6] adopts an RCCA module instead of a non-local block to build relationships between pixels, but its response time still does not satisfy the demands. Recently, real-time semantic segmentation algorithms have been proposed to pursue a balance between speed and accuracy with few parameters, which has attracted increasing attention.
Some research focuses on designing basic feature extraction units. ERFNet [7] proposes the non-bottleneck-1D module and adopts factorized convolution to reduce computation time. Similarly, ESPNetv2 [8] introduces the EESP module, which utilizes both depthwise separable and grouped point-wise convolutions. However, the accuracy of these two networks is not satisfactory. To improve prediction results, some modules [9,10] with better extraction capabilities have been presented. As a representative of channel splitting and shuffling operations, AGLNet [11] adopts SS-nbt to achieve prominent results on self-driving datasets, but each module is computationally expensive due to its considerable number of convolution operations. The continuous shuffle dilated convolution module [12] alleviates this computation problem, but it only extracts context and ignores other information. The method in [13] utilizes a class-aware edge loss module to capture edge information, and [14] proposes DASPP to extract local and non-local information. However, the complex components of these two modules greatly sacrifice inference speed.
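The parameter savings of the factorized convolution adopted by ERFNet's non-bottleneck-1D design can be illustrated with a minimal PyTorch sketch (the channel count of 64 is an arbitrary assumption, not taken from any of the cited networks): replacing a 3×3 convolution with a 3×1 convolution followed by a 1×3 convolution preserves the receptive field while reducing the parameter count from 9C² to 6C².

```python
import torch
import torch.nn as nn

C = 64  # arbitrary channel count for illustration

# Standard 3x3 convolution: C * C * 9 = 36,864 weights
standard = nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)

# Factorized: 3x1 then 1x3 (same receptive field): 2 * C * C * 3 = 24,576 weights
factorized = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(3, 1), padding=(1, 0), bias=False),
    nn.Conv2d(C, C, kernel_size=(1, 3), padding=(0, 1), bias=False),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, C, 32, 32)
assert standard(x).shape == factorized(x).shape  # identical output shape
print(n_params(standard), n_params(factorized))  # 36864 24576
```

The same factorization underlies the asymmetric convolutions referenced throughout this line of work; the depthwise variant (adding `groups=C`) reduces the cost further at some expense in cross-channel mixing.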
Modifying the feature extraction module alone cannot satisfy the needs of real-time semantic segmentation; multi-scale feature map fusion and information reuse also play a vital role. Some networks [15,16] utilize dense connections to make all layers cooperate, but this crude combination ignores the semantic divergence between them. To solve this problem, NDNet [17] adopts an auxiliary loss to adjust the feature maps before fusion, but this causes a drop in speed. The methods in [[18], [19], [20]] use simple fusion in the back-end network to improve inference speed, but accuracy is clearly sacrificed. To enhance performance, DFANet [21] reuses multiple kinds of information through a multi-scale interconnected encoder structure, and SFNet [22] presents FAM to learn the semantic flow between feature maps at adjacent scales and transfer it to high-resolution features. Nevertheless, these two methods bring millions of parameters.
The above methods provide inspiration for subsequent research, but a better trade-off is still needed between real-time performance and prediction ability. To this end, a lightweight and efficient real-time semantic segmentation network, called WFDCNet, is presented in this paper. As shown in Fig. 1, our work focuses on designing the encoder structure and does not rely on any existing backbone network. The encoder is mainly composed of two core modules, FCS and LAPF, both of which offer high computational efficiency and outstanding performance. More specifically, FCS separates the operations on the height, width and channel dimensions of the feature map while keeping the process continuous; this effectively reduces the number of parameters and the amount of calculation. A simplified channel attention layer then redistributes the weight of each separated channel to make the extracted features more representative. In addition, feature maps of the front-end and back-end networks carry different information, and semantic divergence exists across depths. Considering this problem, the LAPF module selectively takes three feature maps of different scales as inputs and eliminates their divergence separately according to the characteristics of each map, achieving rapid calculation and thorough integration. The contributions are summarized as follows:
- 1.
A full-dimensional continuous separation convolution (FCS) module is proposed to extract abundant information from different receptive fields, where all dimensions (channel, height and width) are separated in continuous convolution operations and SSE is adopted to weight each independent channel.
- 2.
A novel lateral asymmetric pyramid fusion (LAPF) module is designed to fuse feature maps of different scales, which simultaneously utilizes the detail and boundary information of the front-end network and the semantic information of the back-end network.
- 3.
A weighted factorized-depthwise convolution network (WFDCNet) is presented. Evaluation results show that WFDCNet achieves superior performance on benchmark datasets, including Cityscapes, CamVid, Mapillary Vistas and COCO-Stuff.
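The combination described in contribution 1 — spatially separated depthwise convolutions followed by a channel-wise reweighting — can be sketched as follows. This is an illustrative reading of the design, not the paper's actual FCS module (whose exact structure appears in the network architecture section); the block structure, `ChannelWeight` gate and dilation choice are assumptions.

```python
import torch
import torch.nn as nn

class ChannelWeight(nn.Module):
    """SE-style channel reweighting: global average pool, then a per-channel
    sigmoid gate. A plausible form of a simplified SSE layer, assumed here."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = torch.sigmoid(self.fc(x.mean(dim=(2, 3), keepdim=True)))
        return x * w

class SeparatedBlock(nn.Module):
    """Height, width and channel handled by separate, consecutive convolutions:
    depthwise 3x1 and 1x3 (spatial, one filter per channel), then a pointwise
    1x1 (channel mixing), followed by the channel gate and a residual add."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, (3, 1), padding=(dilation, 0),
                                dilation=(dilation, 1), groups=channels, bias=False)
        self.conv_w = nn.Conv2d(channels, channels, (1, 3), padding=(0, dilation),
                                dilation=(1, dilation), groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.gate = ChannelWeight(channels)

    def forward(self, x):
        return x + self.gate(self.pointwise(self.conv_w(self.conv_h(x))))

block = SeparatedBlock(64, dilation=2)  # dilation enlarges the receptive field
y = block(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```

Varying the dilation across stacked blocks is the usual way such designs collect information from different receptive fields, as contribution 1 describes.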
Related work
In this section, we briefly review the current state of real-time semantic segmentation and then introduce some common convolutions that improve calculation efficiency.
Network architecture
In this section, the FCS module is first described as the basic unit of feature extraction. Then the LAPF module, which fuses feature maps of different levels in the network, is presented. Finally, the overall real-time semantic segmentation network based on these two modules, named WFDCNet, is introduced.
Experiment
In this section, the datasets and experimental details are briefly introduced, and then the ablation and comparison experiments are described in detail. The number of network parameters, frames per second (FPS), floating point operations (FLOPs) and mean intersection over union (mIoU) are the main quantities compared in our experiments.
Conclusion
In this paper, a novel real-time semantic segmentation network, called WFDCNet, is proposed. The presented network is mainly composed of two core modules: the full-dimensional continuous separation convolution (FCS) module and the lateral asymmetric pyramid fusion (LAPF) module. The experiments support the following conclusions: (a) the FCS module realizes continuous separation and independence of all dimensions in calculation and extracts substantial feature information; (b) the LAPF module is able to
Declaration of Competing Interest
None.
References (44)
- Pyramid scene parsing network
- DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs
- Multi-scale deep context convolutional neural networks for semantic segmentation
- U-Net: convolutional networks for biomedical image segmentation
- RefineNet: multi-path refinement networks with identity mappings for high-resolution semantic segmentation, IEEE Conference on Computer Vision and Pattern Recognition (2017)
- CCNet: criss-cross attention for semantic segmentation
- ERFNet: efficient residual factorized ConvNet for real-time semantic segmentation
- ESPNetv2: a light-weight, power efficient, and general purpose convolutional neural network
- LRNNet: a light-weighted network with efficient reduced non-local operation for real-time semantic segmentation
- CCFCNet: channel-communication factorization ConvNet for real-time semantic segmentation, J. Phys. Conf. Ser. (2020)
- AGLNet: towards real-time semantic segmentation of self-driving images via attention-guided lightweight network
- Efficient fast semantic segmentation using continuous shuffle dilated convolutions
- Using channel-wise attention for deep CNN based real-time semantic segmentation with class-aware edge information
- Real-time high-performance semantic image segmentation of urban street scenes
- Efficient dense modules of asymmetric convolution for real-time semantic segmentation
- Densely connected convolutional networks
- Small object augmentation of urban scenes for real-time semantic segmentation
- Efficient semantic segmentation using spatio-channel dilated convolutions
- RTSNet: real-time semantic segmentation network for outdoor scenes
- ADSCNet: asymmetric depthwise separable convolution for semantic segmentation in real-time
- DFANet: deep feature aggregation for real-time semantic segmentation
- Semantic flow for fast and accurate scene parsing
☆ This work is supported by the Natural Science Foundation of Hebei Province (No. F2019203320).