Real-time semantic segmentation with weighted factorized-depthwise convolution

https://doi.org/10.1016/j.imavis.2021.104269

Highlights

  • Separate calculation in all dimensions and use simplified attention layer.

  • Present the full-dimensional continuous separation convolution module.

  • Propose the lateral asymmetric pyramid fusion module.

  • Design an asymmetric encoder-decoder network for real-time semantic segmentation.

Abstract

Semantic segmentation has achieved great success with the popularity of convolutional neural networks (CNNs). However, the heavy computational burden restricts the application of most existing networks on edge devices with strict inference-time constraints. To address this problem, this paper presents a weighted factorized-depthwise convolution network (WFDCNet), which contains full-dimensional continuous separation convolution (FCS) modules and a lateral asymmetric pyramid fusion (LAPF) module, aiming to obtain high accuracy without sacrificing inference speed. Specifically, the FCS module computes each dimension independently in a continuous separation process and uses a simplified SE (SSE) attention layer to reweight the channels, extracting rich feature information. The LAPF module eliminates semantic divergence and fuses feature maps of three different scales, combining information from the front-end and the back-end of the network. WFDCNet shows superior performance on the Cityscapes, CamVid, Mapillary Vistas and COCO-Stuff datasets. In particular, the experimental results show that our network achieves 73.7% mIoU on the Cityscapes dataset, with an inference speed of 102.6 FPS on a single RTX 2080 Ti GPU and 17.2 FPS on a Jetson TX2.

Introduction

As one of the most challenging tasks in computer vision, semantic segmentation aims to assign a label to each pixel of an image. Some state-of-the-art semantic segmentation networks [1–5] have achieved high accuracy on public datasets. For instance, MDCCNet [3] explores multi-scale context information and uses densely connected CRFs to enhance output features, reaching 75.5% mIoU on PASCAL VOC 2012. However, these methods are not suitable for autonomous driving and robot sensing tasks because of their high memory consumption and long inference time. To reduce overall computation, CCNet [6] adopts the RCCA module instead of the non-local block to model relationships between pixels, but its response time still does not meet real-time demands. Recently, real-time semantic segmentation algorithms that pursue a balance between speed and accuracy with few parameters have been proposed and have attracted increasing attention.

Some research focuses on designing basic feature extraction units. ERFNet [7] proposes the non-bottleneck-1D module and adopts factorized convolution to reduce computation time. Similarly, ESPNet-v2 [8] introduces the EESP module, which uses both depthwise separable and group point-wise convolutions. However, the accuracy of these two networks is not satisfactory. To improve prediction results, modules with better extraction capabilities [9,10] have been presented. As a representative of channel splitting and shuffling operations, AGLNet [11] adopts SS-nbt to achieve prominent results on self-driving datasets, but each module is computationally expensive because of its considerable number of convolution operations. The continuous shuffle dilated convolution module [12] addresses this computational cost, but it extracts only context and ignores other information. The method in [13] uses a class-aware edge loss module to capture edge information, and [14] proposes DASPP to extract local and non-local information. However, the complex components in these two modules greatly reduce inference speed.
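The savings behind the factorized and depthwise separable convolutions mentioned above can be checked with simple weight counting. The sketch below is only an illustration: it assumes 64 input and output channels, a 3×3 receptive field and no bias terms, and `conv_params` and `compare` are hypothetical helpers. The "factorized_depthwise" entry is a combination suggested by the paper's title, not the layout of any specific module described here.

```python
def conv_params(c_in, c_out, kh, kw, groups=1):
    """Number of weights in a 2-D convolution (bias omitted)."""
    return (c_in // groups) * c_out * kh * kw


def compare(c):
    """Weight counts for several ways to cover a 3x3 receptive field with c channels."""
    return {
        # One dense 3x3 convolution.
        "standard": conv_params(c, c, 3, 3),
        # Dense 3x1 followed by dense 1x3 (factorized convolution).
        "factorized": conv_params(c, c, 3, 1) + conv_params(c, c, 1, 3),
        # Depthwise 3x3 followed by a 1x1 pointwise convolution.
        "depthwise_separable": conv_params(c, c, 3, 3, groups=c)
        + conv_params(c, c, 1, 1),
        # Assumed combination: depthwise 3x1, depthwise 1x3, then 1x1 pointwise.
        "factorized_depthwise": conv_params(c, c, 3, 1, groups=c)
        + conv_params(c, c, 1, 3, groups=c)
        + conv_params(c, c, 1, 1),
    }


counts = compare(64)
for name, n in counts.items():
    print(f"{name:22s} {n:6d}")
```

With 64 channels the dense 3×3 layer needs 36864 weights, while the fully separated variant needs 4480, most of which sit in the 1×1 pointwise step.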

Modifying the feature extraction module alone does not satisfy the needs of real-time semantic segmentation. Multi-scale feature map fusion and information reuse also play a vital role. Some networks [15,16] use dense connections to make all layers cooperate, but this crude combination ignores semantic divergence. To solve this problem, NDNet [17] adopts an auxiliary loss to adjust the feature maps before fusion, but this causes a drop in speed. The networks in [18–20] use simple fusion methods in the back-end network to improve inference speed, at a clear cost in accuracy. To enhance performance, DFANet [21] reuses multi-level information through a multi-scale interconnected encoder structure. SFNet [22] presents FAM to learn the semantic flow between feature maps at adjacent scales and transfer it to high-resolution features. Nevertheless, these two methods introduce millions of parameters.

The above methods provide inspiration for subsequent research, but a better trade-off between real-time performance and prediction ability is still needed. To this end, this paper presents a lightweight and efficient real-time semantic segmentation network called WFDCNet. As shown in Fig. 1, our work focuses on designing the encoder structure and does not rely on any existing backbone network. The encoder is mainly composed of two core modules, FCS and LAPF, both of which are computationally efficient and perform well. More specifically, FCS separates the operations on the height, width and channel dimensions of the feature map and applies them in a continuous sequence, which reduces the number of parameters and the amount of computation. A simplified channel attention layer then redistributes the weight of each separated channel to make the extracted features more representative. In addition, feature maps from the front-end and back-end of the network carry different information, and semantic divergence exists between different depths. To address this problem, the LAPF module takes feature maps of three different scales as inputs and eliminates their divergence separately according to the characteristics of each feature map, achieving fast computation and effective fusion. The contributions are summarized as follows:

  1. A full-dimensional continuous separation convolution (FCS) module is proposed to extract abundant information from different receptive fields, where all dimensions (channel, height and width) are separated in continuous convolution operations and SSE is adopted to weight each independent channel.

  2. A novel lateral asymmetric pyramid fusion (LAPF) module is designed to fuse feature maps of different scales, which simultaneously uses the detail and boundary information of the front-end network and the semantic information of the back-end network.

  3. A weighted factorized-depthwise convolution network (WFDCNet) is presented. The evaluation results show that WFDCNet has superior performance on benchmark datasets, including Cityscapes, CamVid, Mapillary Vistas and COCO-Stuff.
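The separation of height, width and channel operations described for FCS can be illustrated with a small NumPy sketch. This is not the authors' FCS module, whose exact layer layout, dilation rates and SSE design are not given in this excerpt. It only assumes a depthwise k×1 convolution along height, a depthwise 1×k convolution along width, a 1×1 pointwise convolution across channels, and a single-layer sigmoid channel gate standing in for SSE. All function names are hypothetical.

```python
import numpy as np


def depthwise_1d(x, kernel, axis):
    """Per-channel 1-D cross-correlation with zero 'same' padding.

    x: (C, H, W) feature map; kernel: (C, K) with odd K;
    axis: 1 to slide along height, 2 to slide along width.
    """
    k = kernel.shape[1]
    pad = k // 2
    widths = [(0, 0), (0, 0), (0, 0)]
    widths[axis] = (pad, pad)
    xp = np.pad(x, widths)  # zero padding on the chosen axis only
    n = x.shape[axis]
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        window = np.take(xp, np.arange(i, i + n), axis=axis)
        out += kernel[:, i, None, None] * window
    return out


def pointwise(x, weight):
    """1x1 convolution that mixes channels; weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def channel_gate(x, fc):
    """Assumed SE-style gate: global average pool, one linear layer, sigmoid."""
    squeezed = x.mean(axis=(1, 2))
    scale = 1.0 / (1.0 + np.exp(-(fc @ squeezed)))
    return x * scale[:, None, None]


def separated_block(x, k_h, k_w, w_pw, fc):
    """Height, width and channel operations applied one after another."""
    y = depthwise_1d(x, k_h, axis=1)   # height only
    y = depthwise_1d(y, k_w, axis=2)   # width only
    y = pointwise(y, w_pw)             # channels only
    return channel_gate(y, fc)         # reweight each channel


rng = np.random.default_rng(0)
feature = rng.standard_normal((8, 16, 32))
result = separated_block(
    feature,
    rng.standard_normal((8, 3)),
    rng.standard_normal((8, 3)),
    rng.standard_normal((8, 8)),
    rng.standard_normal((8, 8)),
)
print(result.shape)
```

Each step touches a single dimension, so the spatial size is preserved and only the pointwise step carries a weight matrix that grows with the square of the channel count.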

Section snippets

Related work

In this section, we briefly review the current state of real-time semantic segmentation and then introduce some common convolutions that improve computational efficiency.

Network architecture

In this section, the FCS module is first described as the basic unit of feature extraction. Then the LAPF module, which fuses feature maps from different levels of the network, is proposed. Finally, the overall real-time semantic segmentation network based on these two modules, named WFDCNet, is introduced.

Experiment

In this section, the datasets and the experimental details are briefly introduced, and then the ablation and comparison experiments are described in detail. The experiments mainly compare the number of network parameters, frames per second (FPS), floating point operations (FLOPs) and mean intersection over union (mIoU).
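For readers unfamiliar with the accuracy metric, mIoU can be computed from a confusion matrix as sketched below. This is a generic formulation rather than the authors' evaluation code: `mean_iou` is a hypothetical helper, labels are assumed to be integer class indices, classes absent from both prediction and ground truth are skipped, and the 255 ignore label is an assumption borrowed from common Cityscapes practice.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection over union from integer label maps of equal shape."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    keep = target != ignore_index          # drop pixels marked as ignored
    pred, target = pred[keep], target[keep]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0                    # skip classes that never appear
    return float((inter[present] / union[present]).mean())


score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(round(score, 4))
```

In the toy call, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.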

Conclusion

In this paper, a novel real-time semantic segmentation network, called WFDCNet, is proposed. The network is mainly composed of two core modules: the full-dimensional continuous separation convolution module and the lateral asymmetric pyramid fusion module. The experiments support the following conclusions: (a) The FCS module computes all dimensions separately and independently in a continuous process and extracts substantial feature information; (b) The LAPF module is able to

Declaration of Competing Interest

None.

References (44)

  • H. Zhao et al. Pyramid scene parsing network

  • L. Chen et al. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs

  • Q. Zhou et al. Multi-scale deep context convolutional neural networks for semantic segmentation

  • O. Ronneberger et al. U-net: Convolutional networks for biomedical image segmentation

  • G. Lin et al. Refinenet: multipath refinement networks with identity mappings for high resolution semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition (2017)

  • Z. Huang et al. Ccnet: Criss-cross attention for semantic segmentation

  • R. Eduardo et al. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation

  • M. Sachin et al. Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network

  • W. Jiang et al. Lrnnet: A light-weighted network with efficient reduced non-local operation for real-time semantic segmentation

  • J. Lin et al. Ccfcnet: Channel-communication factorization convnet for real-time semantic segmentation. J. Phys. Conf. Ser. (2020)

  • Q. Zhou et al. Aglnet: Towards real-time semantic segmentation of self-driving images via attention-guided lightweight network

  • X. Hu et al. Efficient fast semantic segmentation using continuous shuffle dilated convolutions

  • H. Han et al. Using channel-wise attention for deep cnn based real-time semantic segmentation with class-aware edge information

  • G. Dong et al. Real-time high-performance semantic image segmentation of urban street scenes

  • S. Lo et al. Efficient dense modules of asymmetric convolution for real-time semantic segmentation

  • G. Huang et al. Densely connected convolutional networks

  • Y. Zhengeng et al. Small object augmentation of urban scenes for real-time semantic segmentation

  • J. Kim et al. Efficient semantic segmentation using spatio-channel dilated convolutions

  • M. Ma et al. Rtsnet: Real-time semantic segmentation network for outdoor scenes

  • J. Wang et al. Adscnet: Asymmetric depthwise separable convolution for semantic segmentation in real-time

  • H. Li et al. Dfanet: Deep feature aggregation for real-time semantic segmentation

  • X. Li et al. Semantic flow for fast and accurate scene parsing

This work is supported by the Natural Science Foundation of Hebei Province (No. F2019203320)