Real-time semantic segmentation with local spatial pixel adjustment

https://doi.org/10.1016/j.imavis.2022.104470

Highlights

  • Present a dual-branch decoding fusion module to fuse multiple kinds of information.

  • Propose a spatial pixel cross-correlation block to capture relationships in local space.

  • Design a local spatial pixel adjustment network for real-time semantic segmentation.

Abstract

Research on semantic segmentation networks has achieved significant breakthroughs recently. However, most methods have difficulty utilizing the information generated at each stage, which results in pixel value dislocation and blurred boundaries for small-scale objects. To overcome these challenges, a local spatial pixel adjustment network (LSPANet) is proposed in this paper, consisting mainly of a dual-branch decoding fusion (DDF) module and a spatial pixel cross-correlation (SPCC) block. Specifically, the DDF module takes high-level and low-level feature maps from different stages as input and gradually eliminates the discrepancies between their information so as to fuse the variety of information extracted in the encoder stage. The SPCC block adopts a horizontal spatial pixel adjustment (HSPA) module and a vertical spatial pixel adjustment (VSPA) module to capture the relationships between pixel values in the local horizontal and vertical spaces respectively, and then assigns importance to all values based on these relationships. LSPANet is evaluated on the Cityscapes and CamVid datasets. The experimental results show that our network achieves 77.1% mIoU with 2M parameters on the challenging Cityscapes dataset, with an inference speed exceeding 30 FPS on a single RTX 2080 Ti GPU.

Introduction

Semantic segmentation achieves pixel-wise classification by assigning a label to each pixel of an image. Some networks [[1], [2], [3], [4], [5]] with excellent accuracy have heavy computational costs and cannot be applied to real-time tasks such as autonomous driving. To achieve a better balance between accuracy and inference speed with fewer parameters, ENet [6] and ICNet [7] employ lightweight feature extraction modules to reduce network complexity and improve real-time performance. However, these methods often lose substantial feature information and do not meet accuracy requirements. Therefore, some researchers focus on the design of efficient fusion modules. The methods in [[8], [9], [10]] use simple fusion operations such as up-sampling and skip connections to merge multi-level feature maps; regrettably, this is insufficient for learning multi-scale information during fusion. G. Dong [11] proposes a feature fusion network and uses dilated convolution to eliminate differences in the pixel positions of feature maps. CANet [12] fuses a high-resolution branch for effective spatial detail with a context branch built from global aggregation and local distribution blocks. In addition, [13,14] refine the information contained in feature maps at different levels and selectively fuse different kinds of information. RegSeg [15] adopts a decoder that preserves more local information by fusing three feature-map scales from the encoder. However, while these methods dramatically improve accuracy, they also bring more parameters and computation. To reduce the complexity of fusion, DFFNet [16] proposes a multi-level feature fusion module, which enhances the semantic consistency between feature maps through two attention refinement blocks to realize the joint learning of spatial and semantic information. Meanwhile, WFDCNet [17] fuses partial spatial detail and rich semantic information in fast decoding blocks to improve segmentation accuracy. Nevertheless, the boundary and detail information in the shallow layers is not adequately learned, which inevitably results in information loss.

For extracting fine-grained information, feature fusion alone is not enough; it is also essential to assign importance to each pixel value in the feature maps. To distinguish the roles of pixel values across channels, MSSANet [18] uses a channel attention mechanism to adjust the feature maps in the encoder and decoder respectively. Similarly, [11] proposes convolutional attention modules and sequentially inserts them into four different blocks in the decoder stage to capture important information. However, these approaches lengthen learning time and do not address the interaction between feature maps at different levels and channels. Considering the differences in spatial and semantic information between low-level and high-level feature maps, [19] proposes a channel-attention-based feature fusion module to guide the channel adjustment of feature maps at different levels. CARNet [20] uses the merged feature map to generate attention and semantic vectors through convolution and nonlinear operations, where the attention vector calculates fusion weights for multiple feature maps to promote fusion and the semantic vector constructs a semantic context loss for regularization during training. DABNet [21] uses channel weights generated at a lower level to guide the up-sampling of higher-level feature maps. However, these methods are of limited help in solving the spatial pixel discrepancy problem. Therefore, [22,23] adjust pixel values in both the spatial and channel dimensions: the former makes the spatial and channel attention mechanisms guide each other's learning, while the latter adjusts the importance of pixel values in the spatial and channel dimensions sequentially. Moreover, [24,25] use attention mechanisms over different dimensions to promote interaction between different kinds of contextual information, but dual attention mechanisms working together in the same stage do not improve accuracy well. WFDCNet [17] adjusts the channels at each stage of the encoder and achieves a high accuracy of 73.7%. However, local information in the cross space is not taken seriously, which causes pixel value dislocation.

The studies above have contributed to real-time semantic segmentation, but problems remain in handling small objects and pixel value dislocation. In this paper, a lightweight local spatial pixel adjustment network (LSPANet) is proposed. LSPANet is an encoder-decoder structure in which WFDCNet [17] is employed as the encoder and the feature extraction stage is improved. A dual-branch decoding fusion (DDF) module and a spatial pixel cross-correlation (SPCC) block are proposed for the decoder. The DDF module fuses rich semantic information and sufficient spatial detail to alleviate the boundary ambiguity of small targets. The SPCC block relieves pixel value misalignment by sequentially establishing relationships between pixel values in the horizontal and vertical directions. The contributions are summarized in the following aspects:

(1) A dual-branch decoding fusion module is presented to fuse feature maps from different stages and eliminate semantic and pixel-position differences with an efficient calculation, refining a variety of important information from deep and shallow features.

(2) A spatial pixel cross-correlation block is adopted to establish spatial relationships between pixel values, first in the horizontal and then in the vertical direction, through the horizontal and vertical spatial pixel adjustment modules; pixel values with learning errors are thereby corrected and adjusted toward the optimum (see the illustrative sketch after this list).

(3) A novel lightweight local spatial pixel adjustment network is proposed. LSPANet achieves competitive performance on Cityscapes [26] and CamVid [27] with fewer parameters and lower computational cost than the state of the art.
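Since the exact formulation of HSPA and VSPA is not reproduced in this excerpt, the following is a minimal PyTorch sketch of one plausible reading of contribution (2): each module gathers local context along a single spatial axis with a 1×k (or k×1) depthwise convolution and turns it into per-pixel importance weights. The kernel size k, the depthwise-convolution choice, and the sigmoid gating are illustrative assumptions, not the authors' design.

```python
# Illustrative sketch only: one plausible reading of the SPCC block
# (HSPA followed by VSPA). The kernel size k and the sigmoid gating
# are assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn


class DirectionalPixelAdjust(nn.Module):
    """Reweights each pixel using context from a local 1-D neighborhood.

    A 1xk (horizontal) or kx1 (vertical) depthwise convolution gathers
    local cross-correlation along one spatial axis; a sigmoid turns the
    result into per-pixel importance weights.
    """

    def __init__(self, channels: int, k: int = 7, horizontal: bool = True):
        super().__init__()
        kernel = (1, k) if horizontal else (k, 1)
        pad = (0, k // 2) if horizontal else (k // 2, 0)
        self.context = nn.Conv2d(channels, channels, kernel,
                                 padding=pad, groups=channels, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(self.context(x))


class SPCC(nn.Module):
    """Spatial pixel cross-correlation: HSPA then VSPA, applied sequentially."""

    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        self.hspa = DirectionalPixelAdjust(channels, k, horizontal=True)
        self.vspa = DirectionalPixelAdjust(channels, k, horizontal=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.vspa(self.hspa(x))


if __name__ == "__main__":
    feat = torch.randn(1, 64, 64, 128)  # N, C, H, W
    print(SPCC(64)(feat).shape)         # torch.Size([1, 64, 64, 128])
```

The sequential composition (the horizontal pass feeding the vertical pass) mirrors the paper's description of correcting pixel values along one axis at a time rather than with a full 2-D attention map, which keeps the cost linear in the neighborhood size.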

Section snippets

Related work

In this section, we briefly review the current state of research on real-time semantic segmentation and then introduce the attention mechanisms applied in computer vision tasks.

Network architecture

In this section, the DDF module that integrates two levels of feature maps in the decoder is proposed first. Then the SPCC block is described as the basic unit of pixel value adjustment. Finally, a real-time semantic segmentation network based on these two modules, named LSPANet, is introduced.
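The full module definitions are truncated in this excerpt, so the sketch below only illustrates how a dual-branch fusion of a low-level and a high-level feature map could be wired in PyTorch. The 1×1 projections, the additive branch, and the gating branch are hypothetical stand-ins for the DDF internals, not the paper's architecture.

```python
# Minimal sketch of a generic dual-branch decoding fusion, assuming
# (hypothetically) one additive refinement branch and one gating branch;
# the real DDF module is defined in the paper and only approximated here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DDF(nn.Module):
    """Aligns a high-level map to the low-level resolution, then fuses
    the two maps along two parallel branches."""

    def __init__(self, low_ch: int, high_ch: int, out_ch: int):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, out_ch, 1, bias=False)
        self.high_proj = nn.Conv2d(high_ch, out_ch, 1, bias=False)
        # Branch 1: additive fusion refined by a 3x3 convolution.
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))
        # Branch 2: semantic features gate the spatial details.
        self.gate = nn.Sequential(nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid())

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Up-sample the high-level map to remove the resolution mismatch.
        high = F.interpolate(self.high_proj(high), size=low.shape[2:],
                             mode="bilinear", align_corners=False)
        low = self.low_proj(low)
        return self.refine(low + high) + low * self.gate(high)


if __name__ == "__main__":
    low = torch.randn(1, 32, 64, 128)   # earlier stage, higher resolution
    high = torch.randn(1, 128, 16, 32)  # later stage, lower resolution
    print(DDF(32, 128, 64)(low, high).shape)  # torch.Size([1, 64, 64, 128])
```

Up-sampling the high-level map before fusion is one common way to remove the resolution mismatch between stages; whether DDF resolves the pixel-position discrepancy this way or differently is not specified in this excerpt.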

Datasets

We experiment on two popular benchmarks, Cityscapes and CamVid, and conduct the ablation experiments on Cityscapes. In the experiments, only the 5000 fine-annotated images from Cityscapes are used, and the dataset is divided into training, validation, and test sets in a ratio of approximately 6:1:3. The CamVid dataset includes 701 images with a resolution of 720 × 960 pixels, of which 367, 233, and 101 are used for training, validation, and testing, respectively.
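As a quick sanity check on the quoted splits, the snippet below assumes the standard Cityscapes fine-annotation split of 2975/500/1525 images (an assumption; the text only states the 5000 total and the ~6:1:3 ratio), alongside the CamVid numbers given above.

```python
# Sanity-check the dataset splits quoted in the text. The Cityscapes
# per-split counts assume the standard fine-annotation split (an
# assumption in this sketch); CamVid counts are as stated above.
splits = {
    "Cityscapes": (2975, 500, 1525),  # sums to 5000 fine-annotated images
    "CamVid": (367, 233, 101),        # 701 images at 720 x 960 pixels
}
for name, (train, val, test) in splits.items():
    total = train + val + test
    print(f"{name}: total={total}, "
          f"ratio {train / val:.2f} : 1 : {test / val:.2f}")
# Cityscapes: total=5000, ratio 5.95 : 1 : 3.05  (i.e. roughly 6:1:3)
# CamVid: total=701, ratio 1.58 : 1 : 0.43
```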

Experiment details

All experiments are

Conclusion

In this paper, a local spatial pixel adjustment network is proposed for real-time semantic segmentation. The presented network is mainly composed of two core structures: the dual-branch decoding fusion (DDF) module and the spatial pixel cross-correlation (SPCC) block. Among them, the DDF module can simultaneously learn the low-level and high-level feature maps to maximize the capture of boundary and detail information while preserving the integrity of semantic information at the

CRediT authorship contribution statement

Cunjun Xiao: Conceptualization, Methodology, Writing – original draft. Xingjun Hao: Validation, Writing – review & editing. Haibin Li: Conceptualization, Writing – review & editing. Yaqian Li: Supervision, Writing – review & editing. Wenming Zhang: Supervision, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant 62106214, and the Natural Science Foundation of Hebei Province under grant F201920311.

References (46)

  • H. Zhao et al., ICNet for real-time semantic segmentation on high-resolution images, European Conference on Computer Vision (2018)

  • H. Li et al., DFANet: deep feature aggregation for real-time semantic segmentation, IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

  • A. Das et al., Design of real-time semantic segmentation decoder for automated driving, arXiv (2019)

  • R. Poudel et al., Fast-SCNN: fast semantic segmentation network, arXiv (2019)

  • G. Dong et al., Real-time high-performance semantic image segmentation of urban street scenes, IEEE Trans. Intell. Transp. Syst. (2021)

  • C. Zhang et al., CANet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning, IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

  • M. Oršić et al., In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images, IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

  • R. Gao, Rethink dilated convolution for real-time semantic segmentation, arXiv (2021)

  • M. Xu et al., MSSA-Net: multi-scale self-attention network for breast ultrasound image segmentation

  • S. Hao et al., Contextual attention refinement network for real-time semantic segmentation, IEEE Access (2020)

  • G. Li et al., Depth-wise asymmetric bottleneck with point-wise aggregation decoder for real-time semantic segmentation in urban scenes, IEEE Access (2020)

  • M. Liu et al., Feature pyramid encoding network for real-time semantic segmentation, arXiv (2019)

  • Z. Quan et al., AGLNet: towards real-time semantic segmentation of self-driving images via attention-guided lightweight network, Appl. Soft Comput. (2020)