Real-time semantic segmentation with local spatial pixel adjustment
Introduction
Semantic segmentation achieves pixel-wise classification by assigning a label to each pixel of an image. Networks with excellent accuracy [1–5] incur heavy computational costs and cannot be applied to real-time tasks such as autonomous driving. To achieve a better balance between accuracy and inference speed with fewer parameters, ENet [6] and ICNet [7] employ lightweight feature extraction modules to reduce network complexity and improve real-time performance. However, these methods often lose substantial feature information and do not meet accuracy requirements. Therefore, some researchers focus on designing efficient fusion modules. [8–10] use simple fusion operations such as up-sampling and skip connections to merge multi-level feature maps; regrettably, these are insufficient to learn multi-scale information during fusion. G. Dong [11] proposes a feature fusion network and uses dilated convolution to eliminate differences in the pixel positions of feature maps. CANet [12] fuses a high-resolution branch for effective spatial detail with a context branch built from global aggregation and local distribution blocks. In addition, [13,14] refine the information contained in feature maps at different levels and selectively fuse it. RegSeg [15] adopts a decoder that preserves more local information by fusing three scales of feature maps from the encoder. However, even though these methods dramatically improve accuracy, they also bring more parameters and computation. To reduce the complexity of fusion, DFFNet [16] proposes a multi-level feature fusion module that enhances semantic consistency between feature maps through two attention refinement blocks, realizing joint learning of spatial and semantic information. Meanwhile, WFDCNet [17] fuses partial spatial detail and rich semantic information in fast decoding blocks to improve segmentation accuracy.
Nevertheless, the boundary and detail information in the shallow layers is not adequately learned, which inevitably results in information loss.
Feature fusion alone is not enough to extract fine-grained information; it is also essential to assign an importance to each pixel value in the feature maps. To distinguish the role of pixel values across channels, MSSANet [18] applies a channel attention mechanism to adjust feature maps in the encoder and decoder respectively. Similarly, [11] proposes convolutional attention modules and sequentially inserts them into four different blocks in the decoder stage to capture important information. However, these approaches lengthen learning time and do not address the interaction between feature maps at different levels and channels. Considering the differences in spatial and semantic information between low-level and high-level feature maps, [19] proposes a channel-attention-based feature fusion module to guide the channel adjustment of feature maps at different levels. CARNet [20] uses the merged feature map to generate attention and semantic vectors through convolution and nonlinear operations, where the attention vector computes the fusion weights of multiple feature maps to promote fusion and the semantic vector constructs a semantic context loss for regularized training. DABNet [21] uses the channel weights generated at the lower level to guide the up-sampling of higher-level feature maps. These methods, however, are limited in solving the spatial pixel discrepancy problem. Therefore, [22,23] adjust pixel values in both spatial and channel dimensions: the former makes the spatial and channel attention mechanisms guide each other's learning, while the latter adjusts the importance of pixel values in the spatial and channel dimensions sequentially. Moreover, [24,25] use attention mechanisms of different dimensions to promote interaction between different kinds of contextual information, but a dual attention mechanism acting in the same stage does not improve accuracy well.
WFDCNet [17] adjusts the channels at each stage of the encoder and achieves a high accuracy of 73.7%. However, cross-spatial local information is not fully exploited, which causes pixel-value dislocation.
The above studies have contributed to real-time semantic segmentation, but problems remain in handling small objects and pixel-value dislocation. In this paper, a lightweight local spatial pixel adjustment network (LSPANet) is proposed. LSPANet is an encoder–decoder structure in which WFDCNet [17] serves as the encoder with an improved feature extraction stage. A dual-branch decoding fusion (DDF) module and a spatial pixel cross-correlation (SPCC) block are proposed for the decoder. The DDF module fuses rich semantic information with sufficient spatial detail to alleviate the boundary ambiguity of small targets. The SPCC block relieves pixel-value misalignment by sequentially establishing relationships between pixel values in the horizontal and vertical directions. The contributions are summarized in the following aspects:
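The decoder-side fusion of a high-resolution, detail-rich shallow map with a low-resolution semantic map can be illustrated with a minimal NumPy sketch. The nearest-neighbour upsampling and sigmoid gating used here are illustrative assumptions, not the paper's exact DDF design:

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_branch_fuse(low, high):
    """Fuse a high-resolution low-level map `low` (C, H, W) with a
    low-resolution high-level map `high` (C, H/2, W/2).
    A per-pixel sigmoid gate derived from the semantic branch weights
    the spatial-detail branch before summation (illustrative design)."""
    high_up = upsample_nearest(high, 2)                   # align resolutions
    gate = sigmoid(high_up.mean(axis=0, keepdims=True))   # (1, H, W) weight map
    return gate * low + high_up                           # gated fusion

low = np.random.rand(8, 16, 16)   # shallow feature: spatial detail
high = np.random.rand(8, 8, 8)    # deep feature: semantic context
fused = dual_branch_fuse(low, high)
print(fused.shape)  # (8, 16, 16)
```

The gate lets semantically confident regions pass more spatial detail through, which is one common way such two-branch decoders suppress noise in shallow features.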
(1) A dual-branch decoding fusion module is presented to fuse feature maps from different stages and eliminate semantic and pixel-position differences with efficient computation, refining a variety of important information from deep and shallow features.
(2) A spatial pixel cross-correlation block is adopted to establish spatial relationships between pixel values, first in the horizontal direction and then in the vertical direction, through the corresponding spatial pixel adjustment modules. Pixel values with learning errors are thereby corrected and adjusted toward the optimum.
(3) A novel lightweight local spatial pixel adjustment network is proposed. LSPANet achieves competitive performance on Cityscapes [26] and Camvid [27] with fewer parameters and lower computational costs than state-of-the-art methods.
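The sequential horizontal-then-vertical adjustment in contribution (2) can be sketched on a toy single-channel map. The product similarity and softmax weighting below are illustrative assumptions, not the paper's exact SPCC formulation:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def horizontal_adjust(x):
    """Re-weight each pixel of an (H, W) map by attending along its row."""
    # sim[h, w, v]: similarity between pixel (h, w) and pixel (h, v) in the same row
    sim = np.einsum('hw,hv->hwv', x, x)
    att = softmax(sim, axis=2)              # weights over positions in the row
    return np.einsum('hwv,hv->hw', att, x)  # weighted mix of row pixels

def vertical_adjust(x):
    """Same operation along columns, via transpose."""
    return horizontal_adjust(x.T).T

def spcc_sketch(x):
    """Horizontal then vertical adjustment, applied sequentially."""
    return vertical_adjust(horizontal_adjust(x))

x = np.random.rand(4, 6)   # toy single-channel feature map
out = spcc_sketch(x)
print(out.shape)  # (4, 6)
```

Applying the two directional passes in sequence lets every output pixel aggregate information from its whole row and column at far lower cost than dense all-pairs spatial attention.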
Related work
In this section, we briefly review the current research state of real-time semantic segmentation and then introduce the attention mechanisms applied in computer vision tasks.
Network architecture
In this section, firstly, the DDF module that integrates two levels of feature maps in the decoder is proposed. Then the SPCC block is described as the basic unit of the pixel value adjustment. Finally, a real-time semantic segmentation network based on the above two modules, named LSPANet, is introduced.
Datasets
We experiment on two popular benchmarks: Cityscapes and Camvid, with ablation experiments conducted on Cityscapes. Only the 5000 finely annotated images of Cityscapes are used, divided into training, validation, and test sets in a ratio of approximately 6:1:3. The Camvid dataset includes 701 images at a resolution of 720 × 960 pixels, of which 367, 233, and 101 are used for training, validation, and testing, respectively.
Experiment details
All experiments are
Conclusion
In this paper, a local spatial pixel adjustment network is proposed for real-time semantic segmentation. The presented network is mainly composed of two core structures: the dual-branch decoding fusion (DDF) module and the spatial pixel cross-correlation (SPCC) block. Among them, the DDF module can simultaneously learn the low-level and high-level feature maps to maximize the capture of boundary and detail information while preserving the integrity of semantic information at the
CRediT authorship contribution statement
Cunjun Xiao: Conceptualization, Methodology, Writing – original draft. Xingjun Hao: Validation, Writing – review & editing. Haibin Li: Conceptualization, Writing – review & editing. Yaqian Li: Supervision, Writing – review & editing. Wenming Zhang: Supervision, Investigation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under grant 62106214, and the Natural Science Foundation of Hebei Province under grant F201920311.
References (46)
- MFENet: multi-level feature enhancement network for real-time semantic segmentation, Neurocomputing, 2020.
- DFFNet: an IoT-perceptive dual feature fusion network for general real-time semantic segmentation, Inf. Sci., 2021.
- Real-time semantic segmentation with weighted factorized-depthwise convolution, Image Vis. Comput., 2021.
- Aerial-BiSeNet: a real-time semantic segmentation network for high resolution aerial imagery, Chin. J. Aeronaut., 2021.
- Pyramid scene parsing network, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell., 2018.
- RefineNet: multipath refinement networks with identity mappings for high resolution semantic segmentation, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- U-Net: convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
- ENet: a deep neural network architecture for real-time semantic segmentation, arXiv, 2016.
- ICNet for real-time semantic segmentation on high-resolution images, European Conference on Computer Vision.
- DFANet: deep feature aggregation for real-time semantic segmentation, IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Design of real-time semantic segmentation decoder for automated driving, arXiv.
- Fast-SCNN: fast semantic segmentation network, arXiv.
- Real-time high-performance semantic image segmentation of urban street scenes, IEEE Trans. Intell. Transp. Syst.
- CANet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning, IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images, IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Rethink dilated convolution for real-time semantic segmentation, arXiv.
- MSSA-Net: multi-scale self-attention network for breast ultrasound image segmentation.
- Contextual attention refinement network for real-time semantic segmentation, IEEE Access.
- Depth-wise asymmetric bottleneck with point-wise aggregation decoder for real-time semantic segmentation in urban scenes, IEEE Access.
- Feature pyramid encoding network for real-time semantic segmentation, arXiv.
- AGLNet: towards real-time semantic segmentation of self-driving images via attention-guided lightweight network, Appl. Soft Comput.