Real-time semantic segmentation with local spatial pixel adjustment

https://doi.org/10.1016/j.imavis.2022.104470

Highlights

  • Present a dual-branch decoding fusion module to fuse multiple kinds of information.

  • Propose a spatial pixel cross-correlation block to capture relationships in local space.

  • Design a local spatial pixel adjustment network for real-time semantic segmentation.

Abstract

Research on semantic segmentation networks has achieved significant breakthroughs recently. However, most methods have difficulty utilizing the information generated at each stage, which results in pixel value dislocation and blurred boundaries for small-scale objects. To overcome these challenges, a local spatial pixel adjustment network (LSPANet) is proposed in this paper, consisting mainly of a dual-branch decoding fusion (DDF) module and a spatial pixel cross-correlation (SPCC) block. Specifically, the DDF module takes high-level and low-level feature maps from different stages as input and gradually eliminates the discrepancies between their information so as to fuse the variety of information extracted in the encoder stage. The SPCC block adopts a horizontal spatial pixel adjustment (HSPA) module and a vertical spatial pixel adjustment (VSPA) module to capture the relationships between pixel values in the local horizontal and vertical spaces respectively, and then assigns importance to all values based on these relationships. LSPANet is evaluated on the Cityscapes and CamVid datasets. The experimental results show that our network achieves 77.1% mIoU with 2M parameters on the challenging Cityscapes dataset, with an inference speed exceeding 30 FPS on a single RTX 2080 Ti GPU.

Introduction

Semantic segmentation achieves pixel-wise classification by assigning a label to each pixel of an image. Some networks [[1], [2], [3], [4], [5]] with excellent accuracy have heavy computational costs and cannot be applied to real-time tasks such as autonomous driving. To achieve a better balance between accuracy and inference speed with fewer parameters, ENet [6] and ICNet [7] employ lightweight feature extraction modules to reduce network complexity and improve real-time performance. However, these methods often lose substantial feature information and do not meet accuracy requirements. Therefore, some researchers focus on the design of efficient fusion modules. The methods in [[8], [9], [10]] use simple fusion operations such as up-sampling and skip connections to merge multi-level feature maps; regrettably, this is insufficient for learning multi-scale information during fusion. G. Dong [11] proposes a feature fusion network and uses dilated convolution to eliminate differences in the pixel positions of feature maps. CANet [12] fuses a high-resolution branch for effective spatial detail with a context branch built from global aggregation and local distribution blocks. In addition, [13,14] refine the information contained in feature maps at different levels and selectively fuse different kinds of information. RegSeg [15] adopts a decoder that preserves more local information by fusing three feature-map scales from the encoder. However, while these methods dramatically improve accuracy, they also bring more parameters and computation. To reduce the complexity of fusion, DFFNet [16] proposes a multi-level feature fusion module, which enhances the semantic consistency between feature maps through two attention refinement blocks to realize the joint learning of spatial and semantic information. Meanwhile, WFDCNet [17] fuses partial spatial detail and rich semantic information in fast decoding blocks to improve segmentation accuracy. Nevertheless, the boundary and detail information in the shallow layers is not adequately learned, which inevitably results in information loss.

For extracting fine-grained information, feature fusion alone is not enough; it is also essential to assign importance to each pixel value in the feature maps. To distinguish the roles of pixel values across channels, MSSANet [18] uses a channel attention mechanism to adjust the feature maps in the encoder and decoder respectively. Similarly, [11] proposes convolutional attention modules and sequentially inserts them into four different blocks in the decoder stage to capture important information. However, these approaches lengthen learning time and do not address the interaction between feature maps at different levels and channels. Considering the differences in spatial and semantic information between low-level and high-level feature maps, [19] proposes a channel-attention-based feature fusion module to guide the channel adjustment of feature maps at different levels. CARNet [20] uses the merged feature map to generate attention and semantic vectors through convolution and nonlinear operations, where the attention vector calculates fusion weights for multiple feature maps to promote fusion and the semantic vector constructs a semantic context loss for regularization during training. DABNet [21] uses channel weights generated at a lower level to guide the up-sampling of higher-level feature maps. However, these methods are of limited help in solving the spatial pixel discrepancy problem. Therefore, [22,23] adjust pixel values in both the spatial and channel dimensions: the former makes the spatial and channel attention mechanisms guide each other's learning, while the latter adjusts the importance of pixel values in the spatial and channel dimensions sequentially. Moreover, [24,25] use attention mechanisms over different dimensions to promote interaction between different kinds of contextual information, but dual attention mechanisms working together in the same stage do not improve accuracy well. WFDCNet [17] adjusts the channels at each stage of the encoder and achieves a high accuracy of 73.7%. However, local information in the cross space is not taken seriously, which causes pixel value dislocation.

The studies above have contributed to real-time semantic segmentation, but problems remain in handling small objects and pixel value dislocation. In this paper, a lightweight local spatial pixel adjustment network (LSPANet) is proposed. LSPANet is an encoder-decoder structure in which WFDCNet [17] is employed as the encoder and the feature extraction stage is improved. A dual-branch decoding fusion (DDF) module and a spatial pixel cross-correlation (SPCC) block are proposed for the decoder. The DDF module fuses rich semantic information and sufficient spatial detail to alleviate the boundary ambiguity of small targets. The SPCC block relieves pixel value misalignment by sequentially establishing relationships between pixel values in the horizontal and vertical directions. The contributions are summarized in the following aspects:

(1) A dual-branch decoding fusion module is presented to fuse feature maps from different stages and eliminate semantic and pixel-position differences with an efficient calculation, refining a variety of important information from deep and shallow features.

(2) A spatial pixel cross-correlation block is adopted to establish spatial relationships between pixel values, first in the horizontal and then in the vertical direction, through the horizontal and vertical spatial pixel adjustment modules; pixel values with learning errors are thereby corrected and adjusted toward the optimum (see the illustrative sketch after this list).

(3) A novel lightweight local spatial pixel adjustment network is proposed. LSPANet achieves competitive performance on Cityscapes [26] and CamVid [27] with fewer parameters and lower computational cost than the state of the art.
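Since the exact formulation of HSPA and VSPA is not reproduced in this excerpt, the following is a minimal PyTorch sketch of one plausible reading of contribution (2): each module gathers local context along a single spatial axis with a 1×k (or k×1) depthwise convolution and turns it into per-pixel importance weights. The kernel size k, the depthwise-convolution choice, and the sigmoid gating are illustrative assumptions, not the authors' design.

```python
# Illustrative sketch only: one plausible reading of the SPCC block
# (HSPA followed by VSPA). The kernel size k and the sigmoid gating
# are assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn


class DirectionalPixelAdjust(nn.Module):
    """Reweights each pixel using context from a local 1-D neighborhood.

    A 1xk (horizontal) or kx1 (vertical) depthwise convolution gathers
    local cross-correlation along one spatial axis; a sigmoid turns the
    result into per-pixel importance weights.
    """

    def __init__(self, channels: int, k: int = 7, horizontal: bool = True):
        super().__init__()
        kernel = (1, k) if horizontal else (k, 1)
        pad = (0, k // 2) if horizontal else (k // 2, 0)
        self.context = nn.Conv2d(channels, channels, kernel,
                                 padding=pad, groups=channels, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(self.context(x))


class SPCC(nn.Module):
    """Spatial pixel cross-correlation: HSPA then VSPA, applied sequentially."""

    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        self.hspa = DirectionalPixelAdjust(channels, k, horizontal=True)
        self.vspa = DirectionalPixelAdjust(channels, k, horizontal=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.vspa(self.hspa(x))


if __name__ == "__main__":
    feat = torch.randn(1, 64, 64, 128)  # N, C, H, W
    print(SPCC(64)(feat).shape)         # torch.Size([1, 64, 64, 128])
```

The sequential composition (the horizontal pass feeding the vertical pass) mirrors the paper's description of correcting pixel values along one axis at a time rather than with a full 2-D attention map, which keeps the cost linear in the neighborhood size.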

Section snippets

Related work

In this section, we briefly review the current state of research on real-time semantic segmentation and then introduce the attention mechanisms applied in computer vision tasks.

Network architecture

In this section, the DDF module that integrates two levels of feature maps in the decoder is proposed first. Then the SPCC block is described as the basic unit of pixel value adjustment. Finally, a real-time semantic segmentation network based on these two modules, named LSPANet, is introduced.
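The full module definitions are truncated in this excerpt, so the sketch below only illustrates how a dual-branch fusion of a low-level and a high-level feature map could be wired in PyTorch. The 1×1 projections, the additive branch, and the gating branch are hypothetical stand-ins for the DDF internals, not the paper's architecture.

```python
# Minimal sketch of a generic dual-branch decoding fusion, assuming
# (hypothetically) one additive refinement branch and one gating branch;
# the real DDF module is defined in the paper and only approximated here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DDF(nn.Module):
    """Aligns a high-level map to the low-level resolution, then fuses
    the two maps along two parallel branches."""

    def __init__(self, low_ch: int, high_ch: int, out_ch: int):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, out_ch, 1, bias=False)
        self.high_proj = nn.Conv2d(high_ch, out_ch, 1, bias=False)
        # Branch 1: additive fusion refined by a 3x3 convolution.
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))
        # Branch 2: semantic features gate the spatial details.
        self.gate = nn.Sequential(nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid())

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Up-sample the high-level map to remove the resolution mismatch.
        high = F.interpolate(self.high_proj(high), size=low.shape[2:],
                             mode="bilinear", align_corners=False)
        low = self.low_proj(low)
        return self.refine(low + high) + low * self.gate(high)


if __name__ == "__main__":
    low = torch.randn(1, 32, 64, 128)   # earlier stage, higher resolution
    high = torch.randn(1, 128, 16, 32)  # later stage, lower resolution
    print(DDF(32, 128, 64)(low, high).shape)  # torch.Size([1, 64, 64, 128])
```

Up-sampling the high-level map before fusion is one common way to remove the resolution mismatch between stages; whether DDF resolves the pixel-position discrepancy this way or differently is not specified in this excerpt.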

Datasets

We experiment on two popular benchmarks, Cityscapes and CamVid, and conduct the ablation experiments on Cityscapes. In the experiments, only the 5000 fine-annotated images from Cityscapes are used, and the dataset is divided into training, validation, and test sets in a ratio of approximately 6:1:3. The CamVid dataset includes 701 images with a resolution of 720 × 960 pixels, of which 367, 233, and 101 are used for training, validation, and testing, respectively.
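As a quick sanity check on the quoted splits, the snippet below assumes the standard Cityscapes fine-annotation split of 2975/500/1525 images (an assumption; the text only states the 5000 total and the ~6:1:3 ratio), alongside the CamVid numbers given above.

```python
# Sanity-check the dataset splits quoted in the text. The Cityscapes
# per-split counts assume the standard fine-annotation split (an
# assumption in this sketch); CamVid counts are as stated above.
splits = {
    "Cityscapes": (2975, 500, 1525),  # sums to 5000 fine-annotated images
    "CamVid": (367, 233, 101),        # 701 images at 720 x 960 pixels
}
for name, (train, val, test) in splits.items():
    total = train + val + test
    print(f"{name}: total={total}, "
          f"ratio {train / val:.2f} : 1 : {test / val:.2f}")
# Cityscapes: total=5000, ratio 5.95 : 1 : 3.05  (i.e. roughly 6:1:3)
# CamVid: total=701, ratio 1.58 : 1 : 0.43
```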

Experiment details

All experiments are

Conclusion

In this paper, a local spatial pixel adjustment network is proposed for real-time semantic segmentation. The presented network is mainly composed of two core structures: the dual-branch decoding fusion (DDF) module and the spatial pixel cross-correlation (SPCC) block. Among them, the DDF module can simultaneously learn the low-level and high-level feature maps to maximize the capture of boundary and detail information while preserving the integrity of semantic information at the

CRediT authorship contribution statement

Cunjun Xiao: Conceptualization, Methodology, Writing – original draft. Xingjun Hao: Validation, Writing – review & editing. Haibin Li: Conceptualization, Writing – review & editing. Yaqian Li: Supervision, Writing – review & editing. Wenming Zhang: Supervision, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant 62106214, and the Natural Science Foundation of Hebei Province under grant F201920311.

References (46)

  • H. Zhao et al., ICNet for real-time semantic segmentation on high-resolution images, European Conference on Computer Vision (2018)

  • H. Li et al., DFANet: deep feature aggregation for real-time semantic segmentation, IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

  • A. Das et al., Design of real-time semantic segmentation decoder for automated driving, arXiv (2019)

  • R. Poudel et al., Fast-SCNN: fast semantic segmentation network, arXiv (2019)

  • G. Dong et al., Real-time high-performance semantic image segmentation of urban street scenes, IEEE Trans. Intell. Transp. Syst. (2021)

  • C. Zhang et al., CANet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning, IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

  • M. Oršić et al., In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images, IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

  • R. Gao, Rethink dilated convolution for real-time semantic segmentation, arXiv (2021)

  • M. Xu et al., MSSA-Net: multi-scale self-attention network for breast ultrasound image segmentation

  • S. Hao et al., Contextual attention refinement network for real-time semantic segmentation, IEEE Access (2020)

  • G. Li et al., Depth-wise asymmetric bottleneck with point-wise aggregation decoder for real-time semantic segmentation in urban scenes, IEEE Access (2020)

  • M. Liu et al., Feature pyramid encoding network for real-time semantic segmentation, arXiv (2019)

  • Z. Quan et al., AGLNet: towards real-time semantic segmentation of self-driving images via attention-guided lightweight network, Appl. Soft Comput. (2020)