A scene segmentation algorithm combining the body and the edge of the object
Introduction
Scene segmentation assigns a category label to each pixel in an image based on semantic information. It is a central topic in current computer vision research and is of great significance for applications in real-world scenes. At present, scene segmentation is widely used in autonomous driving (Feng et al., 2020), indoor navigation (Wang et al., 2018), video surveillance (Franklin and Dabbagol, 2020), and many other fields.
Although fully convolutional neural networks have achieved great success in semantic segmentation, accurate segmentation and target discrimination in complex scenes remain an important research direction. On the one hand, there is noise inside the targets to be segmented, so the internal consistency of an object is difficult to maintain, and object categories are easily misclassified when global context information is lacking (Zhou et al., 2019). On the other hand, while deep networks capture high-level semantic information, down-sampling operations lose much of the detail at object edges (Chen et al., 2019), making it difficult to segment small targets and object boundaries. Therefore, effectively capturing the global and local context among the objects in an image, so as to reduce category misclassification while maintaining the internal consistency of objects, and improving the performance of edge segmentation have become key points in scene segmentation research.
In response to the first problem, modules such as pyramid pooling and atrous convolution have been proposed to expand the receptive field while capturing multi-scale features. For the second problem, cascading (Pang et al., 2019) or concatenating (Zhao et al., 2018) high-level and low-level features is a common way to make deep features carry strong semantic information while retaining rich object detail. There is no doubt that these methods improve semantic segmentation performance. However, although multi-scale feature extraction helps to capture objects of different scales, it cannot fully exploit the relationships between objects and between objects and the global image, and it pays little attention to the consistency of pixels inside an object. In addition, common low-level and high-level feature fusion methods lack the interactive use of the body and boundary information of objects, which is important for improving edge segmentation performance (Takikawa et al., 2019). As can be seen from Fig. 1, the edges obtained from the label image carry both edge-location and edge-category information.
In this article, building on the idea of semantic flow in scene segmentation (Li et al., 2020), we design a body and edge joint network, abbreviated BEJNet. After high-level features are extracted by a deep ResNet backbone, the consistency of pixel values inside an object is maintained using object flow information; we believe that a saliency feature map with consistent pixels within the object body is more conducive to learning context. Based on the designed body context feature capture module, object categories can be identified accurately and misclassification reduced. On the other hand, edges containing semantic information, obtained by subtracting the body features from the high-level features, are combined with low-level edge features guided by global features to better capture edge cues. After the body feature map, guided by local and global context, is connected with the edge map produced by the edge attention module, the final segmentation result is more accurate. The proposed modules are lightweight and can be migrated to any network structure with simple modifications to form an end-to-end semantic segmentation network. We conduct ablation experiments on the FCN structure with the Cityscapes training and validation sets, and further experiment on multiple scene datasets and multiple classic network structures. The method achieves 80.52% mIoU on the Cityscapes validation set, 71.4% on CamVid, and 56.53% on the KITTI test set.
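The body-edge decoupling described above can be illustrated with a minimal PyTorch sketch. The module, variable names, and layer choices below are our own assumptions for illustration, not the authors' implementation: a learned flow field warps the high-level feature toward object interiors to form a smooth "body" part, the residual serves as the "edge" part, and the two are fused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BodyEdgeDecouple(nn.Module):
    """Illustrative sketch: split a high-level feature map into a smooth
    'body' part (via a learned flow-field warp) and an 'edge' part
    (the residual), then fuse both for the final prediction."""
    def __init__(self, channels):
        super().__init__()
        # Predict a 2-channel flow field from the input feature map.
        self.flow = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def warp(self, x, flow):
        # Build a sampling grid in normalized [-1, 1] coordinates,
        # offset by the predicted flow.
        n, _, h, w = x.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        offset = flow.permute(0, 2, 3, 1) / torch.tensor([w, h], dtype=x.dtype)
        return F.grid_sample(x, grid + offset, align_corners=True)

    def forward(self, x):
        # Body: feature warped toward object interiors by the flow field.
        body = self.warp(x, self.flow(x))
        # Edge: the residual the smooth body part leaves behind.
        edge = x - body
        out = self.fuse(torch.cat((body, edge), dim=1))
        return out, body, edge

feat = torch.randn(1, 64, 32, 64)
out, body, edge = BodyEdgeDecouple(64)(feat)
print(out.shape)  # torch.Size([1, 64, 32, 64])
```

By construction, the body and edge parts sum back to the original feature, so the decomposition loses no information while letting each branch specialize.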
The main contributions of this work are:
- This article proposes the BEJNet network structure, which combines the body and the edge of the object for better feature extraction. On the basis of the semantic flow feature alignment module, a U-shaped body context information extraction module with residual connections is designed to comprehensively use the local and global context of the object body while keeping the flow model training stable;
- This article proposes an edge attention module, which uses high-level information to generate edge features containing semantic information and combines them with low-level edge features guided by a global pooling module to refine the edge features of objects, thereby improving the segmentation of object boundaries;
- The proposed method is evaluated on many classic network structures, including FCN, PSPNet, DeepLabv3+ and the recent SFNet, and further improves the mIoU of semantic segmentation with only a small number of extra parameters. In addition, experiments on the classic scene datasets Cityscapes, CamVid and KITTI all achieve good results.
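The second contribution, the edge attention module, can be sketched as follows. This is a hedged interpretation, not the authors' code: all class, layer, and tensor names are assumptions. A channel gate computed by global pooling of the high-level semantic edge feature reweights the low-level edge feature, and the two are then fused at the low-level resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAttention(nn.Module):
    """Sketch of the edge-attention idea: low-level edge features are
    reweighted by a global-pooling gate computed from the high-level
    semantic edge map, then the two streams are fused."""
    def __init__(self, low_c, high_c):
        super().__init__()
        # Channel gate from globally pooled high-level context.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_c, low_c, kernel_size=1),
            nn.Sigmoid())
        self.fuse = nn.Conv2d(low_c + high_c, high_c, kernel_size=3, padding=1)

    def forward(self, low, semantic_edge):
        # Guide low-level edges with global context, then upsample the
        # semantic edge map to the low-level resolution and fuse.
        low = low * self.gate(semantic_edge)
        sem = F.interpolate(semantic_edge, size=low.shape[2:],
                            mode="bilinear", align_corners=False)
        return self.fuse(torch.cat((low, sem), dim=1))

low = torch.randn(1, 32, 64, 64)        # low-level edge feature
sem_edge = torch.randn(1, 64, 16, 16)   # high-level semantic edge feature
refined = EdgeAttention(32, 64)(low, sem_edge)
print(refined.shape)  # torch.Size([1, 64, 64, 64])
```

The gating keeps the module lightweight: the only extra cost over plain concatenation is a 1×1 convolution on a globally pooled vector.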
Related works
At present, classic scene segmentation methods are generally based on an FCN-style end-to-end encoder-decoder structure. On this basis, methods such as multi-scale feature fusion and various attention mechanisms are introduced to improve semantic segmentation results.
Encoder-decoder structure: A lightweight asymmetric encoder-decoder semantic segmentation structure is proposed by Hong et al. (2021) which uses atrous convolution and dense connections to maintain a larger receptive field
Proposed method
The overall framework of the proposed method is displayed in Fig. 2. First, the original image is fed into a backbone, represented by ResNet, to generate feature maps at different stages carrying different information. To speed up inference, this article uses a deep ResNet-50 as the backbone. Unlike the classic ResNet-50, atrous convolution is used in the network structure to further expand the receptive field. Compared with the original input image, the size of
Experimental results
This paper conducts experiments on the representative Cityscapes scene segmentation dataset (Cordts et al., 2016), which contains finely and coarsely annotated parts. The finely annotated part consists of 5000 city images of size 1024 × 2048, covering 19 recorded categories including pedestrians, vehicles, etc. In the experiments, 2975 images are used for training, and 500 and 1525 images are allocated for validation and testing. Considering
Conclusion
In this article, we design a joint object body and edge network structure to enhance the segmentation of scene images. On the basis of a deep ResNet-50, on the one hand, the flow structure is used to maintain the internal consistency of objects, and the body context module is designed to take advantage of local and global context to reduce the misclassification of similar objects; on the other hand, the edge attention module, where the edges containing semantic information
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by Hunan Provincial Natural Science Foundation (2020JJ4340, 2020JJ4343, 2020JJ5218), by the Scientific Research Fund of Education Department of Hunan Province (19B245, 20B266), by Hunan Graduate Education Innovation Project and Professional Ability Improvement Project (CX20201114), by the Hunan Emergency Communication Engineering Technology Research Center (2018TP2022), by the Engineering Research Center on 3D Reconstruction and Intelligent Application Technology of
References (45)
- et al. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters (2009).
- et al. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks (2020).
- et al. GPNet: Gated pyramid network for semantic segmentation. Pattern Recognition (2021).
- et al. Liver tumor segmentation in CT scans using modified SegNet. Sensors (2020).
- Salient object detection by contextual refinement.
- et al. GCNet: Non-local networks meet squeeze-excitation networks and beyond.
- et al. Dual path networks. arXiv preprint (2017).
- et al. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
- et al. Encoder-decoder with atrous separable convolution for semantic image segmentation.
- et al. Residual pyramid learning for single-shot semantic segmentation. IEEE Transactions on Intelligent Transportation Systems (2019).
- Cordts et al. The Cityscapes dataset for semantic urban scene understanding (2016).
- Feng et al. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Transactions on Intelligent Transportation Systems (2020).
- Franklin and Dabbagol. Anomaly detection in videos for video surveillance applications using neural networks (2020).
- Adaptive context network for scene parsing.
- Dual attention network for scene segmentation.
- Are we ready for autonomous driving? The KITTI vision benchmark suite.
- Hong et al. LAENet: Light-weight asymmetric encoder-decoder network for semantic segmentation. Journal of Physics: Conference Series (2021).
- Efficient uncertainty estimation for semantic segmentation in videos.
- Panoptic feature pyramid networks.
- Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance.
- DFANet: Deep feature aggregation for real-time semantic segmentation.
Xianfeng Ou received the B.S. degree in electronic information science and technology and the M.S. degree in communication and information system from Xinjiang University, Urumchi, China, in 2006 and 2009, respectively. In 2015, he received his Ph.D. degree in communication and information system from Sichuan University, Chengdu, China. He was a visiting researcher at the Internet Media Group, Politecnico di Torino, Turin, Italy, from Jan. to Apr. 2014, working on distributed video coding and transmission. His main research interests include machine vision and artificial intelligence, hyperspectral image change detection, and intelligent optimization.
Hanpu Wang received his B.S. degree in computer science engineering from Nanjing Institute of Technology in 2019. He is now a postgraduate student in the Machine Vision and Artificial Intelligence Research Center of Hunan Institute of Science and Technology. His main research interests include image processing and semantic segmentation.
Wujing Li received the B.E. degree in software engineering and the Ph.D. degree in computer science and technology from Sichuan University, Chengdu, China, in 2007 and 2012, respectively. His main research interests include image denoising, image restoration, and image enhancement.
Guoyun Zhang is a professor with Hunan Institute of Science and Technology. He received the B.S. degree in automation from Xiangtan University in 1993. He then received his M.S. and Ph.D. degrees in control theory and control engineering from Hunan University, Changsha, China, in 2000 and 2003, respectively. He was a visiting researcher at George Fox University from Jan. to Jun. 2014. His research interests include image processing, computer vision, and pattern recognition.
Siyuan Chen holds a Ph.D. in civil engineering from University College Dublin as a Marie Curie Fellow and an M.Sc. degree in mechanical and aerospace engineering from Syracuse University. Before receiving his Ph.D., he worked as a research assistant at Columbia University in numerical modeling, and prior to that at Tsinghua University as a project manager. He received his bachelor's degree in mechanical engineering from the Beijing Institute of Petrochemical Technology. His research experience includes UAV inspection, 3D data analysis, fluid analysis, mechanical design, and 3D printing.