A scene segmentation algorithm combining the body and the edge of the object
Introduction
Scene segmentation assigns a category label to each pixel in an image based on semantic information. It is a central topic in current computer vision research and is of great significance for applications in real-world scenes. At present, scene segmentation is widely used in autonomous driving (Feng et al., 2020), indoor navigation (Wang et al., 2018), video surveillance (Franklin and Dabbagol, 2020), and many other fields.
Although fully convolutional neural networks have achieved great success in semantic segmentation, accurate segmentation and target discrimination in complex scenes remain an important research direction. On the one hand, there is noise inside the targets to be segmented, so the internal consistency of an object is difficult to maintain, and object categories are easily misclassified when global context information is lacking (Zhou et al., 2019). On the other hand, while deep networks capture high-level semantic information, down-sampling operations lose much of the detail at object edges (Chen et al., 2019), making it difficult to segment small targets and object boundaries. Therefore, effectively capturing the global and local context among the objects in an image, so as to reduce category misclassification while maintaining the internal consistency of objects, and improving the performance of edge segmentation have become key points in scene segmentation research.
In response to the first problem, modules such as pyramid pooling and atrous convolution have been proposed to expand the receptive field while capturing multi-scale features. For the second problem, cascading (Pang et al., 2019) or concatenating (Zhao et al., 2018) high-level and low-level features is a common way to make deep features carry strong semantic information while retaining rich object detail. There is no doubt that these methods improve semantic segmentation performance. However, although multi-scale feature extraction helps to capture objects of different scales, it cannot fully exploit the relationships between objects and between objects and the global image, and it pays little attention to the consistency of pixels inside an object. In addition, common low-level and high-level feature fusion methods lack the interactive use of the body and boundary information of objects, which is important for improving edge segmentation performance (Takikawa et al., 2019). As can be seen from Fig. 1, the edges obtained from the label image carry both edge-location and edge-category information.
In this article, building on the idea of semantic flow in scene segmentation (Li et al., 2020), we design a body and edge joint network, abbreviated BEJNet. After high-level features are extracted by a deep ResNet backbone, the consistency of pixel values inside an object is maintained using object flow information; we believe that a saliency feature map with consistent pixels within the object body is more conducive to learning context. Based on the designed body context feature capture module, object categories can be identified accurately and misclassification reduced. On the other hand, edges containing semantic information, obtained by subtracting the body features from the high-level features, are combined with low-level edge features guided by global features to better capture edge cues. After the body feature map, guided by local and global context, is connected with the edge map produced by the edge attention module, the final segmentation result is more accurate. The proposed modules are lightweight and can be migrated to any network structure with simple modifications to form an end-to-end semantic segmentation network. We conduct ablation experiments on the FCN structure with the Cityscapes training and validation sets, and further experiment on multiple scene datasets and multiple classic network structures. The method achieves 80.52% mIoU on the Cityscapes validation set, 71.4% on CamVid, and 56.53% on the KITTI test set.
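The body-edge decoupling described above can be illustrated with a minimal PyTorch sketch. The module, variable names, and layer choices below are our own assumptions for illustration, not the authors' implementation: a learned flow field warps the high-level feature toward object interiors to form a smooth "body" part, the residual serves as the "edge" part, and the two are fused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BodyEdgeDecouple(nn.Module):
    """Illustrative sketch: split a high-level feature map into a smooth
    'body' part (via a learned flow-field warp) and an 'edge' part
    (the residual), then fuse both for the final prediction."""
    def __init__(self, channels):
        super().__init__()
        # Predict a 2-channel flow field from the input feature map.
        self.flow = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def warp(self, x, flow):
        # Build a sampling grid in normalized [-1, 1] coordinates,
        # offset by the predicted flow.
        n, _, h, w = x.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        offset = flow.permute(0, 2, 3, 1) / torch.tensor([w, h], dtype=x.dtype)
        return F.grid_sample(x, grid + offset, align_corners=True)

    def forward(self, x):
        # Body: feature warped toward object interiors by the flow field.
        body = self.warp(x, self.flow(x))
        # Edge: the residual the smooth body part leaves behind.
        edge = x - body
        out = self.fuse(torch.cat((body, edge), dim=1))
        return out, body, edge

feat = torch.randn(1, 64, 32, 64)
out, body, edge = BodyEdgeDecouple(64)(feat)
print(out.shape)  # torch.Size([1, 64, 32, 64])
```

By construction, the body and edge parts sum back to the original feature, so the decomposition loses no information while letting each branch specialize.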
The main contributions of this work are:
- This article proposes the BEJNet network structure, which combines the body and the edge of the object for better feature extraction. On the basis of the semantic flow feature alignment module, a U-shaped body context information extraction module with residual connections is designed to comprehensively use the local and global context of the object body while keeping the flow model training stable;
- This article proposes an edge attention module, which uses high-level information to generate edge features containing semantic information and combines them with low-level edge features guided by a global pooling module to refine the edge features of objects, thereby improving the segmentation of object boundaries;
- The proposed method is evaluated on many classic network structures, including FCN, PSPNet, DeepLabv3+ and the recent SFNet, and further improves the mIoU of semantic segmentation with only a small number of extra parameters. In addition, experiments on the classic scene datasets Cityscapes, CamVid and KITTI all achieve good results.
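The second contribution, the edge attention module, can be sketched as follows. This is a hedged interpretation, not the authors' code: all class, layer, and tensor names are assumptions. A channel gate computed by global pooling of the high-level semantic edge feature reweights the low-level edge feature, and the two are then fused at the low-level resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAttention(nn.Module):
    """Sketch of the edge-attention idea: low-level edge features are
    reweighted by a global-pooling gate computed from the high-level
    semantic edge map, then the two streams are fused."""
    def __init__(self, low_c, high_c):
        super().__init__()
        # Channel gate from globally pooled high-level context.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_c, low_c, kernel_size=1),
            nn.Sigmoid())
        self.fuse = nn.Conv2d(low_c + high_c, high_c, kernel_size=3, padding=1)

    def forward(self, low, semantic_edge):
        # Guide low-level edges with global context, then upsample the
        # semantic edge map to the low-level resolution and fuse.
        low = low * self.gate(semantic_edge)
        sem = F.interpolate(semantic_edge, size=low.shape[2:],
                            mode="bilinear", align_corners=False)
        return self.fuse(torch.cat((low, sem), dim=1))

low = torch.randn(1, 32, 64, 64)        # low-level edge feature
sem_edge = torch.randn(1, 64, 16, 16)   # high-level semantic edge feature
refined = EdgeAttention(32, 64)(low, sem_edge)
print(refined.shape)  # torch.Size([1, 64, 64, 64])
```

The gating keeps the module lightweight: the only extra cost over plain concatenation is a 1×1 convolution on a globally pooled vector.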
Related works
At present, classic scene segmentation methods are generally based on an FCN-style end-to-end encoder-decoder structure. On this basis, methods such as multi-scale feature fusion and various attention mechanisms are introduced to improve semantic segmentation results.
Encoder-decoder structure: A lightweight asymmetric encoder-decoder semantic segmentation structure is proposed by Hong et al. (2021) which uses atrous convolution and dense connections to maintain a larger receptive field
Proposed method
The overall framework of the proposed method is displayed in Fig. 2. First, the original image is fed into a backbone, represented by ResNet, to generate feature maps at different stages carrying different information. To speed up inference, this article uses a deep ResNet-50 as the backbone. Unlike the classic ResNet-50, atrous convolution is used in the network structure to further expand the receptive field. Compared with the original input image, the size of
Experimental results
This paper conducts experiments on the representative Cityscapes scene segmentation dataset (Cordts et al., 2016), which contains finely and coarsely annotated parts. The finely annotated part consists of 5000 city images of size 1024 × 2048, covering 19 recorded categories including pedestrians, vehicles, etc. In the experiments, 2975 images are used for training, and 500 and 1525 images are allocated for validation and testing. Considering
Conclusion
In this article, we design a joint object body and edge network structure to enhance the segmentation of scene images. On the basis of a deep ResNet-50, on the one hand, the flow structure is used to maintain the internal consistency of objects, and the body context module is designed to take advantage of local and global context to reduce the misclassification of similar objects; on the other hand, the edge attention module, where the edges containing semantic information
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by Hunan Provincial Natural Science Foundation (2020JJ4340, 2020JJ4343, 2020JJ5218), by the Scientific Research Fund of Education Department of Hunan Province (19B245, 20B266), by Hunan Graduate Education Innovation Project and Professional Ability Improvement Project (CX20201114), by the Hunan Emergency Communication Engineering Technology Research Center (2018TP2022), by the Engineering Research Center on 3D Reconstruction and Intelligent Application Technology of
References (45)
- et al. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters (2009).
- et al. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks (2020).
- et al. GPNet: Gated pyramid network for semantic segmentation. Pattern Recognition (2021).
- et al. Liver tumor segmentation in CT scans using modified SegNet. Sensors (2020).
- Salient object detection by contextual refinement.
- et al. GCNet: Non-local networks meet squeeze-excitation networks and beyond.
- et al. Dual path networks. arXiv preprint (2017).
- et al. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
- et al. Encoder-decoder with atrous separable convolution for semantic image segmentation.
- et al. Residual pyramid learning for single-shot semantic segmentation. IEEE Transactions on Intelligent Transportation Systems (2019).
- Cordts et al. The Cityscapes dataset for semantic urban scene understanding (2016).
- Feng et al. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Transactions on Intelligent Transportation Systems (2020).
- Franklin and Dabbagol. Anomaly detection in videos for video surveillance applications using neural networks (2020).
- Adaptive context network for scene parsing.
- Dual attention network for scene segmentation.
- Are we ready for autonomous driving? The KITTI vision benchmark suite.
- Hong et al. LAENet: Light-weight asymmetric encoder-decoder network for semantic segmentation. Journal of Physics: Conference Series (2021).
- Efficient uncertainty estimation for semantic segmentation in videos.
- Panoptic feature pyramid networks.
- Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance.
- DFANet: Deep feature aggregation for real-time semantic segmentation.
Xianfeng Ou received the B.S. degree in electronic information science and technology and the M.S. degree in communication and information system from Xinjiang University, Urumchi, China, in 2006 and 2009, respectively. In 2015, he received his Ph.D. degree in communication and information system from Sichuan University, Chengdu, China. He was a visiting researcher at the Internet Media Group, Politecnico di Torino, Turin, Italy, from Jan. to Apr. 2014, working on distributed video coding and transmission. His main research interests include machine vision and artificial intelligence, hyperspectral image change detection, and intelligent optimization.
Hanpu Wang received his B.S. degree in computer science engineering from Nanjing Institute of Technology in 2019. He is now a postgraduate student in the Machine Vision and Artificial Intelligence Research Center of Hunan Institute of Science and Technology. His main research interests include image processing and semantic segmentation.
Wujing Li received the B.E. degree in software engineering and the Ph.D. degree in computer science and technology from Sichuan University, Chengdu, China, in 2007 and 2012, respectively. His main research interests include image denoising, image restoration, and image enhancement.
Guoyun Zhang is a professor with Hunan Institute of Science and Technology. He received the B.S. degree in automation from Xiangtan University in 1993. He then received his M.S. and Ph.D. degrees in control theory and control engineering from Hunan University, Changsha, China, in 2000 and 2003, respectively. He was a visiting researcher at George Fox University from Jan. to Jun. 2014. His research interests include image processing, computer vision, and pattern recognition.
Siyuan Chen holds a Ph.D. in civil engineering from University College Dublin as a Marie Curie Fellow and an M.Sc. degree in mechanical and aerospace engineering from Syracuse University. Before receiving his Ph.D., he worked as a research assistant at Columbia University in numerical modeling, and prior to that at Tsinghua University as a project manager. He received his bachelor's degree in mechanical engineering from the Beijing Institute of Petrochemical Technology. His research experience includes UAV inspection, 3D data analysis, fluid analysis, mechanical design, and 3D printing.