A scene segmentation algorithm combining the body and the edge of the object

https://doi.org/10.1016/j.ipm.2021.102840

Highlights

  • This article proposes the BEJNet network structure, which combines the main body and the edge of the object for better feature extraction. On the basis of the semantic flow feature alignment module, a U-shaped body context information extraction module with residual connections is designed to make comprehensive use of the object's local and global body context while keeping flow model training stable.

  • This article proposes an edge attention module that uses high-level information to generate edge features containing semantic information and combines them with low-level edge features guided by the global pooling module to refine the edge features of objects, improving the segmentation of object edges.

  • The proposed method is evaluated on classic network structures such as FCN, PSPNet, DeepLabv3+, and SFNet, improving the mIoU of semantic segmentation with only a small number of additional parameters. In addition, tests on the classic scene datasets Cityscapes, CamVid, and KITTI show that the proposed method achieves good results.

Abstract

Scene segmentation is a challenging task in which convolutional neural networks have achieved very good results. Current scene segmentation methods often ignore the internal consistency of the target object and fail to make full use of global and local context information, which leads to object misclassification. In addition, most previous work has focused on segmenting the main body of the object, while little research has addressed the quality of object edge segmentation. In this article, on the basis of using flow information to maintain body consistency, a context feature extraction module is designed to fully exploit the global and local body context of the target object and to refine the rough feature map of the intermediate stage, thereby reducing the misclassification of the target object. In addition, in the proposed edge attention module, the low-level feature map guided by the global feature is connected with the semantically informed edge feature map obtained in the intermediate stage to recover more accurate edge details. As a result, segmentation quality is improved both in noisy body regions and at edge details. This paper conducts experiments not only on the classic FCN, PSPNet, and DeepLabv3+ mainstream network architectures, but also on the recently proposed real-time SFNet structure, and the improved mIoU on both object bodies and boundaries verifies the effectiveness of the proposed method. Moreover, to demonstrate robustness, we conduct experiments on three complex scene segmentation datasets, Cityscapes, CamVid, and KITTI, obtaining an mIoU of 80.52% on the Cityscapes validation set and 71.4% and 56.53% on the CamVid and KITTI test sets, respectively, which compares favorably with most state-of-the-art methods.

Introduction

Scene segmentation assigns a category label to each pixel of an image based on semantic information. It is a central topic in current vision research and is of considerable significance for real-world scene understanding. At present, scene segmentation is widely used in autonomous driving (Feng et al., 2020), indoor navigation (Wang et al., 2018), video surveillance (Franklin and Dabbagol, 2020), and many other fields.

Although fully convolutional networks have achieved great success in semantic segmentation, accurate segmentation and target discrimination in complex scenes remain an important research direction. On the one hand, noise inside the target to be segmented makes the internal consistency of the object difficult to maintain, and object categories are easily misclassified when global context information is lacking (Zhou et al., 2019); on the other hand, while deep networks capture deep semantic information, down-sampling loses much of the detail along object edges (Chen et al., 2019), making it difficult to segment small targets and object boundaries. For this reason, effectively capturing the global and local context among the objects in an image, so as to reduce category misclassification while maintaining the internal consistency of objects, and improving the performance of object edge segmentation have become key points in scene segmentation research.

In response to the first problem, the pyramid pooling module, atrous convolution, and related techniques have been proposed to expand the receptive field while capturing multi-scale features; for the second problem, cascading (Pang et al., 2019) or connecting (Zhao et al., 2018) high-level and low-level features is a common way to retain the rich detailed information of the object while keeping strong semantics in the deep features. There is no doubt that these methods improve semantic segmentation performance. However, although multi-scale feature extraction helps capture objects of different scales, it cannot make full use of the relationships between objects and between objects and the global image, and it pays little attention to the consistency of the pixels inside an object. In addition, common low-level and high-level feature fusion methods lack the interactive use of the information carried by the body and the boundary of an object, which is important for improving edge segmentation performance (Takikawa et al., 2019). As can be seen from Fig. 1, the edges obtained from the label image carry both whether a pixel is an edge and its edge category.
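To make the pyramid pooling idea above concrete, a minimal NumPy sketch follows: the feature map is average-pooled onto several coarse grids, upsampled back, and concatenated with the original features. The function names (`pyramid_pool`, etc.) are illustrative, not from the paper, and the nearest-neighbour upsampling stands in for the learned interpolation of real implementations.

```python
import numpy as np

def avg_pool_to(fmap, bins):
    """Average-pool a (C, H, W) feature map onto a bins x bins grid."""
    c, h, w = fmap.shape
    out = np.zeros((c, bins, bins))
    for i in range(bins):
        for j in range(bins):
            hs, he = i * h // bins, (i + 1) * h // bins
            ws, we = j * w // bins, (j + 1) * w // bins
            out[:, i, j] = fmap[:, hs:he, ws:we].mean(axis=(1, 2))
    return out

def upsample_nearest(fmap, h, w):
    """Nearest-neighbour upsampling back to (C, h, w)."""
    rows = np.arange(h) * fmap.shape[1] // h
    cols = np.arange(w) * fmap.shape[2] // w
    return fmap[:, rows][:, :, cols]

def pyramid_pool(fmap, bin_sizes=(1, 2, 3, 6)):
    """Concatenate the input with pooled-and-upsampled context branches."""
    c, h, w = fmap.shape
    branches = [fmap]
    for b in bin_sizes:
        branches.append(upsample_nearest(avg_pool_to(fmap, b), h, w))
    return np.concatenate(branches, axis=0)

feats = np.random.rand(8, 12, 12)
out = pyramid_pool(feats)
print(out.shape)  # (40, 12, 12): 8 original channels + 4 pooled branches of 8
```

The 1 × 1 branch carries pure global context (every position sees the image-wide mean), while the finer grids mix in progressively more local context.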

In this article, building on the idea of semantic flow in scene segmentation (Li et al., 2020), we design a body and edge joint network, abbreviated BEJNet. After high-level features are extracted by the deep ResNet backbone, the consistency of pixel values inside an object can be maintained based on the object flow information. We believe that a saliency feature map with uniform pixels within the object body is more conducive to learning context. With the designed body context feature capture module, object categories can be identified accurately and misclassification is reduced. On the other hand, the semantically informed edges obtained by subtracting the body features, and the low-level edge features guided by the global features, are connected to capture edge features better. After the body feature map guided by local and global context is connected with the edge map obtained from the edge attention module, the final semantic segmentation result is more accurate. The proposed modules are lightweight and can be migrated to any network structure with simple modifications to form an end-to-end semantic segmentation network. We run ablation experiments on the Cityscapes training and validation sets with the FCN network structure, and conduct experiments on multiple scene datasets and multiple classic network structures. The method achieves an mIoU of 80.52% on the Cityscapes validation set, 71.4% on CamVid, and 56.53% on the KITTI test set.
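The body/edge decoupling described above, where edges are obtained by subtracting the body features, can be sketched in NumPy under simplifying assumptions: here a box blur stands in for the learned, flow-aligned body branch, so that the subtraction leaves exactly the high-frequency residual. The names (`smooth_body`, `decouple`) are illustrative, not the paper's.

```python
import numpy as np

def smooth_body(feat, k=3):
    """Box-blur stand-in for the learned, flow-aligned body branch:
    it keeps the low-frequency 'body' of a (C, H, W) feature map."""
    c, h, w = feat.shape
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros_like(feat)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = padded[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return out

def decouple(feat):
    """Split features into a smooth body part and a residual edge part,
    so that body + edge reconstructs the input exactly."""
    body = smooth_body(feat)
    edge = feat - body          # high-frequency residual ~ object edges
    return body, edge

feat = np.random.rand(4, 16, 16)
body, edge = decouple(feat)
print(np.allclose(body + edge, feat))  # True: lossless decomposition
```

The point of the decomposition is that each branch can then be supervised and refined separately (body consistency on one side, edge detail on the other) before the parts are recombined.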

The main contributions of this work are:

  • This article proposes the BEJNet network structure, which combines the main body and the edge of the object for better feature extraction. On the basis of the semantic flow feature alignment module, a U-shaped body context information extraction module with residual connections is designed to make comprehensive use of the object's local and global body context while keeping flow model training stable;

  • This article proposes an edge attention module that uses high-level information to generate edge features containing semantic information and combines them with low-level edge features guided by the global pooling module to refine the edge features of objects, improving the segmentation of object edges;

  • The proposed method is evaluated on classic network structures such as FCN, PSPNet, DeepLabv3+, and the recent SFNet, further improving the mIoU of semantic segmentation with only a small number of additional parameters. In addition, experiments on several classic scene datasets, Cityscapes, CamVid, and KITTI, all achieve good results.


Related works

At present, classic scene segmentation methods are generally based on an FCN-style end-to-end encoder-decoder structure. On this basis, methods such as multi-scale feature fusion and various attention mechanisms have been introduced to improve semantic segmentation results.

Encoder-decoder structure: A lightweight asymmetric encoder-decoder semantic segmentation structure is proposed by Hong et al. (2021), which uses atrous convolution and dense connections to maintain a larger receptive field

Proposed method

The overall framework of the proposed method is displayed in Fig. 2. First, the original image is fed into a ResNet backbone to generate feature maps carrying different information at each stage. To speed up inference, this article uses the deep ResNet-50 as the backbone. Unlike the classic ResNet-50, atrous convolution is used in the network structure to further expand the receptive field. Compared with the original input image, the size of
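The effect of the atrous convolution mentioned above, enlarging the receptive field without adding weights or shrinking the output, can be sketched with a single-channel dilated 2-D convolution in plain NumPy. This is a didactic sketch, not the paper's backbone code; `dilated_conv2d` is an assumed name.

```python
import numpy as np

def dilated_conv2d(img, kernel, dilation=2):
    """'Same'-padded 2-D convolution with a dilated (atrous) kernel.
    A k x k kernel with dilation d covers a (d*(k-1)+1)^2 receptive
    field while keeping the same number of weights and output size."""
    kh, kw = kernel.shape
    eff_h = dilation * (kh - 1) + 1          # effective kernel extent
    eff_w = dilation * (kw - 1) + 1
    ph, pw = eff_h // 2, eff_w // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))  # zero padding
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            # sample the padded image at dilated (strided) offsets
            patch = padded[i:i + eff_h:dilation, j:j + eff_w:dilation]
            out[i, j] = (patch * kernel).sum()
    return out

img = np.random.rand(10, 10)
k = np.ones((3, 3)) / 9.0                    # averaging kernel
print(dilated_conv2d(img, k, dilation=2).shape)  # (10, 10): size preserved
```

In a dilated ResNet backbone the same trick replaces the stride of the last stages, so deep features keep higher spatial resolution, which is what makes the later body/edge refinement possible.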

Experimental results

This paper conducts experiments on the representative Cityscapes scene segmentation dataset (Cordts et al., 2016), which contains finely and coarsely annotated parts. The finely annotated part comprises 5000 city images of size 1024 × 2048, covering pedestrians, vehicles, etc., for a total of 19 recorded categories. In the experiments, 2975 images are used for training, and 500 and 1525 images are allocated for validation and testing. Considering
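Since the results throughout are reported as mIoU, a minimal sketch of the usual definition may help: per-class intersection-over-union, averaged over the classes that occur in either prediction or ground truth. This follows the common evaluation convention; the paper's own evaluation code is not shown here.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:          # class absent from both maps: skip it
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1], [1, 2, 2]])
gt   = np.array([[0, 1, 1], [1, 2, 2]])
print(round(mean_iou(pred, gt, num_classes=3), 4))  # → 0.7222
```

On Cityscapes this average runs over the 19 evaluated categories, so a single badly segmented rare class (e.g. a thin pole) can pull the score down noticeably, which is why edge-aware refinement matters for this metric.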

Conclusion

In this article, we design a joint object body and edge network structure to enhance the segmentation of scene images. On the basis of the original deep ResNet-50, on the one hand, the flow structure is used to maintain the internal consistency of the object, and the body context module is designed to take advantage of the local and global context to reduce the misclassification of similar objects; on the other hand, in the edge attention module, the edges containing semantic information

Author statement

Xianfeng Ou received the B.S. degree in electronic information science and technology and M.S. degree in communication and information system from Xinjiang University, Urumchi, China, in 2006 and 2009, respectively. In 2015, he received his Ph.D. degree in communication and information system from Sichuan University, Chengdu, China. He was a visiting researcher at the Internet Media Group, Politecnico di Torino, Turin, Italy, from Jan. to Apr. 2014, working on distributed video coding and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by Hunan Provincial Natural Science Foundation (2020JJ4340, 2020JJ4343, 2020JJ5218), by the Scientific Research Fund of Education Department of Hunan Province (19B245, 20B266), by Hunan Graduate Education Innovation Project and Professional Ability Improvement Project (CX20201114), by the Hunan Emergency Communication Engineering Technology Research Center (2018TP2022), by the Engineering Research Center on 3D Reconstruction and Intelligent Application Technology of


References (45)

  • M. Cordts et al.

    The cityscapes dataset for semantic urban scene understanding

  • D. Feng et al.

    Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges

    IEEE Transactions on Intelligent Transportation Systems

    (2020)
  • R.J. Franklin et al.

    Anomaly detection in videos for video surveillance applications using neural networks

  • J. Fu et al.

    Adaptive context network for scene parsing

  • J. Fu et al.

    Dual attention network for scene segmentation

  • A. Geiger et al.

    Are we ready for autonomous driving? the kitti vision benchmark suite

  • L. Hong et al.

    LAENet: Light-weight asymmetric encoder-decoder network for semantic segmentation

    Journal of Physics: Conference Series

    (2021)
  • P.Y. Huang et al.

    Efficient uncertainty estimation for semantic segmentation in videos

  • A. Kirillov et al.

    Panoptic feature pyramid networks

  • M. Klingner et al.

    Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance

  • Kong, S., & Fowlkes, C. (2018). Pixel-wise attentional gating for parsimonious pixel labeling. arXiv preprint...
  • H. Li et al.

    Dfanet: Deep feature aggregation for real-time semantic segmentation


    Hanpu Wang received his B.S. degree in computer science engineering from Nanjing Institute of Technology in 2019. He is now a postgraduate student in the Machine Vision and Artificial Intelligence Research Center of Hunan Institute of Science and Technology. His main research interests include image processing and semantic segmentation.

    Wujing Li received the B.E. degree in software engineering and Ph.D. degree in computer science and technology from Sichuan University, Chengdu, China, in 2007 and 2012 respectively. His main research interests include image denoising, image restoration, and image enhancement.

    Guoyun Zhang is a professor with Hunan Institute of Science and Technology. He received the B.S. degree in Automation from Xiangtan University in 1993. Then he received his M.S. degree and Ph.D. degree in control theory and control engineering from Hunan University, Changsha, China, in 2000 and 2003, respectively. He was a visiting researcher at George Fox University from Jan. to Jun. 2014. His research interests include image processing, computer vision and pattern recognition.

    Siyuan Chen holds a Ph.D. in Civil Engineering from University College Dublin as a Marie Curie Fellow and an M.Sc. degree in Mechanical and Aerospace Engineering from Syracuse University. Before receiving his Ph.D., he worked as a research assistant at Columbia University in numerical modeling, and before that at Tsinghua University as a project manager. He received his bachelor's degree in Mechanical Engineering from the Beijing Institute of Petrochemical Technology. His research experience includes UAV inspection, 3D data analysis, fluid analysis, mechanical design and 3D printing.
