Semantic segmentation for multiscale target based on object recognition using the improved Faster-RCNN model

https://doi.org/10.1016/j.future.2021.04.019

Highlights

  • A multi-task semantic segmentation model is constructed to perform image semantic segmentation and object recognition simultaneously.

  • The RGB-D information fusion method is improved, which benefits the stability and training speed of the model.

  • The model can realize the semantic segmentation of multi-scale objects in indoor scenes.

Abstract

Image semantic segmentation has received great attention in computer vision. Its aim is to segment the different objects in an image and assign each a semantic category label, so that a computer can fully obtain the semantic information of the scene. However, current research mainly uses color images as training data, focuses on outdoor scenes, and addresses single-task semantic segmentation. This paper builds a multi-task semantic segmentation model for complex indoor environments with joint target detection, using RGB-D image information and based on an improved Faster-RCNN algorithm; the model simultaneously performs indoor scene semantic segmentation, target classification, and detection. To counter the influence of uneven lighting in the environment, the method of fusing RGB images and depth images is improved: the fused image features are enriched while the efficiency of model training also improves. At the same time, to handle multi-scale target objects, the non-maximum suppression (NMS) algorithm is improved, raising the model's performance. To produce the model's multi-task outputs, the loss function is also redesigned and optimized. The indoor scene semantic segmentation model constructed in this paper not only performs well and efficiently, but also segments the contours of objects at different scales clearly and adapts to uneven indoor lighting.

Introduction

As one of the most convenient ways to obtain object information, images play an important role in information transmission. Computer Vision (CV) technology is currently developing rapidly thanks to the rise of Artificial Intelligence (AI) and Deep Learning (DL). This technology allows intelligent robots to observe and understand the world through vision, as humans do, and to adapt to their environment autonomously [1], [2], [3]. For example, the combination of learning and image processing helps humans fight diseases such as coronavirus disease (COVID-19) [4] and enables automatic driving [5]. Together they can also improve the efficiency of urban management, helping to establish disaster prevention mechanisms [6].

Image semantic segmentation is a basic research direction in computer vision [7], [8]. Its purpose is to segment the targets in a scene and label every pixel in the image with semantic information, marking targets with different colors according to their semantic category. Here, semantics refers to the category names of the different targets in the image, which is extremely important for image understanding, target recognition and detection, and object tracking. At the same time, target detection is inextricably bound up with semantic segmentation: the former obtains the position information of an object, while the latter considers not just the position but also the content information. Combining them enables more refined vision tasks, which are widely used in mobile robots, smart security, unmanned driving, industrial vision, virtual reality, and other fields.

Compared with image segmentation for a single object, scene semantic segmentation must handle harder problems when providing a predefined semantic category label for each pixel of a scene image or video [9], [10]. Indoor scenes pose many challenges, such as abundant semantic categories, mutual occlusion, uneven lighting [11], [12], and the similarity of diverse objects. Moreover, semantic segmentation consumes substantial computing resources, so improving algorithm efficiency is essential, especially on edge devices [13]. With the wide application of service robots, indoor scene understanding, which is closely related to indoor scene semantic segmentation, has attracted the attention of many researchers. Accordingly, for the process of fusing color information and depth information, this paper contributes a way to reduce the amount of fused information, so as to cut down the computation required of the equipment.
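This excerpt does not specify the improved fusion scheme, so the sketch below shows only one lightweight possibility: early fusion that normalizes the depth map and stacks it as a fourth input channel, which keeps the fused input compact. All function and variable names are illustrative, not the paper's.

import numpy as np

def fuse_rgbd(rgb, depth):
    """Fuse an HxWx3 RGB image and an HxW depth map into an HxWx4 array."""
    # Normalize both modalities to [0, 1] so neither dominates training.
    rgb = rgb.astype(np.float32) / 255.0
    depth = depth.astype(np.float32)
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    # Stack depth as a fourth channel alongside R, G, B.
    return np.concatenate([rgb, depth[..., None]], axis=-1)

# Example with a Kinect-sized frame pair.
fused = fuse_rgbd(np.zeros((480, 640, 3), np.uint8),
                  np.zeros((480, 640), np.uint16))
print(fused.shape)  # (480, 640, 4)

Compared with running two full feature extractors, a single four-channel input of this kind keeps the volume of fused information small, in line with the efficiency goal stated above.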

The key contributions of this work are as follows:

(i) A multi-task semantic segmentation model for multiscale targets is proposed on the basis of the improved Faster-RCNN model.

(ii) A fusion method of depth image and color image is applied to improve the performance of the model.

(iii) An improved NMS method is proposed for better selection of local candidate regions; a baseline sketch of the standard NMS it builds on follows this list.

(iv) The performance of the proposed algorithm is analyzed and verified with results of several experiments.
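The details of the improved NMS are not shown in this excerpt, so the following is only a minimal sketch of the standard greedy non-maximum suppression used in Faster-RCNN, i.e. the baseline the paper's method modifies for multi-scale targets. All names are illustrative.

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS; boxes is an (N, 4) array of [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-8)
        # Keep only boxes whose overlap with the winner is below threshold.
        order = order[1:][iou < iou_thresh]
    return keep

One well-known multi-scale-friendly variant, Soft-NMS, decays the scores of overlapping boxes instead of discarding them outright; whether the paper's improvement follows this line cannot be confirmed from the excerpt.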

This article consists of five parts. Part 2 reviews the relevant research results and discusses the rationality of the methods. The third part illustrates the data set and the pre-processing methods. The next part presents the experimental results, and the final part concludes the paper.

Section snippets

Related works

Before deep learning technology became widespread, traditional image semantic segmentation mainly performed operations on the target area of the image, using hand-designed feature extractors to extract features such as texture, color, and shape, which were then sent to a classifier (such as an SVM) or another intelligent algorithm to predict the target category in the image [14], [15], [16], [17], [18]. However, these methods often contain less information
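As a concrete illustration of that classical pipeline (hand-crafted features plus a classifier), the sketch below computes a simple color-histogram feature and trains a scikit-learn SVM on toy data; it stands in for no particular cited method, and all names and parameters are illustrative.

import numpy as np
from sklearn.svm import SVC

def color_histogram(image, bins=8):
    """Describe an HxWx3 image by its normalized per-channel histograms."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    hist = np.concatenate(feats).astype(np.float32)
    return hist / (hist.sum() + 1e-8)

# Toy patches and labels; a real pipeline would use labeled image regions.
rng = np.random.default_rng(0)
patches = rng.integers(0, 256, size=(40, 32, 32, 3))
labels = rng.integers(0, 2, size=40)
X = np.stack([color_histogram(p) for p in patches])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:3]))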

Semantic segmentation for multiscale objects fusing color and depth information

This paper built a multi-task semantic segmentation model for multiscale objects in indoor scenes with joint target detection, which detects the target's location information while interpreting the target's semantic information and outputting the confidence of target object detection. In addition, by fusing depth image information, not only can the influence of indoor lighting be overcome, but also more comprehensive image features can be extracted, improving the training speed
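The redesigned loss function itself is not shown in this excerpt. As a hedged sketch, a standard multi-task formulation in the Faster-RCNN style combines a proposal classification term, a smooth-L1 bounding-box regression term, and a per-pixel segmentation term in a weighted sum; the weights and tensor names below are assumptions.

import tensorflow as tf

def multitask_loss(cls_logits, cls_labels, bbox_pred, bbox_targets,
                   seg_logits, seg_labels, w_det=1.0, w_box=1.0, w_seg=1.0):
    """Weighted sum of detection, box-regression, and segmentation losses."""
    # Object classification over region proposals (integer class labels).
    l_det = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=cls_labels, logits=cls_logits))
    # Huber (smooth-L1) loss on bounding-box offsets, as in Faster-RCNN.
    l_box = tf.reduce_mean(tf.keras.losses.huber(bbox_targets, bbox_pred))
    # Per-pixel cross-entropy for the semantic segmentation head.
    l_seg = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=seg_labels, logits=seg_logits))
    return w_det * l_det + w_box * l_box + w_seg * l_seg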

Data set preparation

This paper used a Kinect color-depth camera to collect indoor scene images at various angles and under various illumination backgrounds, and constructed an indoor scene RGB-D dataset as the experimental data. The data set had 2900 color images and 2900 depth images, of which 2100 pairs formed the training set and the rest the test set. The scene images contain four types of common objects: Chair, Book, Table and Keyboard; everything else was regarded as Background (Fig. 8).
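To show how such paired data and the 2100/800 split could be fed to training, here is a hedged tf.data sketch; the directory layout, PNG format, and 16-bit depth encoding are assumptions, not details from the paper.

import tensorflow as tf

# Hypothetical layout: paired color/depth PNGs with matching file names.
rgb_files = sorted(tf.io.gfile.glob("dataset/rgb/*.png"))
depth_files = sorted(tf.io.gfile.glob("dataset/depth/*.png"))

def load_pair(rgb_path, depth_path):
    rgb = tf.io.decode_png(tf.io.read_file(rgb_path), channels=3)
    depth = tf.io.decode_png(tf.io.read_file(depth_path),
                             channels=1, dtype=tf.uint16)
    # convert_image_dtype rescales both modalities into [0, 1].
    rgb = tf.image.convert_image_dtype(rgb, tf.float32)
    depth = tf.image.convert_image_dtype(depth, tf.float32)
    return tf.concat([rgb, depth], axis=-1)  # 4-channel fused input

ds = tf.data.Dataset.from_tensor_slices((rgb_files, depth_files))
ds = ds.map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = ds.take(2100).shuffle(512).batch(8).prefetch(tf.data.AUTOTUNE)
test_ds = ds.skip(2100).batch(8)  # the paper's 2100/800 split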

This article used TensorFlow to

Conclusion

In order to realize semantic segmentation of multi-scale objects in indoor scenes, a multi-task semantic segmentation model is proposed, which can obtain not only the location information of objects but also their semantic information. The model is mainly improved on the basis of Faster-RCNN, for example by optimizing the NMS process and adding information fusion. The model was trained on the self-built indoor scene data set. The model is

CRediT authorship contribution statement

Du Jiang: Methodology, Software, Writing - original draft. Gongfa Li: Conceptualization, Software. Chong Tan: Data curation, Writing - original draft. Li Huang: Methodology, Software. Ying Sun: Writing - review & editing, Formal analyses. Jianyi Kong: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by grants of National Natural Science Foundation of China (Grant Nos. 52075530, 51575407, 51505349, 61733011, 41906177); the Grants of Hubei Provincial Department of Education, China (D20191105); the Grants of National Defense Pre-Research Foundation of Wuhan University of Science and Technology, China (GF201705) and Open Fund of the Key Laboratory for Metallurgical Equipment and Control of Ministry of Education in Wuhan University of Science and Technology, China (


References (35)

  • Jiang D. et al., Manipulator grabbing position detection with information fusion of color image and depth image using deep learning, J. Ambient Intell. Hum. Comput. (2021).

  • Sun Y. et al., Gesture recognition algorithm based on multi-scale feature fusion in RGB-D images, IET Image Process. (2020).

  • Duan H. et al., Gesture recognition based on multi-modal feature weight, Concurr. Comput.: Pract. Exper. (2020).

  • Zhou Z. et al., BEGIN: Big data enabled energy-efficient vehicular edge computing, IEEE Commun. Mag. (2018).

  • Jiang D. et al., Gesture recognition based on skeletonization algorithm and CNN with ASL database, Multimedia Tools Appl. (2019).

  • Jiang D. et al., Grip strength forecast and rehabilitative guidance based on adaptive neural fuzzy inference system using sEMG, Pers. Ubiquitous Comput. (2019).

  • Jiang D. et al., Gesture recognition based on binocular vision, Cluster Comput. (2019).

    Du Jiang received the B.S. degree in mechanical engineering and automation from Wuhan University of Science and Technology, Wuhan, China, in 2017. He is currently pursuing the Ph.D. degree in mechanical design and theory at Wuhan University of Science and Technology. His current research interests include image processing and intelligent control.

    Gongfa Li received the Ph.D. degree in Wuhan University of Science and Technology, Wuhan, China. He is currently a professor in Wuhan University of Science and Technology. His major research interests are computer aided engineering, mechanical CAD/CAE, modeling and optimal control of complex industrial process.

    Chong Tan received the M.S. degree in mechanical engineering and automation from Wuhan University of Science and Technology, Wuhan, China, in 2020. His current research interests include image processing and intelligent control.

    Li Huang is currently an associate professor of computer science in the School of Computer Science and Technology, WUST, Wuhan, China. She received her Ph.D. in computer science from Huazhong University of Science and Technology in 2011. Her research interests include data management, the semantic web, and knowledge.

    Ying Sun is currently a professor in Wuhan University of Science and Technology. Her major research focuses on teaching research in Mechanical Engineering.

    Jianyi Kong received the Ph.D. degree from Helmut Schmidt University, Germany. He is currently a professor at Wuhan University of Science and Technology. His research interests are intelligent machines and controlled mechanisms, mechanical and dynamic design and fault diagnosis of electrical systems, mechanical CAD/CAE, and intelligent design and control.
