Elsevier

Neurocomputing

Volume 390, 21 May 2020, Pages 239-247

Text-based indoor place recognition with deep neural network

https://doi.org/10.1016/j.neucom.2019.02.065

Abstract

Indoor place recognition is a challenging problem because of the difficulty of representing complicated intra-class variations and inter-class similarities. This paper presents a new indoor place recognition scheme using a deep neural network. Traditional representations of indoor places mostly rely on image features to retain the spatial structure, without considering the objects' semantic characteristics. However, we argue that the attributes, states and relationships of objects are much more helpful for indoor place recognition. In particular, we improve the recognition framework by utilizing Place Descriptors (PDs) in text form to connect different types of place information with their categories. Meanwhile, we analyse the ability of Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models to classify natural language, and use them to process the indoor place descriptions. In addition, we improve the robustness of the designed deep neural network by combining a number of effective strategies, i.e. L2-regularization, data normalization, and proper calibration of key parameters. Compared with the existing state of the art, the proposed approach achieves good performance of 70.73%, 70.08% and 70.16% in accuracy, precision and recall on the Visual Genome database, respectively. Moreover, the accuracy reaches 98.6% after adding a voting mechanism.
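The abstract does not detail the voting mechanism; a minimal sketch, assuming it is a simple majority vote that aggregates the per-description predictions for one scene into a single place label (the function name and sample data are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent predicted label among per-description predictions."""
    counts = Counter(predictions)
    # most_common(1) returns [(label, count)] for the top label
    return counts.most_common(1)[0][0]

# Illustrative per-description predictions for one scene
preds = ["kitchen", "bedroom", "kitchen", "kitchen", "living room"]
print(majority_vote(preds))  # -> kitchen
```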

Introduction

Place recognition is one of the key issues in the semantic map research area. Its fundamental purpose is to enable service robots to perceive the environment in a human-like way. In general, a specific place can be defined by the objects inside it and a series of related tasks occurring around those objects. Therefore, a place can be recognized through the positional relationships and attributes of objects, or even people's states in the environment. To date, many approaches have been proposed from different aspects to address this issue [1]. However, most existing methods focus on visual information itself without fully utilizing the rich semantic content in images. These image features only contain simple characteristics, e.g. texture, color, and geometric structure, so it is hard to accurately determine the type of place. On the other hand, current research considers object information only from a single aspect, such as category or positional relationships, and is still unable to simulate the process of human perception of a place. Such representation methods cannot describe a specific indoor place in the way the definition above suggests. Meanwhile, there is currently no effective way to obtain the semantic characteristics of objects, and doing so still faces great difficulties. Therefore, extracting semantic cues for place recognition remains an open problem, because semantic information is much more complex than visual information.

As for the description model of objects' attributes and relationships, witnessing the recent rapid development of deep learning applications in image description and image captioning, some researchers tend to represent object information via a natural language model. For example, a hierarchical recurrent network was presented in [2] to generate entire paragraphs for image description. Xu et al. [3] focused on the logical relationships between objects in the image and generated clearer semantic words. Similar research [4], [5] indicates that the attributes, status as well as relationships of objects in an image can be described by a natural language model, which brings a new way to object representation in place perception.

Natural language is another form of information representation that aligns well with the human cognitive process. It can ignore redundant information and highlight the intrinsic attributes of objects. Therefore, converting image information into a text representation is beneficial for classification and inference. In this paper, we innovatively consider both the objects' properties and their positional relationships in the place, which has more theoretical intuition than traditional methods that only consider image features. In particular, our approach learns a priori text-based knowledge of object attributes, which is more helpful for judging the type of place.

Therefore, the contribution of this work is a deep learning-based approach that merges the image features of an indoor environment with textual information converted from the image domain using an LSTM-CNN. With such extra textual information, our new model significantly improves the recognition accuracy of indoor scenes.

This paper is organized as follows: Section 2 provides an overview of related work in indoor place recognition, especially recent work using deep learning methods. In Section 3, we first analyse the basic scheme of place recognition, and then present an algorithm based on the LSTM-CNN for processing object information in text form. Section 4 introduces the implementation details of our algorithm, including data processing and the model structure with its training process. Section 5 reports experimental results of our method, with a detailed comparison against other approaches. Finally, we conclude our work in Section 6.


Related work

Indoor place recognition is still a complicated problem due to the challenges of information integration and logical reasoning caused by the high variability of indoor environments. It is generally considered that indoor place recognition mainly consists of three basic steps, i.e. image acquisition, information representation, and place classification. For information representation, different feature extraction methods have been proposed.

Recently, place recognition has been extensively studied in

Overview of our method

As mentioned in Section 1, we attempt to utilize the natural language model to represent object information in images. Therefore, place recognition is divided into three parts, as shown in Fig. 1. Firstly, for an image containing a place with its categorical semantic label y, the object information such as object status (denoted as set S), attributes (denoted as set A), and relationships (denoted as set R) is obtained by an image description approach and/or by manual input through human machine
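The grouping into the sets S, A and R can be made concrete with a small sketch. Assuming each set holds short natural-language phrases (the concatenation format below is illustrative, not the paper's actual descriptor format), a text place descriptor might be assembled as:

```python
def build_descriptor(status, attributes, relationships):
    """Concatenate status, attribute and relationship phrases into one
    text place descriptor; sorting keeps the output deterministic."""
    parts = sorted(status) + sorted(attributes) + sorted(relationships)
    return ". ".join(parts) + "."

# Illustrative sets for a bedroom scene
S = {"lamp is on"}
A = {"bed is large", "pillow is white"}
R = {"lamp near bed", "pillow on bed"}
print(build_descriptor(S, A, R))
```

The resulting string can then be fed to the text classifier just like any other sentence.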

Data preprocessing

Since we focus on the semantic information of places, the model is assumed to be able to learn semantic representations. We utilize the Visual Genome dataset [28] to generate the description corpus, i.e. set D, which collects rich annotations of objects, attributes, and relationships within each image.
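How annotations become corpus sentences can be sketched as follows. The record layout below is a deliberate simplification for illustration, not the actual Visual Genome JSON schema:

```python
def annotations_to_sentences(record):
    """Convert one simplified Visual Genome-style record into short
    descriptor sentences for the corpus D."""
    sentences = []
    # Attribute annotations: (object, attribute) pairs
    for obj, attr in record.get("attributes", []):
        sentences.append(f"{obj} is {attr}")
    # Relationship annotations: (subject, predicate, object) triples
    for subj, pred, obj in record.get("relationships", []):
        sentences.append(f"{subj} {pred} {obj}")
    return sentences

record = {
    "attributes": [("bed", "large"), ("lamp", "bright")],
    "relationships": [("pillow", "on", "bed")],
}
print(annotations_to_sentences(record))
```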

As shown in Fig. 1(a), we take the bedroom with its annotations as an example to illustrate the descriptors we are concerned with. Here the objects in the

Parameter settings

In our experiment, we choose five indoor categories, i.e. kitchen, bedroom, living room, bathroom and office, each containing 50 images from Visual Genome. The contents of these annotations are the prior knowledge that our model needs to learn. Table 1 shows the statistics of the dataset, and some examples for each category are illustrated in Fig. 3.

According to the instructions of Word2Vec, there are four key parameters in the
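The snippet cuts off before the parameters are listed; typical Word2Vec parameters include the embedding vector size, the context window, the minimum word count, and the choice of training algorithm (skip-gram vs. CBOW). As one illustration, the context window determines which (center, context) word pairs the model trains on, which can be sketched without any library dependency as:

```python
def context_pairs(tokens, window):
    """Generate (center, context) training pairs the way a Word2Vec-style
    model would, using `window` words of context on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(context_pairs(["the", "bed", "is", "large"], window=1))
```

A larger window captures broader topical context at the cost of more training pairs per sentence.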

Conclusion

This paper presents a strategy for text-based indoor place recognition that applies an LSTM-CNN network structure. Classification accuracy and other quantitative metrics of our method are evaluated; the experimental results indicate that our approach is effective for the place recognition problem. For future work, we will focus on exploring the semantic connections of objects and human states holistically in one place. Besides, the generalization performance of the proposed model needs further

Declaration of Competing Interest

None

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61573097, 61671151 and 91748106), in part by Key Laboratory of Integrated Automation of Process Industry (PAL-N201704), the Natural Science Foundation of Jiangsu Province (BK20181265), the Fundamental Research Funds for the Central Universities (3208008401), the Qing Lan Project and Six Major Top-talent Plan, and in part by the Priority Academic Program Development of Jiangsu Higher Education

Pei Li is currently pursuing Ph.D. and Master's degrees in Control Science and Engineering at the School of Automation, Southeast University. He received his Bachelor's degree in Automation from Tongji University. His main research interests include place perception, deep learning, and natural language processing.

References (34)

  • A. Swadzba et al.

    A detailed analysis of a new 3D spatial feature vector for indoor scene classification

    Robot. Auton. Syst.

    (2014)
  • S. Lowry et al.

    Visual place recognition: a survey

    IEEE Trans. Robot.

    (2016)
  • J. Krause et al.

    A hierarchical approach for generating descriptive image paragraphs

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2017)
  • D. Xu et al.

    Scene graph generation by iterative message passing

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • C. Lu et al.

    Visual relationship detection with language priors

    Proceedings of the European Conference on Computer Vision

    (2016)
  • L. Shen et al.

    Scaling human-object interaction recognition through zero-shot learning

    Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)

    (2018)
  • O.M. Mozos et al.

    Supervised learning of places from range data using adaboost

    Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)

    (2005)
  • H. Madokoro et al.

    Scene classification using unsupervised neural networks for mobile robot vision

    Proceedings of the SICE Annual Conference (SICE)

    (2012)
  • L. Shi et al.

    Multi-class classification for semantic labeling of places

    Proceedings of the 11th International Conference on Control Automation Robotics & Vision (ICARCV)

    (2010)
  • A. Ranganathan

    Pliss: labeling places using online changepoint detection

    Auton. Robot.

    (2012)
  • C. Romero-González et al.

    On robot indoor scene classification based on descriptor quality and efficiency

    Expert Syst. Appl.

    (2017)
  • Y. Zheng et al.

    Indoor scene classification by incorporating predicted depth descriptor

  • K. Charalampous et al.

    Place categorization through object classification

    Proceedings of the IEEE International Conference on Imaging Systems and Techniques (IST)

    (2014)
  • J.-R. Ruiz-Sarmiento et al.

    Joint categorization of objects and rooms for mobile robots

    Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    (2015)
  • P. Viswanathan et al.

    Automated place classification using object detection

    Proceedings of the Canadian Conference on Computer and Robot Vision (CRV)

    (2010)
  • A. Yildiz et al.

    Integrated optimal topology design and shape optimization using neural networks

    Struct. Multidiscipl. Optim.

    (2003)
  • N. Öztürk et al.

    Neuro-genetic design optimization framework to support the integrated robust design optimization process in CE

    Concurr. Eng.

    (2006)


    Xinde Li received his Ph.D. from the Department of Control, Huazhong University of Science and Technology in June 2007. In December of the same year, he worked in the School of Automation, Southeast University. From January 2012 to January 2013, he visited Georgia Polytechnic University as a national visiting scholar for one year. From January 2016 to the end of August 2016, he worked as a research fellow in the Department of ECE, National University of Singapore. His main research interests include intelligent robots, machine vision perception, machine learning, human-computer interaction, intelligent information fusion and artificial intelligence.

    Hong Pan is an associate researcher at the School of Automation, Southeast University. In 2004, he graduated from Southeast University with a Ph.D. in pattern recognition and intelligent systems. From September 2004 to August 2006, he worked as a research associate at the Multimedia Signal Processing Center of the Hong Kong Polytechnic University, where he worked on image sparse transform and coding. His research interests include machine learning, deep learning, computer vision, medical image processing and analysis, and multimedia signal processing (image/video codec, retrieval and analysis).

    Mohammad Omar Khyam received the B.Sc. degree in electronics and telecommunication engineering from the Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh, in 2010, and the Ph.D. degree from the University of New South Wales, Australia, in 2015. He is currently a Lecturer with the Central Queensland University, Australia. His research interests include signal processing and wireless communication.

    Md. Noor-A-Rahim received the Ph.D. degree from Institute for Telecommunications Research, University of South Australia, Adelaide, SA, Australia, in 2015. He was a Postdoctoral Research Fellow with the Centre for Infocomm Technology, Nanyang Technological University, Singapore. He is currently a Senior Postdoctoral Researcher (MSCA Fellow) with the School of Computer Science and IT, University College Cork, Cork, Ireland. His research interests include control over wireless networks, intelligent transportation systems, information theory, signal processing, and DNA-based data storage. He was the recipient of the Michael Miller Medal from the Institute for Telecommunications Research, University of South Australia, for the most Outstanding Ph.D. Thesis in 2015.

    1 Current Affiliation: School of Engineering and Technology, Central Queensland University, Melbourne, Australia.
