Text-based indoor place recognition with deep neural network
Introduction
Place recognition is one of the key issues in semantic mapping research. Its fundamental purpose is to enable service robots to perceive the environment in a human-like way. In general, a specific place can be defined by the objects inside it and a series of related tasks performed with those objects. A place can therefore be recognized through the positional relationships and attributes of objects, or even the state of people in the environment. To date, many approaches have been proposed from different aspects to address this issue [1]. However, most existing methods focus on visual information alone without fully exploiting the rich semantic content of images. Such image features capture only simple characteristics, e.g. texture, color, and geometric structure, so it is hard to accurately determine the type of place from them. On the other hand, current research considers object information from only a single aspect, such as category or positional relationship, and therefore cannot simulate the process by which humans perceive a place. These representations fail to describe a specific indoor place in the way it is actually defined. Meanwhile, there is currently no effective way to obtain the semantic characteristics of objects, and doing so still faces great difficulties. Extracting semantic cues for place recognition thus remains an open problem, because semantic information is far more complex than visual information.
As for models describing object attributes and relationships, following the rapid development of deep learning for image description and image captioning, some researchers have turned to representing object information via natural language models. For example, a hierarchical recurrent network was presented in [2] to generate entire paragraphs for image description. Xu et al. [3] focused on the logical relationships between objects in an image and generated clearer semantic words. Similar research [4], [5] indicates that the attributes, status, and relationships of objects in an image can be described by a natural language model, which opens a new way to represent objects for place perception.
Natural language is another form of information representation, one that aligns well with human cognitive processes. It can discard redundant information and highlight the intrinsic attributes of objects. Converting image information into a text representation is therefore beneficial for classification and inference. In this paper, we innovatively consider both the object properties and the positional relationships within a place, which is theoretically more intuitive than traditional methods that consider only image features. In particular, our approach learns a priori text-based knowledge of object attributes, which further helps to judge the type of place.
The contribution of this work is therefore a deep learning-based approach that merges the image features of an indoor environment with textual information converted from the image domain using an LSTM-CNN. With this extra textual information, our model significantly improves indoor scene recognition accuracy.
This paper is organized as follows: Section 2 provides an overview of related work in indoor place recognition, especially recent work using deep learning methods. In Section 3, we first analyse the basic scheme of place recognition, and then present an LSTM-CNN-based algorithm for processing object information in text form. Section 4 introduces the implementation details of our algorithm, including data processing and the model structure with its training process. Section 5 reports experimental results of our method, with a detailed comparison against other approaches. Finally, we conclude our work in Section 6.
Related work
Indoor place recognition remains a complicated problem due to the challenges of information integration and logical reasoning caused by the high variability of indoor environments. It is generally considered to consist of three basic steps, i.e. image acquisition, information representation, and place classification. For information representation, various feature extraction methods have been proposed.
Recently, place recognition has been extensively studied in
Overview of our method
As mentioned in Section 1, we attempt to utilize a natural language model to represent object information in images. Place recognition is therefore divided into three parts, as shown in Fig. 1. First, for an image of a place with categorical semantic label y, the object information, namely object status (denoted as set S), attributes (denoted as set A), and relationships (denoted as set R), is obtained by an image description approach and/or by manual input through a human-machine
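The conversion from the three descriptor sets S, A, and R into a flat text description can be sketched as follows. This is a minimal illustration, not the authors' implementation; the phrase templates and the `describe_place` helper are assumptions.

```python
def describe_place(status, attributes, relationships):
    """Join the three descriptor sets (S, A, R) into one text description
    that a downstream text classifier could consume."""
    phrases = []
    for obj, state in status:            # S: object status, e.g. ("lamp", "on")
        phrases.append(f"{obj} is {state}")
    for obj, attr in attributes:         # A: object attributes, e.g. ("bed", "white")
        phrases.append(f"{attr} {obj}")
    for subj, rel, obj in relationships: # R: relationships, e.g. ("pillow", "on", "bed")
        phrases.append(f"{subj} {rel} {obj}")
    return " , ".join(phrases)

text = describe_place(
    status=[("lamp", "on")],
    attributes=[("bed", "white"), ("pillow", "soft")],
    relationships=[("pillow", "on", "bed")],
)
# text now reads: "lamp is on , white bed , soft pillow , pillow on bed"
```

The resulting token string plays the role of the textual input that is later embedded and classified.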
Data preprocessing
Since we focus on the semantic information of a place, the model is assumed to be able to learn semantic representations. We utilize the Visual Genome dataset [28] to generate the description corpus, i.e., a set collecting the rich annotations of objects, attributes, and relationships within each image.
As shown in Fig. 1(a), we take a bedroom with its annotations as an example to illustrate the descriptors we are concerned with. Here the objects in the
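Extracting the object, attribute, and relationship annotations from a Visual Genome-style record can be sketched as below. The field names follow the public Visual Genome JSON dumps but should be treated as assumptions here, and the record itself is a toy example rather than data from the paper.

```python
import json

# Toy Visual Genome-style record (illustrative, not real dataset content).
record = json.loads("""{
  "image_id": 1,
  "objects": [{"object_id": 10, "names": ["bed"]},
              {"object_id": 11, "names": ["pillow"]}],
  "attributes": [{"object_id": 10, "attributes": ["white"]}],
  "relationships": [{"subject_id": 11, "predicate": "on", "object_id": 10}]
}""")

# Resolve object ids to names, then collect (object, attribute) pairs
# and (subject, predicate, object) triples.
names = {o["object_id"]: o["names"][0] for o in record["objects"]}
attrs = [(names[a["object_id"]], a["attributes"][0]) for a in record["attributes"]]
rels = [(names[r["subject_id"]], r["predicate"], names[r["object_id"]])
        for r in record["relationships"]]
```

For this record, `attrs` yields `("bed", "white")` and `rels` yields `("pillow", "on", "bed")`, which correspond to the sets A and R described above.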
Parameter settings
In our experiments, we choose five indoor categories, i.e. kitchen, bedroom, living room, bathroom, and office, each containing 50 images from Visual Genome. The annotation contents constitute the prior knowledge our model needs to learn. Table 1 shows the statistics of the dataset, and some examples of each category are illustrated in Fig. 3.
According to the Word2Vec documentation, there are four key parameters in the
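The snippet is cut off before the four parameters are listed; as a hedged illustration, a typical Word2Vec configuration (parameter names follow the gensim `Word2Vec` API and are an assumption, not the paper's reported settings) might look like:

```python
# Illustrative Word2Vec hyperparameters (gensim naming); the values here
# are placeholders, not the settings reported in the paper.
w2v_params = {
    "vector_size": 100,  # dimensionality of the word embeddings
    "window": 5,         # context window around each target word
    "min_count": 1,      # keep rare annotation tokens
    "sg": 1,             # 1 = skip-gram, 0 = CBOW
}
```

These four knobs control the embedding dimension, context size, vocabulary pruning, and training algorithm respectively.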
Conclusion
This paper presents a strategy for text-based indoor place recognition that applies an LSTM-CNN network structure. Classification accuracy and other quantitative metrics of our method are evaluated; the experimental results indicate that our approach is effective for the place recognition problem. In future work, we will focus on exploring the semantic connections between objects and human state holistically within a place. In addition, the generalization performance of the proposed model needs further
Declaration of Competing Interest
None
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61573097, 61671151 and 91748106), in part by Key Laboratory of Integrated Automation of Process Industry (PAL-N201704), the Natural Science Foundation of Jiangsu Province (BK20181265), the Fundamental Research Funds for the Central Universities (3208008401), the Qing Lan Project and Six Major Top-talent Plan, and in part by the Priority Academic Program Development of Jiangsu Higher Education
References (34)
- et al., "A detailed analysis of a new 3D spatial feature vector for indoor scene classification," Robot. Autonom. Syst., 2014.
- et al., "Visual place recognition: a survey," IEEE Trans. Robot., 2016.
- et al., "A hierarchical approach for generating descriptive image paragraphs," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- et al., "Scene graph generation by iterative message passing," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- et al., "Visual relationship detection with language priors," Proceedings of the European Conference on Computer Vision (ECCV), 2016.
- et al., "Scaling human-object interaction recognition through zero-shot learning," Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
- et al., "Supervised learning of places from range data using AdaBoost," Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2005.
- et al., "Scene classification using unsupervised neural networks for mobile robot vision," Proceedings of the SICE Annual Conference (SICE), 2012.
- et al., "Multi-class classification for semantic labeling of places," Proceedings of the 11th International Conference on Control Automation Robotics & Vision (ICARCV), 2010.
- "PLISS: labeling places using online changepoint detection," Autonom. Rob., 2012.
- "On robot indoor scene classification based on descriptor quality and efficiency," Expert Syst. Appl.
- "Indoor scene classification by incorporating predicted depth descriptor."
- "Place categorization through object classification," Proceedings of the IEEE International Conference on Imaging Systems and Techniques (IST).
- "Joint categorization of objects and rooms for mobile robots," Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- "Automated place classification using object detection," Proceedings of the Canadian Conference on Computer and Robot Vision (CRV).
- "Integrated optimal topology design and shape optimization using neural networks," Struct. Multidiscipl. Optim.
- "Neuro-genetic design optimization framework to support the integrated robust design optimization process in CE," Concurr. Eng.
Pei Li is currently pursuing his Master's and Ph.D. degrees in Control Science and Engineering at the School of Automation, Southeast University. He received his Bachelor's degree in Automation from Tongji University. His main research interests include place perception, deep learning, and natural language processing.
Xinde Li received his Ph.D. from the Department of Control, Huazhong University of Science and Technology in June 2007. In December of the same year, he worked in the School of Automation, Southeast University. From January 2012 to January 2013, he visited Georgia Polytechnic University as a national visiting scholar for one year. From January 2016 to the end of August 2016, he worked as a research fellow in the Department of ECE, National University of Singapore. His main research interests include intelligent robots, machine vision perception, machine learning, human-computer interaction, intelligent information fusion and artificial intelligence.
Hong Pan is associate researcher at the School of Automation, Southeast University. In 2004, he graduated from Southeast University with his Ph.D. in pattern recognition and intelligent systems. From September 2004 to August 2006, he worked as a research associate at the Multimedia Signal Processing Center of the Hong Kong Polytechnic University, where he worked on image sparse transform and coding. His research interests include machine learning, deep learning, computer vision, medical image processing and analysis, multimedia signal processing (image/video codec, retrieval and analysis).
Mohammad Omar Khyam received the B.Sc. degree in electronics and telecommunication engineering from the Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh, in 2010, and the Ph.D. degree from the University of New South Wales, Australia, in 2015. He is currently a Lecturer with the Central Queensland University, Australia. His research interests include signal processing and wireless communication.
Md. Noor-A-Rahim received the Ph.D. degree from Institute for Telecommunications Research, University of South Australia, Adelaide, SA, Australia, in 2015. He was a Postdoctoral Research Fellow with the Centre for Infocomm Technology, Nanyang Technological University, Singapore. He is currently a Senior Postdoctoral Researcher (MSCA Fellow) with the School of Computer Science and IT, University College Cork, Cork, Ireland. His research interests include control over wireless networks, intelligent transportation systems, information theory, signal processing, and DNA-based data storage. He was the recipient of the Michael Miller Medal from the Institute for Telecommunications Research, University of South Australia, for the most Outstanding Ph.D. Thesis in 2015.
1. Current Affiliation: School of Engineering and Technology, Central Queensland University, Melbourne, Australia.