Neurocomputing

Volume 174, Part B, 22 January 2016, Pages 1101-1106

Detection based object labeling of 3D point cloud for indoor scenes

https://doi.org/10.1016/j.neucom.2015.10.005

Abstract

While much exciting progress is being made in 3D reconstruction of scenes, object labeling of 3D point clouds for indoor scenes remains a challenging issue. How should we exploit the reference images of a 3D scene to aid scene parsing? In this paper, we propose a framework for 3D indoor scene labeling based upon object detection on the RGB-D frames of the 3D scene. First, the point cloud is segmented into homogeneous segments. Then, we utilize object detectors to assign class probabilities to the pixels of every RGB-D frame. After that, the class probabilities are projected onto the segments. Finally, we perform accurate inference on an MRF model over the homogeneous segments, in combination with geometry cues, to output the labels. Experiments on the challenging RGB-D Object Dataset demonstrate that our detection-based approach produces accurate labeling and improves the robustness of small-object detection for indoor scenes.

Introduction

With the popularity of Kinect sensors and advances in 3D reconstruction techniques, we are facing an increasing amount of 3D point cloud data. Consequently, the demand for scene understanding is growing [6], [7]. Understanding 3D scenes is a fundamental problem in perception and robotics, as knowledge of the environment is a prerequisite for subsequent tasks such as route planning, augmented reality and mobile visual search [1], [2].

In the literature, a significant amount of work has been done on semantic labeling of pixels or regions in 2D images. In spite of this, semantic labeling of 3D point clouds remains an open problem. Its solution, however, can bring a breakthrough in a wide variety of computer vision and robotics research, with great potential in human-computer interaction, 3D object indexing and retrieval, and object manipulation in robotics [4], as well as exciting applications such as self-driving vehicles and semantic-aware augmented reality.

There is a great deal of literature on 3D scene labeling in indoor environments [5], [16]. Many of these methods operate directly on a 3D point cloud, since 3D point clouds contain important shape and spatial information, which allows modeling context and spatial relationships between objects. However, one source of information these methods fail to exploit is the reference images. Existing works [4], [3] have shown that multi-view images can be used to recognize objects of a 3D scene with higher accuracy.

In this paper, we focus on detection-based object labeling in 3D scenes. Specifically, we tackle the problem of object labeling in a 3D scene with the help of object detection results from the 2D reference images that cover parts of the scene. Our goal is to transfer such reliable 2D labeling results into 3D to enhance inference accuracy. Note that the proposed approach works in the scenario where a single point cloud is merged from multiple images. First, the point cloud is segmented into homogeneous segments. Then, we utilize object detectors to assign class probabilities to the pixels of every RGB-D frame. After that, the class probabilities are projected onto the segments. Finally, we perform accurate inference on an MRF model over the homogeneous segments, in combination with geometry cues, to output the labels.
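To make the projection step concrete, the following is a minimal sketch, in Python, of how per-pixel class probabilities from 2D detections might be accumulated into point-cloud segments. The helpers detect(frame) and frame.project(point) are hypothetical placeholders introduced for illustration, not functions from the paper.

    import numpy as np

    def project_probabilities(segments, frames, num_classes):
        """Accumulate per-class probabilities from 2D detections into 3D segments."""
        seg_probs = np.zeros((len(segments), num_classes))
        for frame in frames:
            pixel_probs = detect(frame)  # hypothetical: (H, W, C) per-pixel class probabilities
            for s, segment in enumerate(segments):
                for point in segment.points:
                    uv = frame.project(point)  # hypothetical: 3D point -> pixel (u, v), or None
                    if uv is not None:
                        u, v = uv
                        seg_probs[s] += pixel_probs[v, u]  # the pixel votes for its segment
        # Normalize so each segment carries a class distribution.
        seg_probs /= np.maximum(seg_probs.sum(axis=1, keepdims=True), 1e-9)
        return seg_probs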

The remainder of this paper is organized as follows. Section 2 surveys related work. The details of the proposed method are presented in Section 3. Experimental results and comparisons are provided in Section 4. Finally, we draw conclusions in Section 5.

Related work

With the popularity of Kinect cameras and 3D laser scanners, there has recently been an ever-increasing research focus on semantic labeling of 3D point clouds [8], [9], [5]. The application scenarios are either indoor or outdoor, with corresponding tasks such as labeling tables, beds, desks and computers, or trees, cars, roads and pedestrians. So far, most robotic scene understanding work has focused on 3D outdoor scenes, with related applications such as mapping and autonomous driving [10], [11], [12], [13].

DetSMRF

We now describe the proposed DetSMRF model, as shown in Fig. 1. The 3D point clouds we consider are captured from a set of RGB-D video frames. To reconstruct a 3D scene, RGB-D mapping [26] is employed to globally align and merge each frame with the scene under a rigid transform. The goal of this paper is to label small everyday objects of interest, which may comprise only a small part of the whole 3D point cloud.
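As a concrete illustration of the frame-merging step, the sketch below back-projects a depth image through a pinhole camera model and applies a per-frame rigid transform into scene coordinates, in the spirit of RGB-D mapping [26]; the intrinsics (fx, fy, cx, cy) and the 4x4 pose are assumed inputs, and the code is illustrative rather than the authors' implementation.

    import numpy as np

    def depth_to_scene(depth, fx, fy, cx, cy, pose):
        """Back-project a depth image (in meters) and move it into scene coordinates.

        pose is a 4x4 rigid transform globally aligning this frame with the scene.
        """
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth
        x = (u - cx) * z / fx  # pinhole back-projection
        y = (v - cy) * z / fy
        pts = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
        pts = pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels
        return (pose @ pts.T).T[:, :3]  # rigid transform into the merged scene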

DetSMRF is defined over a graph G = (V, E), where V is the set of vertices representing the homogeneous segments and E is the set of edges connecting adjacent segments.
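To illustrate inference over such a segment graph, the sketch below runs iterated conditional modes (ICM) with detection-based unary costs and a Potts-style pairwise term. This is a simple stand-in for the exact inference used in the paper; the smoothness weight and all names are illustrative.

    import numpy as np

    def icm_labeling(seg_probs, edges, smoothness=1.0, iters=10):
        """Greedy MRF inference (ICM) over the segment graph G = (V, E).

        seg_probs: (S, C) class distributions projected from the 2D detections.
        edges: list of (i, j) index pairs of adjacent segments.
        """
        unary = -np.log(seg_probs + 1e-9)  # detection-based unary costs
        labels = unary.argmin(axis=1)      # initialize at the unary optimum
        neighbors = {s: [] for s in range(len(seg_probs))}
        for i, j in edges:
            neighbors[i].append(j)
            neighbors[j].append(i)
        for _ in range(iters):
            for s in range(len(seg_probs)):
                cost = unary[s].copy()
                for n in neighbors[s]:  # Potts penalty for disagreeing with neighbors
                    cost += smoothness * (np.arange(len(cost)) != labels[n])
                labels[s] = cost.argmin()
        return labels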

Experiments

We experimentally evaluate our method primarily on the challenging RGB-D Object Dataset [31], which includes everyday objects captured individually on a turntable. We show that our scheme achieves competitive performance compared to the state-of-the-art method in terms of labeling accuracy. All parameter settings can be found in the preceding sections of this paper.

Data set and Ground Truth Labeling: The RGB-D Object Dataset includes 250,000 segmented RGB-D images of 300 objects in 51 categories.
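For completeness, overall and per-class point-labeling accuracy against such ground truth can be computed along the following lines; this is a generic evaluation sketch, not necessarily the exact protocol used in the paper.

    import numpy as np

    def labeling_accuracy(pred, gt, num_classes):
        """Overall accuracy and per-class accuracy for point labels."""
        overall = (pred == gt).mean()
        per_class = np.array([
            (pred[gt == c] == c).mean() if (gt == c).any() else np.nan
            for c in range(num_classes)
        ])
        return overall, per_class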

Conclusion

In this paper, we propose a detection-based scheme for 3D indoor scene labeling. Experimental results show that our approach achieves better accuracy than the state-of-the-art method on the challenging RGB-D Object Dataset. To some extent, our work also demonstrates the importance of the unary term in the MRF model.

Acknowledgements

This work is supported by the Natural Science Foundation of China (No. 61373076), the National Outstanding Youth Science Foundation of China (No. 61422210), and the Special Fund for Earthquake Research in the Public Interest (No. 201508025).

References (31)

  • H. Wang et al., Discriminative learning with latent variables for cluttered indoor scene understanding, Commun. ACM (2013)
  • T. Guan et al., On-device mobile visual location recognition by integrating vision and inertial sensors, IEEE Trans. Multimed. (2013)
  • T. Guan et al., Efficient BOF generation and compression for on-device mobile visual location recognition, IEEE Multimed. (2014)
  • K. Lai, L. Bo, X. Ren, D. Fox, Detection-based object labeling in 3D scenes, In: ICRA, 2012, pp. ...
  • Y. Wang, R. Ji, S.-F. Chang, Label propagation from ImageNet to 3D point clouds, In: CVPR, 2013, pp. ...
  • H.S. Koppula, A. Anand, T. Joachims, A. Saxena, Semantic labeling of 3D point clouds for indoor scenes, In: NIPS, 2011, ...
  • Y. Gao et al., Less is more: efficient 3D object retrieval with query view selection, IEEE Trans. Multimed. (2011)
  • Y. Gao et al., Camera constraint-free view-based 3D object retrieval, IEEE Trans. Image Process. (2012)
  • D. Munoz, J.A. Bagnell, N. Vandapel, M. Hebert, Contextual classification with functional max-margin Markov networks, ...
  • D. Munoz, N. Vandapel, M. Hebert, Onboard contextual classification of 3-D point clouds with learned high-order Markov ...
  • D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, A. Ng, Discriminative learning of Markov ...
  • B. Douillard et al., Classification and semantic mapping of urban environments, Int. J. Robot. Res. (2011)
  • K. Lai et al., Object recognition in 3D point clouds using web data and domain adaptation, Int. J. Robot. Res. (2010)
  • I. Posner et al., A generative framework for fast urban labeling using spatial and temporal context, Auton. Robots (2009)
  • A. Golovinskiy, V.G. Kim, T. Funkhouser, Shape-based recognition of 3D point clouds in urban environments, In: CVPR, ...

Wei Liu received the B.S. degree in Information and Computing Science in 2009 from Nanchang University, Jiangxi, China, and the M.S. degree in Applied Mathematics in 2012 from Jimei University, Fujian, China. He is currently working towards his Ph.D. at Xiamen University. His research interests include machine learning, hyperspectral remote sensing image analysis, and computer scene understanding.

ShaoZi Li received the B.S. degree from the Computer Science Department, Hunan University, in 1983, the M.S. degree from the Institute of System Engineering, Xi'an Jiaotong University, in 1988, and the Ph.D. degree from the College of Computer Science, National University of Defense Technology, in 2009. He currently serves as Professor and Chair of the School of Information Science and Technology of Xiamen University, Vice Director of the Fujian Key Lab of the Brain-like Intelligent System, and concurrently as Vice Director and General Secretary of the Council of the Fujian Artificial Intelligence Society. His research interests cover Artificial Intelligence and Its Applications, Moving Object Detection and Recognition, Machine Learning, Computer Vision, Natural Language Processing, Multimedia Information Retrieval, Network Multimedia and CSCW Technology, among others.

Donglin Cao received the B.S. degree from Xiamen University, China, in 2000, the M.S. degree from Xiamen University, China, in 2003, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, China, in 2009. His research interests cover Artificial Intelligence and Its Applications, Computer Vision, Natural Language Processing and Machine Learning.

Song-Zhi Su received the B.S. degree in Computer Science and Technology from Shandong University, China, in 2005. He received the M.S. and Ph.D. degrees in Computer Science in 2008 and 2011, respectively, both from Xiamen University, Fujian, China. He joined the faculty of Xiamen University as an assistant professor in 2011. His research interests include pedestrian detection, object detection and recognition, RGB-D based human action recognition, and image/video retrieval.

Rongrong Ji serves as Professor at Xiamen University, where he directs the Intelligent Multimedia Technology Laboratory (http://www.imt.xmu.edu.cn) and serves as Dean Assistant in the School of Information Science and Engineering. He was a Postdoc research fellow in the Department of Electrical Engineering, Columbia University, from 2010 to 2013, working with Professor Shih-Fu Chang. He obtained his Ph.D. degree in Computer Science from Harbin Institute of Technology, graduating with a Best Thesis Award at HIT. He was a visiting student at the University of Texas at San Antonio working with Professor Qi Tian, a research assistant at Peking University working with Professor Wen Gao in 2010, and a research intern at Microsoft Research Asia working with Dr. Xing Xie from 2007 to 2008.

He is the author of over 40 tier-1 journal and conference papers, in venues including IJCV, TIP, TMM, ICCV, CVPR, IJCAI, AAAI, and ACM Multimedia. His research interests include image and video search, content understanding, mobile visual search, and social multimedia analytics. Dr. Ji is the recipient of the Best Paper Award at ACM Multimedia 2011 and a Microsoft Fellowship in 2007. He is a guest editor for IEEE Multimedia Magazine, Neurocomputing, and ACM Multimedia Systems Journal. He has been a special session chair of MMM 2014, VCIP 2013, MMM 2013 and PCM 2012, will be a program chair of ICIMCS 2016, and was Local Arrangement Chair of MMSP 2015. He serves as a reviewer for IEEE TPAMI, IJCV, TIP, TMM, CSVT, TSMC A/B/C and IEEE Signal Processing Magazine, among others. He is on the program committees of over 10 top conferences, including CVPR 2013, ICCV 2013, ECCV 2012, and ACM Multimedia 2010-2013.

1. This work was done when the author was a Ph.D. candidate at Xiamen University.
