Detection based object labeling of 3D point cloud for indoor scenes
Introduction
With the popularity of Kinect sensors and advanced 3D reconstruction techniques, we are facing an increasing amount of 3D point clouds. Consequently, the demand for scene understanding is emerging [6], [7]. Understanding 3D scenes is a fundamental problem in perception and robotics, as knowledge of the environment is a preliminary step for subsequent tasks such as route planning, augmented reality and mobile visual search [1], [2].
In the literature, a significant amount of work has been done in semantic labeling for pixels or regions in 2D images. In spite of this, semantic labeling of 3D point clouds remains an open problem. Its solution, however, can bring a breakthrough in a wide variety of computer vision and robotics research, with great potential in human-computer interface, 3D object indexing and retrieval, object manipulation in robotics [4] as well as exciting applications such as self-driving vehicles and semantic-aware augmented reality.
There is a great deal of literature on 3D scene labeling in indoor environments [5], [16]. Many of these methods operate directly on a 3D point cloud, as 3D point clouds contain very important shape and spatial information, which allows modeling of context and spatial relationships between objects. However, one source of information that these methods leave unexplored is the reference images. Existing works [4], [3] have shown that multi-view images can be used to recognize objects in a 3D scene with higher accuracy.
In this paper, we focus on detection-based object labeling in 3D scenes. Specifically, we tackle the problem of object labeling in a 3D scene with the help of object detection results from the 2D reference images covering parts of the scene. Our goal is to transfer such reliable 2D labeling results into 3D to enhance the inference accuracy. It is noted that the proposed approach works in the scenario where a single point cloud is merged from multiple images. First, the point cloud is segmented into homogeneous segments. Then, we utilize object detectors to assign class probabilities to pixels in every RGB-D frame. After that, the class probabilities are projected onto the segments. Finally, we perform inference on an MRF model over the homogeneous segments, in combination with geometry cues, to output the labels.
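The projection step described above can be sketched as averaging the per-pixel detector probabilities over the pixels that project onto each 3D segment. The snippet below is a minimal illustration under our own assumptions; the function name, array layout, and the use of a simple mean are illustrative choices, not the authors' implementation.

```python
import numpy as np

def project_detections_to_segments(pixel_probs, pixel_to_segment, num_segments):
    """Aggregate per-pixel class probabilities into per-segment distributions.

    pixel_probs: (N, C) array of detector class probabilities for N pixels.
    pixel_to_segment: (N,) array giving the segment id each pixel projects to
        (obtained from the known camera pose of each RGB-D frame).
    Returns an (num_segments, C) array of averaged class probabilities.
    """
    num_classes = pixel_probs.shape[1]
    seg_scores = np.zeros((num_segments, num_classes))
    counts = np.zeros(num_segments)
    for probs, seg in zip(pixel_probs, pixel_to_segment):
        seg_scores[seg] += probs
        counts[seg] += 1
    counts[counts == 0] = 1  # leave empty segments as all-zero rows
    return seg_scores / counts[:, None]
```

In practice each segment receives evidence from many frames, so this averaging also smooths out per-frame detector noise.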
The remainder of this paper is organized as follows. Section 2 surveys related work. The details of the proposed method are presented in Section 3. Experimental results and comparisons are provided in Section 4. Finally, we draw conclusions in Section 5.
Section snippets
Related work
With the popularity of Kinect cameras and 3D laser scanners, there has recently been an ever-increasing research focus on semantic labeling of 3D point clouds [8], [9], [5]. The application scenarios are either indoor or outdoor, with corresponding tasks such as labeling tables, beds, desks and computers, or trees, cars, roads and pedestrians. So far, most robotic scene understanding work has focused on 3D outdoor scenes, with related applications such as mapping and autonomous driving [10], [11], [12], [13].
DetSMRF
We now describe the proposed DetSMRF model, as shown in Fig. 1. The 3D point clouds we consider are captured from a set of RGB-D video frames. To reconstruct a 3D scene, RGB-D mapping [26] is employed to globally align and merge each frame with the scene under a rigid transform. The goal of this paper is to label small everyday objects of interest, which may comprise only a small part of the whole 3D point cloud.
DetSMRF is defined over a graph G = (V, E), where V is the set of vertices representing the homogeneous segments
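To make the segment-level MRF concrete, the sketch below runs approximate MAP inference with iterated conditional modes (ICM) over unary costs (e.g. negative log detection probabilities) and a Potts pairwise term weighted per edge (e.g. by geometric similarity between adjacent segments). This is an illustrative stand-in, not the paper's exact energy or inference algorithm; all names and the choice of ICM are our own assumptions.

```python
import numpy as np

def icm_label(unary, edges, edge_weights, n_iters=10):
    """Approximate MAP labeling of a segment-level MRF via ICM.

    unary: (S, C) array of per-segment label costs.
    edges: list of (i, j) index pairs of adjacent segments.
    edge_weights: Potts penalty for each edge (label disagreement cost).
    """
    num_segments, num_classes = unary.shape
    labels = unary.argmin(axis=1)  # initialize with detection-only labels
    neighbors = [[] for _ in range(num_segments)]
    for (i, j), w in zip(edges, edge_weights):
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    for _ in range(n_iters):
        changed = False
        for s in range(num_segments):
            cost = unary[s].copy()
            for t, w in neighbors[s]:
                # Potts term: penalize labels that disagree with neighbor t
                cost += w * (np.arange(num_classes) != labels[t])
            best = cost.argmin()
            if best != labels[s]:
                labels[s] = best
                changed = True
        if not changed:
            break
    return labels
```

ICM is a greedy coordinate-descent scheme; exact or stronger approximate inference (e.g. graph cuts or loopy belief propagation) would typically be used when the energy permits it.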
Experiments
We experimentally evaluate our method primarily on the challenging RGB-D Object Dataset [31], which includes everyday objects captured individually on a turntable. We show that our scheme achieves competitive performance compared to the state-of-the-art method in terms of labeling accuracy. All parameter settings are given in the preceding sections of this paper.
Dataset and Ground Truth Labeling: The RGB-D Object Dataset includes 250,000 segmented RGB-D images of 300 objects in 51 categories.
Conclusion
In this paper, we propose a detection-based scheme for 3D indoor scene labeling. Experimental results show that our approach achieves better accuracy than the state-of-the-art method on the challenging RGB-D Object Dataset. To some extent, our work also demonstrates the importance of the unary term in the MRF model.
Acknowledgements
This work is supported by the Natural Science Foundation of China (No. 61373076), the National Outstanding Youth Science Foundation of China (No. 61422210), and the Special Fund for Earthquake Research in the Public Interest (No. 201508025).
Wei Liu received B.S. Degree in Information and Computing Science in 2009 from Nanchang University, Jiangxi, China, and the M.S. Degree in Applied Mathematics from Jimei University in 2012, Fujian, China. He is currently working towards his Ph.D at Xiamen University. His research interests include machine learning, hyperspectral remote sensing image analysis, and computer scene understanding.
References (31)
- et al., Discriminative learning with latent variables for cluttered indoor scene understanding, Commun. ACM (2013)
- et al., On-device mobile visual location recognition by integrating vision and inertial sensors, IEEE Trans. Multimed. (2013)
- et al., Efficient BOF generation and compression for on-device mobile visual location recognition, IEEE Multimed. (2014)
- K. Lai, L. Bo, X. Ren, D. Fox, Detection-based object labeling in 3D scenes, in: ICRA, 2012, pp....
- Y. Wang, R. Ji, S.-F. Chang, Label propagation from ImageNet to 3D point clouds, in: CVPR, 2013, pp....
- H.S. Koppula, A. Anand, T. Joachims, A. Saxena, Semantic labeling of 3D point clouds for indoor scenes, in: NIPS, 2011,...
- et al., Less is more: efficient 3D object retrieval with query view selection, IEEE Trans. Multimed. (2011)
- et al., Camera constraint-free view-based 3D object retrieval, IEEE Trans. Image Process. (2012)
- D. Munoz, J.A. Bagnell, N. Vandapel, M. Hebert, Contextual classification with functional max-margin Markov networks,...
- D. Munoz, N. Vandapel, M. Hebert, Onboard contextual classification of 3-D point clouds with learned high-order Markov...
- Classification and semantic mapping of urban environments, Int. J. Robot. Res.
- Object recognition in 3D point clouds using web data and domain adaptation, Int. J. Robot. Res.
- A generative framework for fast urban labeling using spatial and temporal context, Auton. Robots
ShaoZi Li received the B.S. degree from the Computer Science Department, Hunan University in 1983, and the M.S. degree from the Institute of System Engineering, Xi׳an Jiaotong University in 1988, and the Ph.D. degree from the College of Computer Science, National University of Defense Technology in 2009. He currently serves as the Professor and Chair of School of Information Science and Technology of Xiamen University, the Vice Director of Fujian Key Lab of the Brain-like Intelligence System, and the Vice Director and General Secretary concurrently of the Council of Fujian Artificial Intelligence Society. His research interests cover Artificial Intelligence and Its Applications, Moving Objects Detection and Recognition, Machine Learning, Computer Vision, Natural Language Processing and Multimedia Information Retrieval, Network Multimedia and CSCW Technology and others.
Donglin Cao received the B.S. degree from Xiamen University, China, in 2000, and the M.S. degree from Xiamen University, China, in 2003, and the Ph.D. degree from Institute of Computing Technology Chinese Academy of Sciences, China, in 2009. His research interests cover Artificial Intelligence and Its Applications, Computer Vision, Nature Language Processing and Machine Learning.
Song-Zhi Su received the B.S. degree in Computer Science and Technology from Shandong University, China, in 2005. He received M.S. and Ph.D degree in Computer Science in 2008 and 2011, both from Xiamen University, Fujian, China. He joined the faculty of Xiamen University as an assistant professor in 2011. His research interests include pedestrian detection, object detection and recognition, RGBD based human action recognition and image/video retrieval.
Rongrong Ji serves as a Professor of Xiamen University, where he directs the Intelligent Multimedia Technology Laboratory (http://www.imt.xmu.edu.cn) and serves as a Dean Assistant in the School of Information Science and Engineering. He was a Postdoc research fellow in the Department of Electrical Engineering, Columbia University from 2010 to 2013, working with Professor Shih-Fu Chang. He obtained his Ph.D. degree in Computer Science from Harbin Institute of Technology, graduating with a Best Thesis Award at HIT. He was a visiting student at the University of Texas at San Antonio working with Professor Qi Tian, a research assistant at Peking University working with Professor Wen Gao in 2010, and a research intern at Microsoft Research Asia, working with Dr. Xing Xie from 2007 to 2008.
He is the author of over 40 tier-1 journal and conference papers, including IJCV, TIP, TMM, ICCV, CVPR, IJCAI, AAAI, and ACM Multimedia. His research interests include image and video search, content understanding, mobile visual search, and social multimedia analytics. Dr. Ji is the recipient of the Best Paper Award at ACM Multimedia 2011 and a Microsoft Fellowship in 2007. He is a guest editor for IEEE Multimedia Magazine, Neurocomputing, and ACM Multimedia Systems Journal. He has been a special session chair of MMM 2014, VCIP 2013, MMM 2013 and PCM 2012, a program chair of ICIMCS 2016, and Local Arrangement Chair of MMSP 2015. He serves as a reviewer for IEEE TPAMI, IJCV, TIP, TMM, CSVT, TSMC and IEEE Signal Processing Magazine, etc. He is on the program committees of over 10 top conferences, including CVPR 2013, ICCV 2013, ECCV 2012, ACM Multimedia 2010-2013, etc.
1. This work was done when the author was a Ph.D. candidate at Xiamen University.