RGB-D Hand-Held Object Recognition Based on Heterogeneous Feature Fusion

Lv, Xiong; Jiang, Shu-Qiang; Herranz, Luis; Wang, Shuang

doi:10.1007/s11390-015-1527-0

Regular Paper
Published: 13 March 2015

Volume 30, pages 340–352 (2015)

Journal of Computer Science and Technology

Abstract

Object recognition has many applications in human-machine interaction and multimedia retrieval. However, due to large intra-class variability and inter-class similarity, accurate recognition relying only on RGB data is still a big challenge. Recently, with the emergence of inexpensive RGB-D devices, this challenge can be better addressed by leveraging additional depth information. A very special yet important case of object recognition is hand-held object recognition, as manipulating objects with hands is common and intuitive in human-human and human-machine interactions. In this paper, we study this problem and introduce an effective framework to address it. This framework first detects and segments the hand-held object by exploiting skeleton information combined with depth information. In the object recognition stage, this work exploits heterogeneous features extracted from different modalities and fuses them to improve the recognition accuracy. In particular, we incorporate handcrafted and deep learned features and study several multi-step fusion variants. Experimental evaluations validate the effectiveness of the proposed method.
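The abstract names feature fusion across modalities but does not spell out the fusion variants. As a rough illustration only, the sketch below contrasts two generic strategies: feature-level (early) fusion by normalizing and concatenating per-modality descriptors, and decision-level (late) fusion by averaging per-modality class scores. The function names, the L2 normalization, and the weighting scheme are assumptions made for this example, not the multi-step fusion method evaluated in the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so no modality dominates by magnitude."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def early_fusion(feature_blocks):
    """Feature-level fusion (assumed variant): normalize each block, then
    concatenate the blocks column-wise so each sample gets one long vector."""
    return np.hstack([l2_normalize(block) for block in feature_blocks])

def late_fusion(score_blocks, weights=None):
    """Decision-level fusion (assumed variant): weighted average of the
    per-modality class scores, followed by an argmax over classes."""
    scores = np.stack([np.asarray(s, dtype=float) for s in score_blocks])
    if weights is None:
        # Hypothetical default: every modality counts equally.
        weights = np.full(len(score_blocks), 1.0 / len(score_blocks))
    weights = np.asarray(weights, dtype=float)
    fused = np.tensordot(weights, scores, axes=1)  # shape: (samples, classes)
    return fused.argmax(axis=1), fused

# Toy data: one sample, a 2-D "RGB" descriptor and a 3-D "depth" descriptor.
rgb_feat = np.array([[3.0, 4.0]])
depth_feat = np.array([[0.0, 2.0, 0.0]])
fused_feat = early_fusion([rgb_feat, depth_feat])

# Toy per-modality class scores for the same sample over two classes.
rgb_scores = np.array([[0.9, 0.1]])
depth_scores = np.array([[0.2, 0.8]])
labels_equal, _ = late_fusion([rgb_scores, depth_scores])
labels_depth, _ = late_fusion([rgb_scores, depth_scores], weights=[0.2, 0.8])
```

In early fusion a single classifier would be trained on `fused_feat`; in late fusion each modality keeps its own classifier and only the scores are combined, which is why changing the weights can flip the predicted label.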

Author information

Authors and Affiliations

Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Xiong Lv, Shu-Qiang Jiang, Luis Herranz & Shuang Wang

Corresponding author

Correspondence to Shu-Qiang Jiang.

Additional information

Special Section on Object Recognition

This work was supported in part by the National Basic Research 973 Program of China under Grant No. 2012CB316400, the National Natural Science Foundation of China under Grant Nos. 61322212 and 61450110446, the National High Technology Research and Development 863 Program of China under Grant No. 2014AA015202, and the Chinese Academy of Sciences Fellowships for Young International Scientists under Grant No. 2011Y1GB05. This work was also funded by the Lenovo Outstanding Young Scientists Program (LOYS).

About this article

Cite this article

Lv, X., Jiang, SQ., Herranz, L. et al. RGB-D Hand-Held Object Recognition Based on Heterogeneous Feature Fusion. J. Comput. Sci. Technol. 30, 340–352 (2015). https://doi.org/10.1007/s11390-015-1527-0

Received: 29 December 2014
Revised: 22 February 2015
Published: 13 March 2015
Issue Date: March 2015
DOI: https://doi.org/10.1007/s11390-015-1527-0
