Abstract
Hand-held object recognition is an important research topic in image understanding and plays an essential role in human-machine interaction. With readily available RGB-D devices, depth information greatly improves object segmentation and provides an additional information channel. However, how to extract a representative and discriminative feature from the object region and efficiently exploit the depth information is crucial for improving hand-held object recognition accuracy and, ultimately, the human-machine interaction experience. In this paper, we focus on a specific but important problem, RGB-D hand-held object recognition, and propose a hierarchical feature learning framework for this task. First, our framework learns modality-specific features from RGB and depth images using CNN architectures with different network depths and learning strategies. Second, a high-level feature learning network is applied to obtain a comprehensive feature representation. Unlike previous work on feature learning and representation, the hierarchical learning method fully exploits the characteristics of the different modalities and efficiently fuses them in a unified framework. Experimental results on the HOD dataset demonstrate the effectiveness of the proposed method.
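
To make the two-stage design concrete, the following is a minimal sketch of the idea in PyTorch: two modality-specific CNN branches (here given different numbers of convolutional stages for RGB and depth) produce per-modality features, and a small high-level network learns the fused representation. The branch depths, feature dimension, class count, and module names (`ModalityBranch`, `HierarchicalRGBDNet`) are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch (not the authors' released code) of the hierarchical
# RGB-D feature learning idea described in the abstract.
# Branch depths, feature dimensions and the class count are illustrative.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """One conv + ReLU + pooling stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class ModalityBranch(nn.Module):
    """Modality-specific CNN; the RGB and depth branches may use
    different depths (numbers of conv stages), as the abstract suggests."""

    def __init__(self, in_channels, num_stages, feat_dim=256):
        super().__init__()
        stages, ch = [], in_channels
        for i in range(num_stages):
            out_ch = 32 * (2 ** i)
            stages.append(conv_block(ch, out_ch))
            ch = out_ch
        self.features = nn.Sequential(*stages)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(ch, feat_dim)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.fc(x)


class HierarchicalRGBDNet(nn.Module):
    """Fuse modality-specific features with a high-level learning network."""

    def __init__(self, num_classes=10, feat_dim=256):
        super().__init__()
        self.rgb_branch = ModalityBranch(3, num_stages=4, feat_dim=feat_dim)
        self.depth_branch = ModalityBranch(1, num_stages=3, feat_dim=feat_dim)
        self.fusion = nn.Sequential(           # high-level feature learning
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_branch(rgb), self.depth_branch(depth)], dim=1)
        return self.fusion(f)


if __name__ == "__main__":
    net = HierarchicalRGBDNet(num_classes=10)
    rgb = torch.randn(2, 3, 128, 128)     # batch of RGB crops of the object region
    depth = torch.randn(2, 1, 128, 128)   # corresponding depth maps
    print(net(rgb, depth).shape)          # torch.Size([2, 10])
```

Keeping separate branches with different depths mirrors the abstract's point that RGB and depth call for modality-specific networks and learning strategies; the sketch only covers the forward architecture, not the modality-specific training procedure.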





Acknowledgments
This work was supported in part by the National Basic Research 973 Program of China under Grant No. 2012CB316400, the National Natural Science Foundation of China under Grant Nos. 61532018 and 61322212, and the National High Technology Research and Development 863 Program of China under Grant No. 2014AA015202. This work was also funded by the Lenovo Outstanding Young Scientists Program (LOYS).
Cite this article
Lv, X., Liu, X., Li, X. et al. Modality-specific and hierarchical feature learning for RGB-D hand-held object recognition. Multimed Tools Appl 76, 4273–4290 (2017). https://doi.org/10.1007/s11042-016-3375-5