Abstract
Depth sensors are now widely used and have had great impact on object pose estimation, camera tracking, human action recognition, and scene reconstruction. This paper presents a novel method for human interaction recognition based on 3D skeleton data captured by a Kinect sensor, using a hierarchical spatial-temporal saliency-based representation. Hierarchical saliency comprises Salient Actions at the highest level, determined by the initial movement in an interaction; Salient Points at the middle level, determined by a single time point uniquely identified for all instances of a Salient Action; and Salient Joints at the lowest level, determined by the greatest positional changes of human joints within a Salient Action sequence. Given the interaction saliency at these levels, several types of features, such as spatial displacement and direction relations, are introduced based on action characteristics. Since there are few publicly accessible test datasets, we created a new dataset of eight interaction types, named K3HI, using the Microsoft Kinect. The method was evaluated with a multi-class Support Vector Machine (SVM) classifier. Experimental results demonstrate that the hierarchical saliency-based representation achieves an average recognition accuracy of 90.29%, outperforming methods using other features.
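The abstract does not give the paper's exact feature definitions, so the following is only a minimal sketch of the lowest saliency level: selecting Salient Joints by total positional change over a skeleton sequence and feeding start-to-end displacement features to a multi-class SVM. The function names, the top-k selection, the joint and frame counts, and the synthetic stand-in data are all illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.svm import SVC

    def salient_joints(sequence, k=3):
        # sequence: (T, J, 3) array -- T frames, J joints, xyz coordinates.
        # Frame-to-frame displacement magnitude per joint, summed over time.
        motion = np.linalg.norm(np.diff(sequence, axis=0), axis=2).sum(axis=0)
        # Indices of the k joints with the greatest total positional change.
        return np.argsort(motion)[::-1][:k]

    def displacement_features(sequence, joints):
        # Start-to-end displacement vectors of the salient joints, flattened.
        return (sequence[-1, joints] - sequence[0, joints]).ravel()

    # Toy data standing in for Kinect skeleton clips: 8 interaction classes
    # (as in K3HI), 10 clips each, 30 frames, 15 joints.
    rng = np.random.default_rng(0)
    X, y = [], []
    for label in range(8):
        for _ in range(10):
            seq = rng.normal(size=(30, 15, 3))
            X.append(displacement_features(seq, salient_joints(seq)))
            y.append(label)

    clf = SVC(kernel="rbf", decision_function_shape="ovr").fit(X, y)
    print(clf.score(X, y))  # training accuracy on the toy data

On real data, each clip would be a tracked skeleton sequence and the displacement features would be combined with the other feature types the paper names, such as direction relations.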
Acknowledgments
We are grateful to the volunteers for helping capture the data. This research is supported by the National Key R&D Program of China (No. 2016YFB0502204), the National Key Technology R&D Program (No. 2015BAK03B04), the Funds for the Central Universities (No. 413000010), the Open Fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University (No. 16(03)), the Guangxi Higher Education Undergraduate Teaching Reform Project, Category A (No. 2016JGA258), and the Opening Foundation of the Key Laboratory of Environment Change and Resources Use in Beibu Gulf, Ministry of Education (Guangxi Teachers Education University) and the Guangxi Key Laboratory of Earth Surface Processes and Intelligent Simulation (Guangxi Teachers Education University) (No. GTEU-KLOP-K1704).
Cite this article
Hu, T., Zhu, X., Wang, S. et al. Human interaction recognition using spatial-temporal salient feature. Multimed Tools Appl 78, 28715–28735 (2019). https://doi.org/10.1007/s11042-018-6074-6