
Human interaction recognition using spatial-temporal salient feature

Published in Multimedia Tools and Applications

Abstract

Depth sensors are widely used today and have had a great impact on object pose estimation, camera tracking, human action analysis, and scene reconstruction. This paper presents a novel method for human interaction recognition based on 3D skeleton data captured by the Kinect sensor, using a hierarchical spatial-temporal saliency-based representation. Hierarchical saliency is conceptualized as Salient Actions at the highest level, determined by the initial movement in an interaction; Salient Points at the middle level, determined by a single time point uniquely identified for all instances of a Salient Action; and Salient Joints at the lowest level, determined by the greatest positional changes of human joints within a Salient Action sequence. Given the interaction saliency at these different levels, several types of features, such as spatial displacement and direction relations, are introduced based on action characteristics. Since few test datasets are publicly accessible, we created a new dataset of eight interaction types, named K3HI, using the Microsoft Kinect. The method was evaluated with a multi-class Support Vector Machine (SVM) classifier. The experimental results demonstrate that the hierarchical saliency-based representation achieves an average recognition accuracy of 90.29%, outperforming methods that use other features.
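To make the pipeline described in the abstract concrete, here is a minimal sketch, not the authors' implementation, of the lowest level of the hierarchy: selecting Salient Joints as the joints with the greatest positional change over a skeleton sequence, building simple spatial-displacement and direction features from them, and training a multi-class SVM. The joint count, the number of salient joints retained, the exact feature construction, and the function names (salient_joints, interaction_features, train_svm) are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

N_JOINTS = 30  # assumed: 15 Kinect skeleton joints per person, two interacting people


def salient_joints(seq, k=5):
    """Indices of the k joints with the largest total positional change.

    seq: array of shape (T, N_JOINTS, 3) with 3D joint positions over T frames.
    """
    per_frame_motion = np.linalg.norm(np.diff(seq, axis=0), axis=2)  # (T-1, N_JOINTS)
    total_motion = per_frame_motion.sum(axis=0)                      # (N_JOINTS,)
    return np.argsort(total_motion)[-k:]


def interaction_features(seq, k=5):
    """Spatial displacement magnitudes and start-to-end directions of salient joints."""
    idx = salient_joints(seq, k)
    start, end = seq[0, idx], seq[-1, idx]               # (k, 3) each
    disp = np.linalg.norm(end - start, axis=1)           # spatial displacement
    direction = (end - start) / (disp[:, None] + 1e-8)   # unit direction relations
    return np.concatenate([disp, direction.ravel()])     # (4k,) feature vector


def train_svm(sequences, labels):
    """Fit a one-vs-rest multi-class SVM on per-sequence feature vectors."""
    X = np.stack([interaction_features(s) for s in sequences])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", decision_function_shape="ovr"))
    clf.fit(X, labels)
    return clf
```

With eight interaction classes, as in K3HI, the fitted classifier's predict method maps each new skeleton sequence's feature vector to one of the eight interaction labels.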




Acknowledgments

We are grateful to the volunteers who helped capture the data. This research is supported by the National Key R&D Program of China (No. 2016YFB0502204), the National Key Technology R&D Program (No. 2015BAK03B04), the Funds for the Central Universities (No. 413000010), the Open Fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University (No. 16(03)), the Guangxi Higher Education Undergraduate Teaching Reform Project, Category A (2016JGA258), the Opening Foundation of the Key Laboratory of Environment Change and Resources Use in Beibu Gulf, Ministry of Education (Guangxi Teachers Education University), and the Guangxi Key Laboratory of Earth Surface Processes and Intelligent Simulation (Guangxi Teachers Education University) (No. GTEU-KLOP-K1704).

Author information


Corresponding author

Correspondence to Shaohua Wang.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Hu, T., Zhu, X., Wang, S. et al. Human interaction recognition using spatial-temporal salient feature. Multimed Tools Appl 78, 28715–28735 (2019). https://doi.org/10.1007/s11042-018-6074-6


