Supervised spatio-temporal kernel descriptor for human action recognition from RGB-depth videos

Abstract

Human action recognition is one of the most challenging tasks in computer vision, and the recent development of depth sensors has created new opportunities in this field of research. In this paper, a novel supervised spatio-temporal kernel descriptor (SSTKDes) is proposed for recognizing human actions from RGB-depth videos, providing a discriminative and compact feature representation of actions. To enhance the descriptive and discriminative ability of the descriptor, the extracted primary kernel-based features are transformed into a new space by a supervised training strategy, namely large margin nearest neighbor (LMNN) metric learning. LMNN greatly reduces the error of a nearest-neighbor classifier by minimizing intra-class variations and maximizing inter-class distances. Subsequently, the efficient match kernel (EMK) is used to abstract mid-level kernel features for more efficient classification. The proposed approach is evaluated on five public benchmark datasets, and the experimental results demonstrate that it outperforms state-of-the-art methods.
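
Although the full pipeline is not reproduced here, the sequence described above (supervised LMNN transformation, EMK abstraction, SVM classification) can be illustrated with off-the-shelf components. The sketch below is an assumption-laden approximation, not the authors' implementation: the local descriptors are random placeholders standing in for the primary spatio-temporal kernel descriptors, LMNN comes from the `metric-learn` package, and the EMK step is approximated by random Fourier features (`RBFSampler`) followed by average pooling.

```python
# Minimal sketch (not the authors' code) of an SSTKDes-style pipeline:
# (1) supervised LMNN transform, (2) EMK-like pooling, (3) SVM classification.
# All data, dimensions, and hyper-parameters below are illustrative placeholders.
import numpy as np
from metric_learn import LMNN                         # large margin nearest neighbor
from sklearn.kernel_approximation import RBFSampler   # random-feature stand-in for EMK
from sklearn.svm import SVC                           # LIBSVM-backed SVM

# Toy data: 20 videos, each with 30 local spatio-temporal descriptors of size 64.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(30, 64)) for _ in range(20)]
labels = np.arange(20) % 4                            # 4 hypothetical action classes

# 1) Supervised transform: LMNN pulls same-class samples together and pushes
#    differently-labelled "impostors" beyond a margin. Here it is trained on
#    patch-level features labelled with their video's class (a simplification).
X_patches = np.vstack(videos)                         # shape (600, 64)
y_patches = np.repeat(labels, 30)                     # shape (600,)
lmnn = LMNN()                                         # defaults; tune neighbours/iterations as needed
lmnn.fit(X_patches, y_patches)

# 2) EMK-like abstraction: map each transformed local feature into a
#    finite-dimensional space approximating a Gaussian match kernel,
#    then average-pool over the video to obtain one mid-level feature per clip.
fmap = RBFSampler(gamma=0.1, n_components=256, random_state=0)
fmap.fit(lmnn.transform(X_patches))
video_feats = np.array([fmap.transform(lmnn.transform(v)).mean(axis=0) for v in videos])

# 3) Final classification with an SVM.
clf = SVC(kernel="linear").fit(video_feats, labels)
print("training accuracy:", clf.score(video_feats, labels))
```

In the actual method, the primary features are kernel descriptors computed from the RGB-depth data rather than random vectors; the random inputs here only demonstrate how the three stages fit together.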

Notes

  1. https://www.csie.ntu.edu.tw/~cjlin/libsvm/
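
The footnote above points to LIBSVM. As a hypothetical illustration of how such a classifier could be applied to pooled video-level features, the sketch below uses LIBSVM's official Python bindings (the `libsvm-official` package); the features, labels, and parameters are placeholders, not the authors' setup.

```python
# Hedged sketch: calling the footnoted LIBSVM library from Python on
# placeholder video-level features (not the authors' training script).
import numpy as np
from libsvm.svmutil import svm_train, svm_predict    # pip install libsvm-official

rng = np.random.default_rng(1)
train_x = rng.normal(size=(40, 256)).tolist()        # 40 videos, 256-d pooled features (toy)
train_y = (np.arange(40) % 5).tolist()               # 5 hypothetical action classes

model = svm_train(train_y, train_x, '-t 0 -c 1 -q')  # '-t 0': linear kernel, '-c 1': cost
pred_labels, accuracy, _ = svm_predict(train_y, train_x, model)
```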

Author information

Corresponding author

Correspondence to Shohreh Kasaei.

About this article

Cite this article

Asadi-Aghbolaghi, M., Kasaei, S. Supervised spatio-temporal kernel descriptor for human action recognition from RGB-depth videos. Multimed Tools Appl 77, 14115–14135 (2018). https://doi.org/10.1007/s11042-017-5017-y
