Hierarchical Gaussian descriptor based on local pooling for action recognition

  • Original paper
Machine Vision and Applications

Abstract

In this paper, we propose a new approach to action recognition based on Gaussian descriptors. We first develop a feature representation technique that encodes high-order statistics of local features at two levels, using single Gaussians to capture the distributions involved. Summarizing heterogeneous feature vectors with a single Gaussian can lose information about how those features are distributed. To limit this loss, we use K-means clustering and Sparse Coding to construct sets of feature vectors, and perform the summarization over each set. We then present two action recognition methods based on depth images and pose data. Both methods apply the proposed feature representation technique to obtain discriminative action descriptors. Experimental evaluation on seven benchmark datasets (MSRAction3D, MSRGesture3D, DHA, SKIG, Florence, UTKinect, and HDM05) shows that our methods achieve very promising results on all of them.
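The local pooling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a codebook has already been learned (for example by K-means), uses hard nearest-codeword assignment only (the Sparse Coding variant is omitted), and embeds each Gaussian as a symmetric positive definite (SPD) matrix followed by a matrix logarithm, which is one commonly used construction and may differ from the embedding used in the paper. The function names `gaussian_embedding` and `local_pooling_descriptor` and the regularizer `eps` are hypothetical.

```python
import numpy as np


def gaussian_embedding(features, eps=1e-3):
    """Summarize an (n, d) feature set by a single Gaussian and return a vector.

    The Gaussian N(mu, cov) is embedded as the (d+1) x (d+1) SPD matrix
    [[cov + mu mu^T, mu], [mu^T, 1]], mapped through the matrix logarithm,
    and flattened to its upper triangle. This embedding is an assumption
    made for illustration, not necessarily the one used in the paper.
    """
    n, d = features.shape
    mu = features.mean(axis=0)
    centered = features - mu
    cov = centered.T @ centered / max(n - 1, 1)
    cov += eps * np.eye(d)  # regularize so the matrix stays positive definite

    spd = np.empty((d + 1, d + 1))
    spd[:d, :d] = cov + np.outer(mu, mu)
    spd[:d, d] = mu
    spd[d, :d] = mu
    spd[d, d] = 1.0

    # Matrix logarithm of an SPD matrix through its eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(spd)
    log_spd = (eigvecs * np.log(eigvals)) @ eigvecs.T
    return log_spd[np.triu_indices(d + 1)]


def local_pooling_descriptor(features, codebook):
    """Pool features per codeword: one Gaussian embedding per cluster, concatenated."""
    d = features.shape[1]
    part_dim = (d + 1) * (d + 2) // 2
    # Hard assignment of every feature to its nearest codeword.
    sq_dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)

    parts = []
    for k in range(len(codebook)):
        members = features[labels == k]
        if len(members) == 0:
            parts.append(np.zeros(part_dim))  # empty cluster: zero block
        else:
            parts.append(gaussian_embedding(members))
    return np.concatenate(parts)


# Toy usage with random data standing in for real local features.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(200, 3))
toy_codebook = rng.normal(size=(4, 3))
descriptor = local_pooling_descriptor(toy_features, toy_codebook)
print(descriptor.shape)  # (40,)
```

Pooling within each cluster rather than over all features at once is what keeps dissimilar feature vectors from being averaged into one Gaussian, which is the information loss the abstract refers to.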

Notes

  1. In our experimental settings, joint j has only one or two neighbors.

  2. We fix the codebook size \(K=50\) and evaluate SC-IELLogE-IELLogE on the MSRAction3D dataset with all possible combinations of D, \(D_1\), and \(D_2\), where \(D=1,2,3\), \(D_1=-4,-3,-2,-1,1,2,3,4\), \(D_2=-4,-3,-2,-1,1,2,3,4\).

  3. To select the most appropriate value of K, we evaluate all methods with \(K=50,100,150,200,250\).

  4. Our experiments are conducted with \(K=50,100,150,200,250\).

  5. We evaluate these methods with \(K=50,100,150,200,250\) on the Florence and UTKinect datasets, and with \(K=100,200,300,400,500,600,700\) on the HDM05 dataset.
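The parameter ranges listed in the notes above can be written out as a small search grid. This is only a sketch for enumerating the settings: the dictionary keys and the names `parameter_grid`, `CODEBOOK_SIZES_DEFAULT`, and `CODEBOOK_SIZES_HDM05` are hypothetical, and the parameters D, \(D_1\), and \(D_2\) are treated as opaque values because their meaning is defined in the full text rather than here.

```python
from itertools import product

# Note 2: fixed codebook size and the ranges searched for D, D_1, D_2.
FIXED_K = 50
D_VALUES = [1, 2, 3]
D1_VALUES = [-4, -3, -2, -1, 1, 2, 3, 4]
D2_VALUES = [-4, -3, -2, -1, 1, 2, 3, 4]

# Notes 3 to 5: codebook sizes evaluated on the datasets.
CODEBOOK_SIZES_DEFAULT = list(range(50, 251, 50))   # 50, 100, ..., 250
CODEBOOK_SIZES_HDM05 = list(range(100, 701, 100))   # 100, 200, ..., 700


def parameter_grid():
    """Yield every (D, D_1, D_2) combination from note 2 with K fixed at 50."""
    for d, d1, d2 in product(D_VALUES, D1_VALUES, D2_VALUES):
        yield {"K": FIXED_K, "D": d, "D1": d1, "D2": d2}


grid = list(parameter_grid())
print(len(grid))  # 192 combinations: 3 * 8 * 8
```

Each entry of `grid` would correspond to one evaluation run on MSRAction3D; the codebook-size lists cover the separate sweeps over K described in notes 3 to 5.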

Acknowledgements

Portions of the research in this paper use the DHA video dataset collected by Research Center for Information Technology Innovation (CITI), Academia Sinica.

Author information

Corresponding author

Correspondence to Xuan Son Nguyen.

About this article

Cite this article

Nguyen, X.S., Mouaddib, AI. & Nguyen, T.P. Hierarchical Gaussian descriptor based on local pooling for action recognition. Machine Vision and Applications 30, 321–343 (2019). https://doi.org/10.1007/s00138-018-0989-9

  • DOI: https://doi.org/10.1007/s00138-018-0989-9
