
Extracting hierarchical spatial and temporal features for human action recognition


Abstract

Human action recognition is a challenging computer vision task, and many efforts have been made to improve performance. Most previous work has concentrated on hand-crafted features or on spatial-temporal features learned from multiple contiguous frames. In this paper, we present a dual-channel model that decouples spatial and temporal feature extraction. More specifically, we propose to capture complementary static form information from single frames and dynamic motion information from multi-frame differences in two separate channels. In both channels we use two stacked classical subspace networks to learn hierarchical representations, which are subsequently fused for action recognition. Our model is trained and evaluated on three standard benchmarks: the KTH, UCF and Hollywood2 datasets. The experimental results show that our approach achieves performance comparable to state-of-the-art methods. In addition, feature analysis and control experiments are carried out to demonstrate the effectiveness of the proposed approach for feature extraction and thereby action recognition.
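For readers who want a concrete picture of the dual-channel idea, the following Python sketch (ours, not the authors' implementation) builds the two channel inputs and runs a single ISA-style subspace layer with random, untrained filters on synthetic data; the model described in the paper stacks two trained subspace networks per channel before fusion. Function names, patch sizes and filter counts below are illustrative assumptions.

```python
# Minimal sketch of dual-channel, ISA-style feature extraction on synthetic data.
# Spatial channel: single frames (static form). Temporal channel: frame differences
# (dynamic motion). Filter weights are random stand-ins for trained subspace networks.
import numpy as np

rng = np.random.default_rng(0)

def isa_features(patches, num_subspaces=32, subspace_size=4):
    """ISA forward pass: linear filtering, then square-root pooling of squared
    responses within each subspace."""
    dim = patches.shape[1]
    W = rng.standard_normal((num_subspaces * subspace_size, dim))  # untrained filters
    responses = patches @ W.T                                      # (n, K*S)
    responses = responses.reshape(len(patches), num_subspaces, subspace_size)
    return np.sqrt((responses ** 2).sum(axis=2))                   # (n, K)

# Synthetic "video": 10 frames of 16x16 grayscale.
video = rng.random((10, 16, 16))

# Spatial channel input: flattened single frames.
spatial_in = video.reshape(10, -1)

# Temporal channel input: flattened differences of consecutive frames.
temporal_in = np.diff(video, axis=0).reshape(9, -1)

spatial_feat = isa_features(spatial_in)
temporal_feat = isa_features(temporal_in)

# Fuse the two channels by concatenating video-level (mean-pooled) descriptors,
# which would then be fed to a classifier such as a linear SVM.
descriptor = np.concatenate([spatial_feat.mean(axis=0), temporal_feat.mean(axis=0)])
print(descriptor.shape)  # (64,) with the settings above
```

In this sketch the fusion is plain concatenation of mean-pooled channel descriptors; the paper's actual fusion and classification details are given in the full text.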



Acknowledgements

The work was supported by the National Natural Science Foundation of China (Grant No. 91420302), the National Basic Research Program of China (Grant No. 2015CB856004) and the Key Basic Research Program of Shanghai (Grant No. 15JC1400103).

Author information


Corresponding author

Correspondence to Liqing Zhang.


About this article


Cite this article

Zhang, K., Zhang, L. Extracting hierarchical spatial and temporal features for human action recognition. Multimed Tools Appl 77, 16053–16068 (2018). https://doi.org/10.1007/s11042-017-5179-7

