
Extracting hierarchical spatial and temporal features for human action recognition


Abstract

Human action recognition is a challenging computer vision task, and many efforts have been made to improve performance. Most previous work has concentrated on hand-crafted features or on spatial-temporal features learned from multiple contiguous frames. In this paper, we present a dual-channel model that decouples spatial and temporal feature extraction. More specifically, we propose to capture complementary static form information from single frames and dynamic motion information from multi-frame differences in two separate channels. In both channels we use two stacked classical subspace networks to learn hierarchical representations, which are subsequently fused for action recognition. Our model is trained and evaluated on three standard benchmarks: the KTH, UCF and Hollywood2 datasets. The experimental results show that our approach achieves performance comparable to state-of-the-art methods. In addition, feature analysis and control experiments are carried out to demonstrate the effectiveness of the proposed approach for feature extraction and thereby action recognition.
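For readers who want a concrete picture of the dual-channel idea, the following Python sketch (ours, not the authors' implementation) builds the two channel inputs and runs a single ISA-style subspace layer with random, untrained filters on synthetic data; the model described in the paper stacks two trained subspace networks per channel before fusion. Function names, patch sizes and filter counts below are illustrative assumptions.

```python
# Minimal sketch of dual-channel, ISA-style feature extraction on synthetic data.
# Spatial channel: single frames (static form). Temporal channel: frame differences
# (dynamic motion). Filter weights are random stand-ins for trained subspace networks.
import numpy as np

rng = np.random.default_rng(0)

def isa_features(patches, num_subspaces=32, subspace_size=4):
    """ISA forward pass: linear filtering, then square-root pooling of squared
    responses within each subspace."""
    dim = patches.shape[1]
    W = rng.standard_normal((num_subspaces * subspace_size, dim))  # untrained filters
    responses = patches @ W.T                                      # (n, K*S)
    responses = responses.reshape(len(patches), num_subspaces, subspace_size)
    return np.sqrt((responses ** 2).sum(axis=2))                   # (n, K)

# Synthetic "video": 10 frames of 16x16 grayscale.
video = rng.random((10, 16, 16))

# Spatial channel input: flattened single frames.
spatial_in = video.reshape(10, -1)

# Temporal channel input: flattened differences of consecutive frames.
temporal_in = np.diff(video, axis=0).reshape(9, -1)

spatial_feat = isa_features(spatial_in)
temporal_feat = isa_features(temporal_in)

# Fuse the two channels by concatenating video-level (mean-pooled) descriptors,
# which would then be fed to a classifier such as a linear SVM.
descriptor = np.concatenate([spatial_feat.mean(axis=0), temporal_feat.mean(axis=0)])
print(descriptor.shape)  # (64,) with the settings above
```

In this sketch the fusion is plain concatenation of mean-pooled channel descriptors; the paper's actual fusion and classification details are given in the full text.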



Acknowledgements

The work was supported by the National Natural Science Foundation of China (Grant No. 91420302), the National Basic Research Program of China (Grant No. 2015CB856004) and the Key Basic Research Program of Shanghai (Grant No. 15JC1400103).

Author information


Corresponding author

Correspondence to Liqing Zhang.


About this article


Cite this article

Zhang, K., Zhang, L. Extracting hierarchical spatial and temporal features for human action recognition. Multimed Tools Appl 77, 16053–16068 (2018). https://doi.org/10.1007/s11042-017-5179-7

