Abstract
A variety of recognition architectures based on deep convolutional neural networks have been devised for labeling videos of human motion with action labels. However, most existing work does not properly handle the temporal dynamics encoded in multiple contiguous frames, which is what distinguishes action recognition from other recognition tasks. This paper develops a temporal extension of convolutional neural networks that exploits motion-dependent features for recognizing human action in video. Our approach differs from other recent attempts in that it uses multiplicative interactions between convolutional outputs to describe motion information across contiguous frames. Interestingly, a representation of image content emerges as a by-product of extracting motion patterns, which lets our model incorporate both cues when analyzing video. Additional theoretical analysis shows that motion- and content-dependent features arise simultaneously from the developed architecture, whereas previous work mostly treats the two separately. Our architecture is trained and evaluated on the standard video action benchmarks KTH and UCF101, where it matches the state of the art and has distinct advantages over previous attempts to use deep convolutional architectures for action recognition.
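The core idea in the abstract, multiplicative interactions between convolutional outputs of contiguous frames, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the filter bank, frame sizes, and the naive valid-convolution helper are illustrative assumptions. The point it demonstrates is that the element-wise product of two frames' filter responses depends on how the frames relate (motion), while each individual response still describes image content.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2-D valid cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def multiplicative_motion_features(frame_t, frame_t1, kernels):
    """Element-wise products of convolutional outputs from two contiguous frames.

    Each product map is a motion-dependent feature: it is large only where the
    same filter responds strongly in both frames, so it encodes the relation
    between the frames rather than either frame alone.
    """
    feats = []
    for k in kernels:
        r_t = conv2d_valid(frame_t, k)    # content response, frame t
        r_t1 = conv2d_valid(frame_t1, k)  # content response, frame t+1
        feats.append(r_t * r_t1)          # multiplicative interaction
    return np.stack(feats)
```

As a sanity check on the sketch, feeding the same frame twice yields feature maps that are squares of the content responses, hence everywhere non-negative; shifting the second frame changes the products wherever motion occurs.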






Liu, C., Xu, W., Wu, Q. et al. Learning motion and content-dependent features with convolutions for action recognition. Multimed Tools Appl 75, 13023–13039 (2016). https://doi.org/10.1007/s11042-015-2550-4