Abstract
A variety of recognition architectures based on deep convolutional neural networks have been devised for labeling videos of human motion with action labels. However, most existing work does not properly handle the temporal dynamics encoded in multiple contiguous frames, which is what distinguishes action recognition from other recognition tasks. This paper develops a temporal extension of convolutional neural networks that exploits motion-dependent features for recognizing human action in video. Our approach differs from other recent attempts in that it uses multiplicative interactions between convolutional outputs to describe motion information across contiguous frames. Interestingly, a representation of image content emerges as a by-product of extracting motion patterns, which lets our model incorporate both cues when analyzing video. Additional theoretical analysis shows that motion- and content-dependent features arise simultaneously from the developed architecture, whereas previous work mostly treats the two separately. Our architecture is trained and evaluated on the standard video action benchmarks KTH and UCF101, where it matches the state of the art and has distinct advantages over previous attempts to use deep convolutional architectures for action recognition.
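The core idea in the abstract, multiplicative interactions between convolutional outputs of contiguous frames, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the filter bank, frame sizes, and the naive valid-convolution helper are illustrative assumptions. The point it demonstrates is that the element-wise product of two frames' filter responses depends on how the frames relate (motion), while each individual response still describes image content.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2-D valid cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def multiplicative_motion_features(frame_t, frame_t1, kernels):
    """Element-wise products of convolutional outputs from two contiguous frames.

    Each product map is a motion-dependent feature: it is large only where the
    same filter responds strongly in both frames, so it encodes the relation
    between the frames rather than either frame alone.
    """
    feats = []
    for k in kernels:
        r_t = conv2d_valid(frame_t, k)    # content response, frame t
        r_t1 = conv2d_valid(frame_t1, k)  # content response, frame t+1
        feats.append(r_t * r_t1)          # multiplicative interaction
    return np.stack(feats)
```

As a sanity check on the sketch, feeding the same frame twice yields feature maps that are squares of the content responses, hence everywhere non-negative; shifting the second frame changes the products wherever motion occurs.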






Liu, C., Xu, W., Wu, Q. et al. Learning motion and content-dependent features with convolutions for action recognition. Multimed Tools Appl 75, 13023–13039 (2016). https://doi.org/10.1007/s11042-015-2550-4