Abstract
We propose a deep-learning-based technique for action classification built on Long Short-Term Memory (LSTM) networks. The proposed scheme first learns spatio-temporal features from the video using an extension of Convolutional Neural Networks (CNNs) to 3D. A Recurrent Neural Network (RNN) is then trained to classify each sequence by considering the temporal evolution of the learned features at each time step. Experimental results on the CMU MoCap, UCF101, and Hollywood2 datasets show the efficacy of the proposed approach. We further extend the framework with an efficient motion feature that enables it to handle significant camera motion. The proposed approach outperforms existing deep models on each dataset.
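To make the two-stage architecture concrete, the following is a minimal sketch (not the authors' code) of the pipeline the abstract describes: a 3D CNN extracts a spatio-temporal feature per short clip, and an LSTM classifies the video from the sequence of clip features. All layer sizes, the clip length, the feature dimension, and the 101-class output (matching UCF101) are assumed values chosen for illustration.

```python
# Illustrative sketch of a 3D-CNN + LSTM action classifier.
# Hyperparameters (channel widths, feat_dim, clip length) are assumptions,
# not the paper's actual configuration.
import torch
import torch.nn as nn

class CNN3D_LSTM(nn.Module):
    def __init__(self, num_classes=101, feat_dim=256, hidden_dim=256):
        super().__init__()
        # 3D convolutions learn spatio-temporal features from short clips.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.AdaptiveAvgPool3d(1),       # -> (B*T, 128, 1, 1, 1)
        )
        self.proj = nn.Linear(128, feat_dim)
        # The LSTM models the temporal evolution of clip-level features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):
        # clips: (B, T, C, D, H, W) -- T short clips per video.
        B, T = clips.shape[:2]
        x = clips.flatten(0, 1)            # (B*T, C, D, H, W)
        f = self.cnn3d(x).flatten(1)       # (B*T, 128)
        f = self.proj(f).view(B, T, -1)    # (B, T, feat_dim)
        out, _ = self.lstm(f)
        return self.fc(out[:, -1])         # classify from the last time step

# Example: 2 videos, 4 clips each, clips of 8 RGB frames at 112x112.
model = CNN3D_LSTM()
logits = model(torch.randn(2, 4, 3, 8, 112, 112))
print(logits.shape)  # torch.Size([2, 101])
```

In this sketch the motion feature mentioned in the abstract (improved dense trajectories, per the title) would be computed separately and fed alongside the CNN features; it is omitted here for brevity.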
Acknowledgements
The authors gratefully acknowledge the financial support of the Science and Engineering Research Board (SERB), Department of Science and Technology (DST), Government of India, under project ECR/2016/000652.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Singh, K.K., Mukherjee, S. (2018). Recognizing Human Activities in Videos Using Improved Dense Trajectories over LSTM. In: Rameshan, R., Arora, C., Dutta Roy, S. (eds.) Computer Vision, Pattern Recognition, Image Processing, and Graphics. NCVPRIPG 2017. Communications in Computer and Information Science, vol. 841. Springer, Singapore. https://doi.org/10.1007/978-981-13-0020-2_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0019-6
Online ISBN: 978-981-13-0020-2
eBook Packages: Computer Science, Computer Science (R0)