Abstract
To improve the accuracy of human abnormal behavior recognition, a two-stream convolutional neural network model is proposed. The model consists of two streams, VMHI and FRGB. First, motion history images (MHIs) are extracted and fed into a VGG-16 convolutional neural network for training. Then, RGB images are fed into a Faster R-CNN detector trained with Kalman filter-assisted data annotation. Finally, the results of the VMHI and FRGB streams are fused. The algorithm recognizes not only single-person behavior but also two-person interactions, and it improves recognition accuracy for similar actions. Experimental results on the KTH, Weizmann, UT-Interaction, and TenthLab datasets show that the proposed algorithm achieves higher accuracy than previously reported methods.
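The VMHI stream is built on motion history images, i.e., temporal templates in the sense of Bobick and Davis: each pixel stores how recently motion occurred there, so a single grayscale image summarizes a short clip. As a rough illustration only (not the paper's implementation), the sketch below updates an MHI by frame differencing; the duration `tau` and threshold `delta` are illustrative parameters.

```python
import numpy as np

def update_mhi(mhi, frame, prev_frame, tau=30.0, delta=32):
    """One update of a motion history image (temporal template):
    pixels where motion is detected are stamped with the duration tau,
    while all other pixels decay by one per frame toward zero."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    motion = diff >= delta  # binary motion mask from frame differencing
    return np.where(motion, tau, np.maximum(mhi - 1.0, 0.0))

# Toy example: one pixel changes between frames, then the scene is static.
prev = np.zeros((4, 4), dtype=np.uint8)
frame = prev.copy()
frame[1, 1] = 255
mhi = update_mhi(np.zeros((4, 4)), frame, prev)   # pixel (1,1) stamped to tau
mhi = update_mhi(mhi, frame, frame)               # no motion: template decays
print(mhi[1, 1])  # 29.0
```

In a pipeline like the one described above, such templates would then be rendered as grayscale images and passed to the VGG-16 stream for classification.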
References
Poppe, R.: A survey on vision-based human action recognition. Image Vis. Comput. 28(6), 976–990 (2010)
Fujiyoshi, H., Lipton, A.J.: Real-time human motion analysis by image skeletonization. Appl. Comput. Vis. 87, 113–120 (1998)
Yang, X., Tian, Y.L.: Effective 3D action recognition using EigenJoints. J. Vis. Commun. Image Represent. 25(1), 2–11 (2014)
Chaudhry, R., Ravichandran, A., Hager, G.: Histograms of oriented optical flow and Binet–Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 20–25 (2009)
Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2), 249–257 (2006)
Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: IEEE International Conference on Pattern Recognition, pp. 23–26 (2004)
Rapantzikos, K., Avrithis, Y., Kollias, S.: Dense saliency-based spatiotemporal feature points for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 43–48 (2009)
Hu, X., Huang, Y., Duan, Q., et al.: Abnormal event detection in crowded scenes using histogram of oriented contextual gradient descriptor. EURASIP J. Adv. Signal Process. 2018(1), 54 (2018)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems (2012)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Szegedy, C., Liu, W., Jia, Y.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
He, K., Zhang, X., Ren, S.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Ren, S., He, K., Girshick, R.: Faster R-CNN: towards real-time object detection with region proposal networks. In: International Conference on Neural Information Processing Systems (2015)
Redmon, J., Divvala, S., Girshick, R.: You only look once: unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Liu, W., Anguelov, D., Erhan, D.: SSD: single shot multibox detector. In: European Conference on Computer Vision (2016)
Li, C., Wang, P., Wang, S.: Skeleton-based action recognition using LSTM and CNN. In: IEEE International Conference on Multimedia and Expo Workshops (2017)
Donahue, J., Hendricks, L.A., Guadarrama, S.: Long-term recurrent convolutional networks for visual recognition and description. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)
Ji, S., Xu, W., Yang, M.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
Wang, X., Gao, L., Song, J.: Beyond frame-level CNN: saliency-aware 3D CNN with LSTM for video action recognition. IEEE Signal Process. Lett. 99, 1 (2016)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: International Conference on Neural Information Processing Systems (2014)
Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Wang, L., Xiong, Y., Wang, Z.: Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision (2016)
Chen, J., Wu, J., Konrad, J.: Semi-coupled two-stream fusion ConvNets for action recognition at extremely low resolutions. In: IEEE Winter Conference on Applications of Computer Vision (2017)
Wang, X., Gao, L., Wang, P.: Two-stream 3-D convnet fusion for action recognition in videos with arbitrary size and length. IEEE Trans. Multimed. 20, 634–644 (2018)
Zhao, R., Ali, H., Smagt, P.V.D.: Two-stream RNN/CNN for action recognition in 3D videos. In: IEEE International Conference on Intelligent Robots and Systems (2017)
Afrasiabi, M., Khotanlou, H., Mansoorizadeh, M.: DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis. Comput. (2019). https://doi.org/10.1007/s00371-019-01722-6
Imran, J., Raman, B.: Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition. J. Ambient Intell. Hum. Comput. 11, 189–208 (2020)
Yi, Y., Li, A., Zhou, X.F.: Human action recognition based on action relevance weighted encoding. Signal Process. Image Commun. 80, 115640 (2020)
Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23(3), 257–267 (2001)
Acuna, D., Ling, H., Kar, A.: Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
Castrejon, L., Kundu, K., Urtasun, R.: Annotating object instances with a polygon-RNN. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
Siswantoro, J., Prabuwono, A.S., Abdullah, A.: A linear model based on Kalman filter for improving neural network classification performance. Expert Syst. Appl. 49, 112–122 (2016)
Duin, R.P.W.: The combining classifier: to train or not to train. In: International Conference on Pattern Recognition (2002)
The KTH Dataset: http://www.nada.kth.se/cvap/actions/. Accessed on 18 Jan. (2005)
The Weizmann Dataset: http://www.wisdom.weizmann.ac.il/. Accessed on 24 Dec. (2007)
The UT-Interaction Dataset: http://cvrc.ece.utexas.edu/SDHA2010 (2007)
Qian, H., Zhou, J., Mao, Y.: Recognizing human actions from silhouettes described with weighted distance metric and kinematics. Multimed. Tools Appl. 76, 21889–21910 (2017)
Xu, K., Jiang, X., Sun, T.: Two-stream dictionary learning architecture for action recognition. IEEE Trans. Circuits Syst. Video Technol. 27, 567–576 (2017)
Chou, K.P., Prasad, M., Wu, D.: Robust feature-based automated multi-view human action recognition system. IEEE Access 6, 1 (2018)
Ko, K.E., Sim, K.B.: Deep convolutional framework for abnormal activities recognition in a smart surveillance system. Eng. Appl. Artif. Intell. 67, 226–234 (2018)
Wang, J., Zhou, S.C., Xia, L.M.: Human interaction recognition based on sparse representation of feature covariance matrices. J. Central South Univ. 25(2), 304–314 (2018)
Vishwakarma, D.K., Dhiman, C.: A unified model for human activity recognition using spatial distribution of gradients and difference of Gaussian kernel. Vis. Comput. 35, 1595–1613 (2019)
Sahoo, P.S., Ari, S.: On an algorithm for human action recognition. Expert Syst. Appl. 115, 524–534 (2019)
Vishwakarma, D.K.: A twofold transformation model for human action recognition using decisive pose. Cognit. Syst. Res. 61, 1–13 (2020)
Acknowledgements
This work was supported by the Shanghai Natural Science Foundation (No. 17ZR1443500), the National Natural Science Foundation of China (No. 61701296), and the Joint Funds of the National Natural Science Foundation of China (No. U1831133).
Cite this article
Liu, C., Ying, J., Yang, H. et al. Improved human action recognition approach based on two-stream convolutional neural network model. Vis Comput 37, 1327–1341 (2021). https://doi.org/10.1007/s00371-020-01868-8