Abstract
Current deep learning networks do not fully extract and fuse spatio-temporal information in action recognition tasks, which results in low recognition accuracy. To address this problem, this paper proposes a deep learning network model based on the fusion of spatio-temporal features (FSTFN). Two networks composed of CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) components extract and fuse the temporal and spatial information; multi-segment input is used to process large-scale video frame information, which addresses the problem of long-term dependence and improves prediction accuracy; and an attention mechanism increases the weight of visual subjects in the network. Experiments on the UCF101 (University of Central Florida 101) data set show that the proposed FSTFN reaches a prediction accuracy of 92.7%, 4.7 percentage points higher than that of Two-stream, which verifies the effectiveness of the network model.
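The multi-segment input and two-network fusion described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how frames are sampled or how the two networks are fused, so the uniform per-segment sampling, the weighted averaging of softmax scores, and every name below (`sample_segment_indices`, `fuse_streams`, `consensus`, `spatial_weight`) are assumptions made only for illustration.

```python
import math
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Split a video into equal segments and pick one frame index per segment.

    Assumption: one randomly chosen frame per segment; the paper may sample
    differently.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random(0)
    # Segment i covers the half-open index range [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(spatial_scores, temporal_scores, spatial_weight=0.5):
    """Late fusion: weighted average of the two networks' class probabilities.

    Assumption: a fixed scalar weight; the real fusion rule is not given in
    the abstract.
    """
    p_spatial = softmax(spatial_scores)
    p_temporal = softmax(temporal_scores)
    w = spatial_weight
    return [w * a + (1.0 - w) * b for a, b in zip(p_spatial, p_temporal)]


def consensus(per_segment_probs):
    """Average the fused class probabilities over all segments."""
    n = len(per_segment_probs)
    return [sum(column) / n for column in zip(*per_segment_probs)]


# Hypothetical usage: a 30-frame clip, 3 segments, 2 action classes.
indices = sample_segment_indices(30, 3)
fused = [fuse_streams([2.0, 0.5], [1.5, 0.2]) for _ in indices]
video_probs = consensus(fused)
predicted_class = max(range(len(video_probs)), key=video_probs.__getitem__)
```

Averaging over segments lets every part of a long video contribute to the prediction, which is one common way to handle long-range temporal structure; an attention mechanism would replace the uniform average with learned weights.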
Acknowledgements
This research was financially supported by the Major Scientific Research Project for Universities of Guangdong Province (2018KTSCX288, 2019KZDXM015, 2020ZDZX3058); the Guangdong Provincial Special Funds Project for Discipline Construction (No. 2013WYXM0122); the Guangdong Provincial College Innovation and Entrepreneurship Project (S201913177027, S201913177040, 201813177028, 201813177046); the Scientific Research Project of Shenzhen (JCYJ20170303140803747); and the Key Laboratory of Intelligent Multimedia Technology (201762005).
Cite this article
Yang, G., Zou, Wx. Deep learning network model based on fusion of spatiotemporal features for action recognition. Multimed Tools Appl 81, 9875–9896 (2022). https://doi.org/10.1007/s11042-022-11937-w
DOI: https://doi.org/10.1007/s11042-022-11937-w