Abstract
Human behavior is an important element of video content, so effective recognition of human behavior in video has attracted extensive attention. To address two shortcomings of existing video-based human behavior recognition methods, namely that key features are not prominent and that accuracy is low, this paper proposes a feature extraction method based on a three-dimensional convolutional neural network fused with channel attention (3DCCA). Mean normalization is applied to preprocess the RGB video frames; a three-dimensional convolutional network (3DCNN) extracts spatiotemporal features from the input clips; channel attention (CA) selects, from all extracted features, those most critical to the current behavior; and a softmax classifier performs the final classification of human behavior in the video. Training results on the public UCF101 and HMDB51 datasets show that, compared with other commonly used human behavior feature extraction and recognition methods, the algorithm makes better use of the original information in the video, extracts more effective features, correctly detects human behaviors and actions, and exhibits stronger recognition ability.
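The pipeline described in the abstract (mean normalization of RGB frames, 3D convolutional feature extraction, channel-attention gating, softmax classification) can be illustrated in miniature. The sketch below, in plain NumPy, shows only the preprocessing step and a squeeze-and-excitation-style channel-attention gate applied to a feature volume; the backbone 3D convolutions are omitted, and all shapes, the reduction ratio `r`, and the weights `w1`/`w2` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def mean_normalize(clip):
    # clip: (T, H, W, C) stack of RGB frames.
    # Subtract the per-channel mean and scale to roughly unit range,
    # as in the paper's mean-normalization preprocessing step.
    mean = clip.mean(axis=(0, 1, 2), keepdims=True)
    return (clip - mean) / 255.0

def channel_attention(features, w1, w2):
    # features: (C, T, H, W) feature volume from a 3D conv backbone.
    # Squeeze: global average pooling over the spatio-temporal dims.
    z = features.mean(axis=(1, 2, 3))                          # (C,)
    # Excitation: bottleneck MLP (ReLU) followed by a sigmoid gate.
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))  # (C,)
    # Reweight each channel by its attention score in (0, 1).
    return features * s[:, None, None, None]

rng = np.random.default_rng(0)

# Preprocessing: a dummy 4-frame, 8x8 RGB clip.
clip = rng.integers(0, 256, size=(4, 8, 8, 3)).astype(np.float64)
norm = mean_normalize(clip)

# Channel attention on a dummy (C=8, T=4, H=7, W=7) feature volume.
C, r = 8, 2  # hypothetical channel count and reduction ratio
feats = rng.standard_normal((C, 4, 7, 7))
w1 = rng.standard_normal((C // r, C)) * 0.1  # bottleneck weights (illustrative)
w2 = rng.standard_normal((C, C // r)) * 0.1
out = channel_attention(feats, w1, w2)
print(out.shape)
```

Because the sigmoid gate lies strictly in (0, 1), the attention step never amplifies a channel; it only suppresses the less informative ones, which is how the CA module emphasizes features critical to the current behavior.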










Acknowledgments
This research work was supported in part by the National Science Foundation of China under Grant 51668043, and Grant 61262016, in part by the CERNET Innovation Project under Grant NGII20160311, and Grant NGII20160112, and in part by the Gansu Science Foundation of China under Grant 18JR3RA156.
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, H., Liu, J. & Wang, W. Research on human behavior recognition in video based on 3DCCA. Multimed Tools Appl 82, 20251–20268 (2023). https://doi.org/10.1007/s11042-023-14355-8