Abstract
Human behavior recognition has long been an active research topic in computer vision. In recent years, the wide application of behavior recognition in virtual reality and short video, together with the rapid development of deep learning, has given rise to many deep-learning-based behavior recognition algorithms. Compared with traditional methods, these algorithms offer stronger robustness and higher accuracy. This paper systematically reviews the deep-learning-based behavior recognition algorithms proposed in recent years, focuses on a series of algorithms based on image and skeleton data, analyzes their theory and performance in depth, and finally discusses prospects for future work.
Data Availability
The data and code used to support the findings of this study are available from the corresponding author upon request (001600@nuist.edu.cn).
Acknowledgements
The research in this article is supported by the National Natural Science Foundation of China (No. 61876079) and the key special project of the National Key R&D Program (2018YFC1405703). The financial support of Jiangsu Austin Optronics Technology Co., Ltd. is deeply appreciated. The authors also express their heartfelt thanks to the reviewers and editors who suggested valuable revisions to this article.
Author information
Contributions
All authors drafted the manuscript and read and approved the final version.
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
Cite this article
Hu, K., Jin, J., Zheng, F. et al. Overview of behavior recognition based on deep learning. Artif Intell Rev 56, 1833–1865 (2023). https://doi.org/10.1007/s10462-022-10210-8
DOI: https://doi.org/10.1007/s10462-022-10210-8