Overview of behavior recognition based on deep learning

Artificial Intelligence Review

Abstract

Human behavior recognition has long been an active research topic in computer vision. With the wide application of behavior recognition in virtual reality and short video in recent years, and with the rapid development of deep learning, many behavior recognition algorithms based on deep learning have emerged. Compared with traditional methods, these algorithms offer stronger robustness and higher accuracy. This paper systematically reviews the deep-learning-based behavior recognition algorithms proposed in recent years, focuses on algorithms based on image and skeleton data, analyzes their theory and performance in depth, and finally discusses prospects for future work.

Data Availability

The data and code used to support the findings of this study are available from the corresponding author upon request (001600@nuist.edu.cn).

Acknowledgements

This research is supported by the National Natural Science Foundation of China (No. 61876079) and the key special project of the National Key R&D Program (2018YFC1405703). The financial support of Jiangsu Austin Optronics Technology Co., Ltd. is deeply appreciated. The authors also express their heartfelt thanks to the reviewers and editors who provided valuable revisions to this article.

Author information

Contributions

All authors drafted the manuscript, read, and approved the final manuscript.

Corresponding author

Correspondence to Kai Hu.

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hu, K., Jin, J., Zheng, F. et al. Overview of behavior recognition based on deep learning. Artif Intell Rev 56, 1833–1865 (2023). https://doi.org/10.1007/s10462-022-10210-8

  • DOI: https://doi.org/10.1007/s10462-022-10210-8
