A study on deep learning spatiotemporal models and feature extraction techniques for video understanding

  • Regular Paper
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

Video understanding requires rich semantic information. Deep learning models have made substantial progress in the image, text, and audio domains, and considerable effort has recently been devoted to designing deep networks for the video domain. We discuss state-of-the-art convolutional neural networks (CNNs) and their pipelines for extracting video features, various fusion strategies, and their performance; we also discuss the limitations of CNNs in capturing long-term motion cues and the use of sequential learning models, such as long short-term memory (LSTM), to overcome them. In addition, we address multi-model approaches for extracting important cues and score fusion techniques in hybrid deep learning frameworks. Finally, we highlight recent trends, future directions, and the substantial challenges that remain in video understanding. The objectives of this survey are to examine the plethora of approaches developed for video understanding problems, to comprehensively study spatiotemporal cues, to explore the models available for exploiting them, and to identify the most promising approaches.
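To make the pipeline described above concrete, here is a minimal sketch (PyTorch, written for this summary rather than taken from any surveyed work) of the hybrid design the abstract outlines: a 2D CNN extracts per-frame spatial features, an LSTM models long-term temporal structure across frames, and class scores from an appearance (RGB) stream and a motion (optical-flow) stream are fused late. The `CNNLSTMStream` module, the toy backbone, the random inputs, and the equal fusion weights are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class CNNLSTMStream(nn.Module):
    """One stream: per-frame 2D CNN features -> LSTM -> class logits."""
    def __init__(self, in_channels: int, num_classes: int, hidden: int = 256):
        super().__init__()
        # Toy per-frame backbone; real pipelines use a pretrained CNN (e.g. ResNet).
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # -> (B*T, 64, 1, 1)
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, C, H, W) -- a batch of B clips, T frames each
        b, t, c, h, w = clip.shape
        feats = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)             # long-term temporal modelling
        return self.head(out[:, -1])          # logits from the final time step

# Appearance stream on RGB frames; motion stream on stacked optical flow.
rgb_stream = CNNLSTMStream(in_channels=3, num_classes=101)   # e.g. 101 action classes
flow_stream = CNNLSTMStream(in_channels=2, num_classes=101)  # (u, v) flow fields

rgb = torch.randn(4, 16, 3, 112, 112)    # dummy batch: 4 clips of 16 RGB frames
flow = torch.randn(4, 16, 2, 112, 112)   # matching flow, 2 channels per frame

# Late (score-level) fusion; the equal 0.5/0.5 weighting is illustrative only.
scores = 0.5 * rgb_stream(rgb).softmax(-1) + 0.5 * flow_stream(flow).softmax(-1)
print(scores.argmax(-1))                 # predicted action class for each clip
```

The surveyed systems vary the fusion point (early, mid, or late) and the temporal model; swapping the LSTM for a 3D convolution, or the fixed weights for a learned fusion layer, fits the same skeleton.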

Acknowledgements

We would like to thank the Editor for helpful suggestions and the anonymous reviewers for their constructive comments.

Author information

Corresponding author

Correspondence to M. Suresha.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Suresha, M., Kuppa, S. & Raghukumar, D.S. A study on deep learning spatiotemporal models and feature extraction techniques for video understanding. Int J Multimed Inf Retr 9, 81–101 (2020). https://doi.org/10.1007/s13735-019-00190-x

