Data-driven enabled approaches for criteria-based video summarization: a comprehensive survey, taxonomy, and future directions

Multimedia Tools and Applications

Abstract

The exponential growth in the use of computing technologies across applications has led to the creation of a huge amount of multimedia information such as video, audio, and text. The enormous amount of video data generated over the past years necessitates video summarization techniques, which have become an emerging field of research. These techniques may facilitate quick browsing, indexing, and faster sharing of content among various sources. Video summarization is a popular way to generate a short summary of a longer video, and the approaches may be broadly classified into handcrafted (using feature descriptors) and deep learning (DL) based algorithms. In this paper, we present a comprehensive review of state-of-the-art (SOTA) techniques for video summarization, from traditional to modern data-driven approaches. In addition, we propose a taxonomy for classifying video summarization methods based on a number of criteria. We also present an analysis of evaluation protocols for these approaches using benchmark datasets and performance metrics. We identify and list research challenges specific to each sub-category of video summarization. Our review indicates that modern deep learning-based approaches outperform traditional methods in terms of accuracy, at the cost of additional training overhead. Furthermore, most handcrafted approaches offer limited performance in dynamic video scenarios and suffer from several inconsistencies, such as scaling or rotational variations under different illumination conditions. Moreover, our analysis finds that multi-criteria-based video summarization is an area that requires further exploration by the research community. This survey may serve as a reference for new researchers carrying out investigations in this active field of computer vision.

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Author information

Corresponding author

Correspondence to Ambreen Sabha.

Ethics declarations

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Table 9 The symbols used in the overall manuscript with their meanings

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sabha, A., Selwal, A. Data-driven enabled approaches for criteria-based video summarization: a comprehensive survey, taxonomy, and future directions. Multimed Tools Appl 82, 32635–32709 (2023). https://doi.org/10.1007/s11042-023-14925-w

  • DOI: https://doi.org/10.1007/s11042-023-14925-w
