Creating personalized video summaries via semantic event detection

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Video summarization enables fast browsing and efficient indexing of video, which makes it useful in many application areas. Because watching an entire video can be time-consuming, viewers prefer to browse a summary that contains the content they enjoy, which calls for an automated tool that generates personalized video summaries. In this paper, we propose a new event detection-based personalized video summarization framework and deploy it to create film and soccer video summaries. To obtain effective event detection performance, we introduce two transfer learning methods. The first combines a convolutional neural network with a support vector machine (CNNs–SVM). The second uses a fine-tuned summarization network (SumNet) that fuses fine-tuned object and scene networks. The training data consist of two datasets: (1) a 21K set of web images of back hugging, hand shaking, and standing talking, used to detect film events, and (2) a 30K set of web soccer match images of goals, fouls, and yellow cards, used to detect soccer events. Given an original video, we first segment it into shots and then apply the trained model to detect events. Finally, we generate a personalized event-based summary according to the user's specified preferences. We test our framework on several film and soccer videos. Experimental results show that the proposed fine-tuned SumNet achieves the best performance, 96.88% and 98.50%, and is effective for generating personalized video summaries.
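The last stage of the pipeline described in the abstract (label each shot with an event, then keep the shots that match the user's preferences) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Shot` and `Detection` types, the function names, the stub classifier standing in for the trained CNNs–SVM or SumNet model, and the 0.5 confidence threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Shot:
    """A video shot as a frame range (hypothetical type; end_frame is exclusive)."""
    start_frame: int
    end_frame: int


@dataclass(frozen=True)
class Detection:
    """A shot together with its predicted event label and confidence."""
    shot: Shot
    event: str
    confidence: float


# Any callable mapping a shot to (event label, confidence) can play the
# role of the trained event detector here.
Classifier = Callable[[Shot], Tuple[str, float]]


def detect_events(shots: Sequence[Shot], classify: Classifier) -> List[Detection]:
    """Run the event classifier on every shot."""
    detections = []
    for shot in shots:
        event, confidence = classify(shot)
        detections.append(Detection(shot, event, confidence))
    return detections


def personalized_summary(
    detections: Sequence[Detection],
    preferred_events: Set[str],
    min_confidence: float = 0.5,  # assumed threshold, not from the paper
) -> List[Shot]:
    """Keep shots whose event the user asked for, in temporal order."""
    selected = [
        d.shot
        for d in detections
        if d.event in preferred_events and d.confidence >= min_confidence
    ]
    return sorted(selected, key=lambda s: s.start_frame)


# Stub classifier with made-up outputs, used only to exercise the sketch.
_STUB_OUTPUT = {
    0: ("goal", 0.97),
    120: ("foul", 0.81),
    300: ("yellow_card", 0.40),
    450: ("goal", 0.88),
}


def stub_classifier(shot: Shot) -> Tuple[str, float]:
    return _STUB_OUTPUT[shot.start_frame]


shots = [Shot(0, 120), Shot(120, 300), Shot(300, 450), Shot(450, 600)]
summary = personalized_summary(
    detect_events(shots, stub_classifier), {"goal", "yellow_card"}
)
print(summary)
```

With these made-up detections, a viewer who prefers goals and yellow cards gets the two goal shots; the yellow-card shot is dropped because its confidence falls below the assumed threshold. Shot segmentation and the classifier itself are outside the scope of this sketch.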

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61633019) and the Public Projects of Zhejiang Province, China (No. LGF18F030002).

Author information

Corresponding author

Correspondence to Wei Jiang.

About this article

Cite this article

Fei, M., Jiang, W. & Mao, W. Creating personalized video summaries via semantic event detection. J Ambient Intell Human Comput 14, 14931–14942 (2023). https://doi.org/10.1007/s12652-018-0797-0

  • DOI: https://doi.org/10.1007/s12652-018-0797-0