Abstract
In the last 15 years, considerable effort has gone into the segmentation of videos into scenes. We give a comprehensive overview of the published approaches and classify them into seven groups based on three basic classes of low-level features used for the segmentation process: (1) visual-based, (2) audio-based, (3) text-based, (4) audio-visual-based, (5) visual-textual-based, (6) audio-textual-based and (7) hybrid approaches. To make video scene detection approaches easier to assess and compare, we categorize the evaluation strategies used, including the size and type of the dataset as well as the evaluation metrics. Furthermore, to help the reader make use of the survey, we list eight possible application scenarios, with a dedicated section on interactive video scene segmentation, and identify the algorithms that can be applied to each of them. Finally, current challenges for scene segmentation algorithms are discussed. The appendix summarizes the most important characteristics of the algorithms presented in this paper in tabular form.
Notes
http://mpeg.chiariglione.org/standards/mpeg-7/mpeg-7.htm (February 1, 2013).
http://trecvid.nist.gov/ (February 1, 2013).
http://www.beeldengeluid.nl/en (February 1, 2013).
http://www.nist.gov/srd/nistsd26.cfm (February 1, 2013).
If an approach has been evaluated on multiple video types, it is counted once for each corresponding genre. The total number of approaches used as the denominator counts such an approach the same number of times, once for each type of video. The percentages in the chart in Fig. 12 therefore sum to 100 %.
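The counting rule in the note on Fig. 12 can be sketched in a few lines of Python. This is a minimal illustration, not the survey's data: the approach names and genre assignments are hypothetical, and `genre_shares` is a helper introduced only for this example. Each approach contributes one count per genre it was evaluated on, and the denominator is the total number of such (approach, genre) pairs, so the shares add up to 100 %.

```python
from collections import Counter


def genre_shares(evaluations):
    """Return the percentage share of each genre.

    `evaluations` maps an approach name to the genres it was evaluated on.
    An approach evaluated on several genres is counted once per genre,
    both in the per-genre counts and in the total.
    """
    counts = Counter(
        genre
        for genres in evaluations.values()
        for genre in set(genres)  # guard against a genre listed twice
    )
    total = sum(counts.values())  # number of (approach, genre) pairs
    return {genre: 100.0 * n / total for genre, n in counts.items()}


# Hypothetical data, for illustration only.
example = {
    "approach_a": ["movie"],
    "approach_b": ["movie", "news"],
    "approach_c": ["news", "sports", "movie"],
}

shares = genre_shares(example)
for genre, share in sorted(shares.items()):
    print(f"{genre}: {share:.1f} %")
```

If each approach were instead counted only once in the denominator (three approaches in this hypothetical example), the shares would add up to 200 %, which is why the total also has to count multi-genre approaches once per genre.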
Acknowledgments
Special thanks to Professor Alan Hanjalic from Delft University of Technology for his valuable thoughts and suggestions on how to structure this survey. This work was supported by Lakeside Labs GmbH, Klagenfurt, Austria and funding from the European Regional Development Fund and the Carinthian Economic Promotion Fund (KWF) under Grant KWF-20214 17097 24774 and Grant KWF-20214 22573 33955.
Additional information
Communicated by P. Pala.
About this article
Cite this article
Del Fabro, M., Böszörmenyi, L. State-of-the-art and future challenges in video scene detection: a survey. Multimedia Systems 19, 427–454 (2013). https://doi.org/10.1007/s00530-013-0306-4
DOI: https://doi.org/10.1007/s00530-013-0306-4