Multimodal Video Indexing: A Review of the State-of-the-art

Multimedia Tools and Applications

Abstract

Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is infeasible for large video collections. In this paper we survey several methods that aim to automate this time- and resource-consuming process. Good reviews of single-modality video indexing have appeared in the literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in a collaborative fashion. Therefore, instead of treating the different information sources and their specific algorithms separately, we focus on the similarities and differences between the modalities. To that end we put forward a unifying multimodal framework, which views a video document from the perspective of its author. This framework forms the guiding principle for identifying index types for which automatic methods are found in the literature. It furthermore forms the basis for categorizing these different methods.

About this article

Cite this article

Snoek, C.G., Worring, M. Multimodal Video Indexing: A Review of the State-of-the-art. Multimedia Tools and Applications 25, 5–35 (2005). https://doi.org/10.1023/B:MTAP.0000046380.27575.a5

  • DOI: https://doi.org/10.1023/B:MTAP.0000046380.27575.a5
