Multimodal Video Indexing: A Review of the State-of-the-art

Snoek, Cees G.M.; Worring, Marcel

doi:10.1023/B:MTAP.0000046380.27575.a5

Published: January 2005

Volume 25, pages 5–35 (2005)
Multimedia Tools and Applications

Abstract

Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is infeasible for large video collections. In this paper we survey several methods that aim to automate this time- and resource-consuming process. Good reviews of single-modality video indexing have appeared in the literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in a collaborative fashion. Therefore, instead of treating the different information sources and their specific algorithms separately, we focus on the similarities and differences between the modalities. To that end we put forward a unifying multimodal framework, which views a video document from the perspective of its author. This framework forms the guiding principle for identifying index types for which automatic methods are found in the literature. It furthermore forms the basis for categorizing these different methods.
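
The two multimodal strategies named in the abstract, selecting the most appropriate modality versus using the modalities collaboratively, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the modality names (`visual`, `auditory`, `textual`), the idea of a per-modality confidence score in [0, 1] for one candidate index label, the function names, and the choice of a weighted average as the combination rule.

```python
from typing import Dict, Tuple

# Hypothetical per-modality confidence scores for one candidate index label.
Scores = Dict[str, float]


def select_modality(scores: Scores) -> Tuple[str, float]:
    """Selection strategy: keep only the most confident modality."""
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]


def combine_modalities(scores: Scores, weights: Scores) -> float:
    """Collaborative strategy (assumed rule): weighted average of all scores."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


if __name__ == "__main__":
    # Illustrative numbers only, not results from any cited method.
    scores = {"visual": 0.6, "auditory": 0.9, "textual": 0.3}
    weights = {"visual": 1.0, "auditory": 1.0, "textual": 1.0}
    print(select_modality(scores))
    print(combine_modalities(scores, weights))
```

Selection discards the evidence from the weaker modalities, while the weighted average lets every modality contribute; which is preferable depends on how reliable each modality is for the index type at hand.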

Author information

Authors and Affiliations

Intelligent Sensory Information Systems, Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Cees G.M. Snoek & Marcel Worring

About this article

Cite this article

Snoek, C.G., Worring, M. Multimodal Video Indexing: A Review of the State-of-the-art. Multimedia Tools and Applications 25, 5–35 (2005). https://doi.org/10.1023/B:MTAP.0000046380.27575.a5

Issue Date: January 2005
DOI: https://doi.org/10.1023/B:MTAP.0000046380.27575.a5
