Abstract
Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is infeasible for large video collections. In this paper we survey several methods that aim to automate this time- and resource-consuming process. Good reviews of single-modality video indexing have appeared in the literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in a collaborative fashion. Therefore, instead of treating the different information sources and their specific algorithms separately, we focus on the similarities and differences between the modalities. To that end we put forward a unifying, multimodal framework that views a video document from the perspective of its author. This framework forms the guiding principle for identifying index types, for which automatic methods are found in the literature. It furthermore forms the basis for categorizing these different methods.
Cite this article
Snoek, C.G., Worring, M. Multimodal Video Indexing: A Review of the State-of-the-art. Multimedia Tools and Applications 25, 5–35 (2005). https://doi.org/10.1023/B:MTAP.0000046380.27575.a5
DOI: https://doi.org/10.1023/B:MTAP.0000046380.27575.a5