This paper presents a new learning algorithm for audiovisual fusion and demonstrates its application to video classification for a film database. The proposed system uses perceptual features to characterize the content of movie clips. These features are extracted from different modalities and fused through a machine learning process. More specifically, to capture spatio-temporal information, adaptive video indexing is adopted to extract visual features, and a statistical model based on a Laplacian mixture is used to extract audio features. These features are fused at the late fusion stage and input to a support vector machine (SVM) to learn semantic concepts from a given video database. In our experiments, the proposed system implementing the SVM-based fusion technique achieves high classification accuracy when applied to a large database of Hollywood movies.
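As a rough illustration of SVM-based late fusion, the sketch below treats each clip as a pair of per-modality scores and trains a small linear SVM on those pairs. Everything specific here is an assumption made for illustration: the two-score input layout, the toy data, the function names `train_linear_svm` and `fuse`, and the subgradient solver, which stands in for whatever SVM formulation and kernel the system actually uses.

```python
# Minimal late-fusion sketch (illustrative assumptions throughout).
# Each row is a hypothetical (visual_score, audio_score) pair for one clip;
# the label is +1 if the clip belongs to the target concept, -1 otherwise.
SCORES = [
    (0.9, 0.8), (0.8, 0.9), (0.7, 0.8), (0.9, 0.7),   # concept present
    (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2),   # concept absent
]
LABELS = [1, 1, 1, 1, -1, -1, -1, -1]


def train_linear_svm(scores, labels, lam=0.01, lr=0.1, iters=2000):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(scores), len(scores[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w = [lam * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(scores, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:              # sample violates the margin
                for j in range(d):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def fuse(w, b, score_pair):
    """Fused decision: sign of the SVM output on the per-modality scores."""
    value = sum(wj * xj for wj, xj in zip(w, score_pair)) + b
    return 1 if value >= 0 else -1


w, b = train_linear_svm(SCORES, LABELS)
predictions = [fuse(w, b, x) for x in SCORES]
```

The point of the design is that the SVM learns how much weight each modality deserves from labeled examples, rather than relying on a hand-tuned weighted sum of the visual and audio scores.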
We describe these concepts in text only to communicate them to readers. Our definition of a semantic concept is based on the perceptual features of the video, not on text.
We have chosen k = 5 for video indexing in our experiments reported in “Section 5”.
The software we used for video segmentation is no longer available. However, a newer product, Movavi SplitMovie, can be found at http://movavi.com/splitmovie.
Cite this article
Muneesawang, P., Guan, L. & Amin, T. A New Learning Algorithm for the Fusion of Adaptive Audio–Visual Features for the Retrieval and Classification of Movie Clips. J Sign Process Syst Sign Image Video Technol 59, 177–188 (2010). https://doi.org/10.1007/s11265-008-0290-7
DOI: https://doi.org/10.1007/s11265-008-0290-7