A New Learning Algorithm for the Fusion of Adaptive Audio–Visual Features for the Retrieval and Classification of Movie Clips

Abstract

This paper presents a new learning algorithm for audiovisual fusion and demonstrates its application to video classification for film databases. The proposed system uses perceptual features to characterize the content of movie clips. These features are extracted from different modalities and fused through a machine learning process. More specifically, to capture spatio-temporal information, an adaptive video indexing scheme is adopted to extract visual features, and a statistical model based on a Laplacian mixture is used to extract audio features. These features are combined at the late fusion stage and input to a support vector machine (SVM) to learn semantic concepts from a given video database. Our experimental results show that the proposed system, implementing the SVM-based fusion technique, achieves high classification accuracy when applied to a large database of Hollywood movies.
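
The two modality-specific steps named above can be sketched concretely. For the audio side, below is a minimal sketch of fitting a two-component zero-mean Laplacian mixture by EM and using its parameters as a compact clip-level descriptor; the specific formulation (zero-mean components, wavelet-coefficient input, two components) and all parameter choices are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fit_laplacian_mixture(x, n_iter=50):
    """EM for a 2-component zero-mean Laplacian mixture
    p(x) = sum_m alpha_m * exp(-|x| / b_m) / (2 * b_m).
    Returns (alpha, b); a vector such as (alpha[0], b[0], b[1]) could
    serve as a clip-level audio feature (an assumption, for illustration)."""
    alpha = np.array([0.5, 0.5])
    m = np.mean(np.abs(x))
    b = np.array([0.5 * m, 2.0 * m])          # one narrow, one broad component
    for _ in range(n_iter):
        # E-step: responsibility of each component for each coefficient.
        dens = alpha / (2 * b) * np.exp(-np.abs(x)[:, None] / b)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for the weights and scale parameters.
        alpha = r.mean(axis=0)
        b = (r * np.abs(x)[:, None]).sum(axis=0) / r.sum(axis=0)
    return alpha, b

coeffs = np.random.default_rng(2).laplace(scale=1.5, size=4000)  # stand-in for wavelet coefficients
alpha, b = fit_laplacian_mixture(coeffs)
print(alpha, b)
```

For the fusion step, the sketch below concatenates each clip's visual and audio descriptors (late fusion) and trains an RBF-kernel SVM on the fused vectors, here with scikit-learn and synthetic stand-in features. The feature dimensions, kernel, and regularization constant are assumptions; only the concatenate-then-classify structure follows the description above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse_features(visual, audio):
    """Late fusion: concatenate the per-clip visual and audio descriptors."""
    return np.concatenate([visual, audio])

rng = np.random.default_rng(0)
n_clips, d_visual, d_audio = 200, 80, 6
X_visual = rng.normal(size=(n_clips, d_visual))  # stand-in for video-indexing features
X_audio = rng.normal(size=(n_clips, d_audio))    # stand-in for Laplacian-mixture parameters
y = rng.integers(0, 2, size=n_clips)             # binary semantic-concept labels

X = np.stack([fuse_features(v, a) for v, a in zip(X_visual, X_audio)])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```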

Notes

  1. Here we describe these concepts in text for the reader's benefit. However, our definition of a semantic concept is based on the perceptual features of the video, not on text.

  2. We chose k = 5 for video indexing in the experiments reported in Section 5; a toy illustration of a k = 5 clip descriptor follows these notes.

  3. The software we used for video segmentation is no longer available. However, a newer product, Movavi SplitMovie, may be found at: http://movavi.com/splitmovie.

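As a toy illustration of the clip-level visual descriptor referenced in note 2, the sketch below clusters per-frame features into k = 5 groups and concatenates the sorted centroids into a fixed-length vector. Plain k-means and the 16-bin histogram input are stand-ins chosen for illustration; the paper's adaptive video indexing is an adaptive scheme, not k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def clip_descriptor(frame_features, k=5):
    """frame_features: (num_frames, d) array of per-frame features
    (e.g., color histograms). Returns a fixed-length (k * d,) vector."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(frame_features)
    # Order centroids by cluster population so the layout is stable across clips.
    order = np.argsort(-np.bincount(km.labels_, minlength=k))
    return km.cluster_centers_[order].ravel()

frames = np.random.default_rng(1).random((120, 16))  # 120 frames, 16-bin histograms
print(clip_descriptor(frames).shape)                 # (80,)
```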


Author information

Correspondence to Paisarn Muneesawang.


Cite this article

Muneesawang, P., Guan, L. & Amin, T. A New Learning Algorithm for the Fusion of Adaptive Audio–Visual Features for the Retrieval and Classification of Movie Clips. J Sign Process Syst Sign Image Video Technol 59, 177–188 (2010). https://doi.org/10.1007/s11265-008-0290-7
