A New Learning Algorithm for the Fusion of Adaptive Audio–Visual Features for the Retrieval and Classification of Movie Clips

Abstract

This paper presents a new learning algorithm for audiovisual fusion and demonstrates its application to video classification for film databases. The proposed system uses perceptual features to characterize the content of movie clips. These features are extracted from different modalities and fused through a machine learning process. More specifically, to capture spatio-temporal information, an adaptive video indexing scheme is adopted to extract visual features, and a statistical model based on a Laplacian mixture is used to extract audio features. These features are combined at the late fusion stage and input to a support vector machine (SVM) to learn semantic concepts from a given video database. Our experimental results show that the proposed system, implementing the SVM-based fusion technique, achieves high classification accuracy when applied to a large database of Hollywood movies.
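
The two modality-specific steps named above can be sketched concretely. For the audio side, below is a minimal sketch of fitting a two-component zero-mean Laplacian mixture by EM and using its parameters as a compact clip-level descriptor; the specific formulation (zero-mean components, wavelet-coefficient input, two components) and all parameter choices are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fit_laplacian_mixture(x, n_iter=50):
    """EM for a 2-component zero-mean Laplacian mixture
    p(x) = sum_m alpha_m * exp(-|x| / b_m) / (2 * b_m).
    Returns (alpha, b); a vector such as (alpha[0], b[0], b[1]) could
    serve as a clip-level audio feature (an assumption, for illustration)."""
    alpha = np.array([0.5, 0.5])
    m = np.mean(np.abs(x))
    b = np.array([0.5 * m, 2.0 * m])          # one narrow, one broad component
    for _ in range(n_iter):
        # E-step: responsibility of each component for each coefficient.
        dens = alpha / (2 * b) * np.exp(-np.abs(x)[:, None] / b)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for the weights and scale parameters.
        alpha = r.mean(axis=0)
        b = (r * np.abs(x)[:, None]).sum(axis=0) / r.sum(axis=0)
    return alpha, b

coeffs = np.random.default_rng(2).laplace(scale=1.5, size=4000)  # stand-in for wavelet coefficients
alpha, b = fit_laplacian_mixture(coeffs)
print(alpha, b)
```

For the fusion step, the sketch below concatenates each clip's visual and audio descriptors (late fusion) and trains an RBF-kernel SVM on the fused vectors, here with scikit-learn and synthetic stand-in features. The feature dimensions, kernel, and regularization constant are assumptions; only the concatenate-then-classify structure follows the description above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse_features(visual, audio):
    """Late fusion: concatenate the per-clip visual and audio descriptors."""
    return np.concatenate([visual, audio])

rng = np.random.default_rng(0)
n_clips, d_visual, d_audio = 200, 80, 6
X_visual = rng.normal(size=(n_clips, d_visual))  # stand-in for video-indexing features
X_audio = rng.normal(size=(n_clips, d_audio))    # stand-in for Laplacian-mixture parameters
y = rng.integers(0, 2, size=n_clips)             # binary semantic-concept labels

X = np.stack([fuse_features(v, a) for v, a in zip(X_visual, X_audio)])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```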

Notes

  1. Here we describe these concepts in text for the reader's benefit. However, our definition of a semantic concept is based on the perceptual features of the video, not on text.

  2. We chose k = 5 for video indexing in the experiments reported in Section 5; a toy illustration of a k = 5 clip descriptor follows these notes.

  3. The software we used for video segmentation is no longer available. However, a newer product, Movavi SplitMovie, may be found at: http://movavi.com/splitmovie.

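As a toy illustration of the clip-level visual descriptor referenced in note 2, the sketch below clusters per-frame features into k = 5 groups and concatenates the sorted centroids into a fixed-length vector. Plain k-means and the 16-bin histogram input are stand-ins chosen for illustration; the paper's adaptive video indexing is an adaptive scheme, not k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def clip_descriptor(frame_features, k=5):
    """frame_features: (num_frames, d) array of per-frame features
    (e.g., color histograms). Returns a fixed-length (k * d,) vector."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(frame_features)
    # Order centroids by cluster population so the layout is stable across clips.
    order = np.argsort(-np.bincount(km.labels_, minlength=k))
    return km.cluster_centers_[order].ravel()

frames = np.random.default_rng(1).random((120, 16))  # 120 frames, 16-bin histograms
print(clip_descriptor(frames).shape)                 # (80,)
```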


Author information

Correspondence to Paisarn Muneesawang.


Cite this article

Muneesawang, P., Guan, L. & Amin, T. A New Learning Algorithm for the Fusion of Adaptive Audio–Visual Features for the Retrieval and Classification of Movie Clips. J Sign Process Syst Sign Image Video Technol 59, 177–188 (2010). https://doi.org/10.1007/s11265-008-0290-7
