
Audio Feature Extraction and Analysis for Scene Segmentation and Classification

Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology

Abstract

Understanding the scene content of a video sequence is very important for content-based indexing and retrieval in multimedia databases. Research in this area over the past several years has focused on speech recognition and image analysis techniques. As a complementary effort to this prior work, we focus on using the associated audio information (mainly the nonspeech portion) for video scene analysis. As an example, we consider the problem of discriminating five types of TV programs, namely commercials, basketball games, football games, news reports, and weather forecasts. A set of low-level audio features is proposed for characterizing the semantic content of short audio clips. The linear separability of the different classes under the proposed feature space is examined using a clustering analysis. The effective features are identified by evaluating the intracluster and intercluster scattering matrices of the feature space. Using these features, a neural net classifier successfully separated the above five types of TV programs. By evaluating the changes between the feature vectors of adjacent clips, we can also identify scene breaks in an audio sequence quite accurately. These results demonstrate the capability of the proposed audio features for characterizing the semantic content of an audio sequence.
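The analysis pipeline summarized above (clip-level low-level audio features, followed by a scattering-matrix assessment of class separability) can be illustrated with a short sketch. The Python code below is an illustrative assumption, not the authors' implementation: the feature set (clip-level statistics of the short-time volume and zero-crossing-rate contours) is a simplified stand-in for the features proposed in the paper, while the separability score trace(S_w^{-1} S_b) follows the standard intracluster/intercluster scattering-matrix formulation.

```python
# Minimal sketch (illustrative, not the paper's exact feature set or code):
# compute a few low-level features per audio clip, then score how separable
# the program classes are using intra-/inter-cluster scatter matrices.
import numpy as np

def clip_features(clip, frame_len=512):
    """Frame-level volume (RMS) and zero-crossing rate, summarized per clip."""
    n_frames = len(clip) // frame_len
    frames = clip[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))                          # volume contour
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    # Summarize each contour by its mean and standard deviation (4-dim feature vector).
    return np.array([rms.mean(), rms.std(), zcr.mean(), zcr.std()])

def scatter_separability(X, y):
    """trace(S_w^{-1} S_b): larger values indicate better class separability."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))   # intracluster (within-class) scatter
    S_b = np.zeros((d, d))   # intercluster (between-class) scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    return np.trace(np.linalg.pinv(S_w) @ S_b)

# Toy usage: two synthetic "classes" of 1.5 s clips at 8 kHz.
rng = np.random.default_rng(0)
clips = [rng.normal(scale=s, size=12000) for s in (0.1,) * 20 + (0.5,) * 20]
X = np.array([clip_features(c) for c in clips])
y = np.array([0] * 20 + [1] * 20)
print("separability score:", scatter_separability(X, y))
```

A scene-break detector in the same spirit would compare the feature vectors of adjacent clips (for example, by Euclidean distance) and declare a break when the change exceeds a threshold.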




Cite this article

Liu, Z., Wang, Y. & Chen, T. Audio Feature Extraction and Analysis for Scene Segmentation and Classification. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 20, 61–79 (1998). https://doi.org/10.1023/A:1008066223044
