
A general audio classifier based on human perception motivated model

Multimedia Tools and Applications (2007)

Abstract

The audio channel conveys rich cues for content-based multimedia indexing. Beyond the well-known problems of speech recognition and speaker identification, interesting audio analysis tasks include speech/music segmentation, speaker gender detection, and the recognition of special effects such as gunshots or car chases. All of these tasks can be cast as audio classification problems, in which a label must be generated from low-level analysis of the audio signal. While most audio analysis techniques in the literature are problem specific, in this paper we propose a general framework for audio classification. The proposed technique uses a perceptually motivated model of how humans perceive audio classes, in the sense that it makes judicious use of certain psychophysical results, and relies on a neural network for classification. To assess the effectiveness of the proposed approach, extensive experiments were carried out on several audio classification problems, including speech/music discrimination in radio/TV programs, gender recognition on a subset of the Switchboard database, highlight detection in sports videos, and musical genre recognition. The classification accuracies of the proposed technique are comparable to those obtained by problem-specific techniques, while providing the basis of a general approach to audio classification.
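To make the described pipeline concrete, the sketch below illustrates the general shape of such a classifier: spectral statistics summarized over long analysis windows (a stand-in for the perceptually motivated features) fed to a neural network. This is a minimal illustrative sketch, not the authors' implementation; the feature parameters (FFT size, band count, window length) and the use of scikit-learn's MLPClassifier are assumptions made for illustration only.

```python
# Minimal sketch of a "long-window spectral statistics + neural network"
# audio classifier. Parameter choices are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

def spectral_stats_features(signal, sr=16000, n_fft=512, hop=256,
                            window_s=1.0, n_bands=16):
    """Mean and std of the log-magnitude spectrum in coarse frequency bands,
    summarized over ~1 s windows (a stand-in for perceptual features)."""
    # Short-time log-magnitude spectra
    frames = []
    for start in range(0, len(signal) - n_fft, hop):
        frame = signal[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.log(np.abs(np.fft.rfft(frame)) + 1e-8))
    frames = np.array(frames)                        # (n_frames, n_fft//2 + 1)
    # Group FFT bins into coarse bands
    bands = np.array_split(frames, n_bands, axis=1)
    band_energy = np.stack([b.mean(axis=1) for b in bands], axis=1)
    # Summarize each long window by per-band mean and standard deviation
    frames_per_window = max(1, int(window_s * sr / hop))
    feats = []
    for start in range(0, len(band_energy) - frames_per_window + 1,
                       frames_per_window):
        chunk = band_energy[start:start + frames_per_window]
        feats.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.array(feats)                           # (n_windows, 2 * n_bands)

# Hypothetical usage: X_train / y_train are window-level features and labels
# (e.g. speech vs. music) extracted from an annotated corpus.
# clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)
# clf.fit(X_train, y_train)
# predictions = clf.predict(spectral_stats_features(new_signal))
```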



Author information

Correspondence to Hadi Harb.


About this article

Cite this article

Harb, H., Chen, L. A general audio classifier based on human perception motivated model. Multimed Tools Appl 34, 375–395 (2007). https://doi.org/10.1007/s11042-007-0108-9
