Abstract
The audio channel conveys rich cues for content-based multimedia indexing. Beyond the well-known problems of speech recognition and speaker identification, audio analysis tasks of interest include speech/music segmentation, speaker gender detection, and recognition of special effects such as gunshots or car chases. All of these can be cast as audio classification problems, in which a label must be generated from low-level analysis of the audio signal. While most audio analysis techniques in the literature are problem-specific, we propose in this paper a general framework for audio classification. The proposed technique uses a perceptually motivated model of how humans perceive audio classes: it draws on certain psychophysical results and relies on a neural network for classification. To assess the effectiveness of the proposed approach, extensive experiments were carried out on several audio classification problems: speech/music discrimination in radio/TV programs, gender recognition on a subset of the Switchboard database, highlight detection in sports videos, and musical genre recognition. The classification accuracies of the proposed technique are comparable to those obtained by problem-specific techniques, while the technique offers the basis of a general approach to audio classification.
About this article
Cite this article
Harb, H., Chen, L. A general audio classifier based on human perception motivated model. Multimed Tools Appl 34, 375–395 (2007). https://doi.org/10.1007/s11042-007-0108-9
DOI: https://doi.org/10.1007/s11042-007-0108-9