Abstract
This work addresses the soundtrack indexing of multimedia documents. Our purpose is to detect and locate sound units in order to structure the audio data flow of broadcast programs (reports). We present two audio classification tools that we have developed. The first one, a speech/music classification tool, is based on three original features: entropy modulation, stationary segment duration (obtained with a Forward–Backward Divergence algorithm) and number of segments. These are merged with the classical 4 Hz modulation energy. The tool is divided into two classifications (speech/non-speech and music/non-music) and achieves more than 90% accuracy for speech detection and 89% for music detection. The other system, a jingle identification tool, uses a Euclidean distance in the spectral domain to index the audio data flow. Results show that it is efficient: of the 132 jingles to recognize, 130 were detected. Both systems are tested on TV and radio corpora (more than 10 h). They are simple, robust and can be improved on every corpus without training or adaptation.
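As an illustration of the jingle identification idea only, the following minimal sketch compares a reference jingle with a longer stream using a Euclidean distance between magnitude spectra. The frame length, the non-overlapping framing, the naive DFT, the detection threshold and all function names are assumptions made for this example; the abstract does not specify the authors' actual settings.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of a naive O(n^2) DFT, keeping bins 0 .. n//2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_frames(signal, frame_len):
    """Cut the signal into non-overlapping frames and return their spectra."""
    return [
        magnitude_spectrum(signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def euclidean(a, b):
    """Euclidean distance between two spectral vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_jingle(stream, jingle, frame_len=16, threshold=1.0):
    """Return frame offsets where the mean spectral distance between the
    stream and the reference jingle falls below the (assumed) threshold."""
    s = spectral_frames(stream, frame_len)
    j = spectral_frames(jingle, frame_len)
    hits = []
    for start in range(len(s) - len(j) + 1):
        dist = sum(euclidean(s[start + i], j[i]) for i in range(len(j))) / len(j)
        if dist < threshold:
            hits.append(start)
    return hits
```

Calling `find_jingle(stream, jingle)` returns the frame offsets at which the reference is judged present. In practice the threshold would have to be tuned on real recordings, and a faster transform would replace the naive DFT used here for self-containment.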
Abbreviations
- GMM: Gaussian Mixture Models
- pdf: power density function
- FBD: Forward–Backward Divergence
- FFT: Fast Fourier Transform
- FIR: Finite Impulse Response
Cite this article
Pinquier, J., André-Obrecht, R. Audio indexing: primary components retrieval. Multimed Tools Appl 30, 313–330 (2006). https://doi.org/10.1007/s11042-006-0027-1
DOI: https://doi.org/10.1007/s11042-006-0027-1