Audio indexing: primary components retrieval

Robust classification in audio documents

Multimedia Tools and Applications

Abstract

This work addresses soundtrack indexing of multimedia documents. Our purpose is to detect and locate sound units in order to structure the audio data flow in broadcast programs (reports). We present two audio classification tools that we have developed. The first, a speech/music classification tool, is based on three original features: entropy modulation, stationary segment duration (obtained with a Forward–Backward Divergence algorithm) and number of segments. These are merged with the classical 4 Hz modulation energy. The tool is divided into two classifications (speech/non-speech and music/non-music) and achieves more than 90% accuracy for speech detection and 89% for music detection. The second, a jingle identification tool, uses a Euclidean distance in the spectral domain to index the audio data flow. Results show that it is efficient: of 132 jingles to recognize, 130 were detected. Both systems are tested on TV and radio corpora (more than 10 h). They are simple, robust and can be improved on every corpus without training or adaptation.
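The jingle identification idea, comparing a reference jingle with the audio stream through a Euclidean distance in the spectral domain, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the naive DFT, the frame length, the hop size and the detection threshold are all assumptions chosen for readability.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of a naive DFT of one frame (bins 0 .. n/2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len, hop):
    """List of magnitude spectra computed on overlapping frames."""
    return [
        magnitude_spectrum(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def mean_euclidean_distance(spec_a, spec_b):
    """Average frame-by-frame Euclidean distance between two spectrograms."""
    total = 0.0
    for frame_a, frame_b in zip(spec_a, spec_b):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))
    return total / len(spec_a)


def find_jingle(stream, jingle, frame_len=32, hop=16, threshold=1.0):
    """Slide the jingle spectrogram over the stream spectrogram.

    Returns (sample_offset, distance) pairs whose distance falls below
    the threshold. The threshold value here is purely illustrative.
    """
    reference = spectrogram(jingle, frame_len, hop)
    observed = spectrogram(stream, frame_len, hop)
    hits = []
    for i in range(len(observed) - len(reference) + 1):
        window = observed[i:i + len(reference)]
        distance = mean_euclidean_distance(reference, window)
        if distance < threshold:
            hits.append((i * hop, distance))
    return hits


if __name__ == "__main__":
    # Synthetic jingle: two short tones; the stream embeds it between silences.
    jingle = ([math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
              + [math.sin(2 * math.pi * 8 * t / 32) for t in range(32)])
    stream = [0.0] * 64 + jingle + [0.0] * 64
    print(find_jingle(stream, jingle))
```

In practice a fast Fourier transform and a threshold tuned on real broadcast data would replace the naive DFT and the fixed value used here. A sliding comparison of this kind needs only the reference jingle itself, with no model training.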


Notes

  1. http://www.nist.gov/speech/tools/

Abbreviations

GMM: Gaussian Mixture Models

pdf: probability density function

FBD: Forward–Backward Divergence

FFT: Fast Fourier Transform

FIR: Finite Impulse Response


Author information

Corresponding author

Correspondence to Julien Pinquier.

About this article

Cite this article

Pinquier, J., André-Obrecht, R. Audio indexing: primary components retrieval. Multimed Tools Appl 30, 313–330 (2006). https://doi.org/10.1007/s11042-006-0027-1
