New speech/music discrimination approach based on fundamental frequency estimation

Multimedia Tools and Applications

Abstract

Automatic discrimination of speech and music is an important tool in many multimedia applications. This paper presents a robust and effective approach to speech/music discrimination that relies on a set of features derived from fundamental frequency (F0) estimation. The proposed F0-based feature set is compared with several commonly used timbral features to assess its discriminatory power. The classification scheme consists of a classical Statistical Pattern Recognition classifier followed by a Fuzzy Rule-Based System, and it is also compared with other well-proven classification schemes. Experimental results show that the proposed speech/music discriminator is robust, making it suitable for a wide variety of multimedia applications.
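
To make the idea concrete, the following is a minimal sketch of how F0-derived statistics can be computed per audio clip, assuming librosa's pYIN as a stand-in F0 estimator. The three statistics chosen here (voiced-frame ratio, log-F0 spread, and mean frame-to-frame log-F0 change) and the pitch search range are illustrative assumptions, not the paper's exact feature set.

    # Minimal sketch, assuming librosa's pYIN as a generic F0 estimator.
    # The statistics below are illustrative choices, not the paper's
    # exact F0-based feature set.
    import numpy as np
    import librosa

    def f0_features(path, sr=16000, frame_length=2048, hop_length=512):
        """Return a small F0-derived feature vector for one audio file."""
        y, sr = librosa.load(path, sr=sr, mono=True)
        f0, voiced_flag, _ = librosa.pyin(
            y,
            fmin=librosa.note_to_hz("C2"),  # ~65 Hz search floor (assumed)
            fmax=librosa.note_to_hz("C6"),  # ~1 kHz search ceiling (assumed)
            sr=sr,
            frame_length=frame_length,
            hop_length=hop_length,
        )
        voiced = f0[voiced_flag]            # frames with an F0 estimate
        if voiced.size < 2:
            return np.zeros(3)              # degenerate clip: no usable pitch
        log_f0 = np.log2(voiced)
        return np.array([
            voiced_flag.mean(),             # fraction of voiced frames
            log_f0.std(),                   # F0 spread, in octaves
            np.abs(np.diff(log_f0)).mean(), # mean frame-to-frame F0 movement
        ])

Feature vectors of this kind would feed the first-stage statistical classifier; the fuzzy-rule refinement stage described in the paper is not reproduced in this sketch.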



Corresponding author

Correspondence to N. Ruiz-Reyes.

Additional information

This work was supported in part by the Spanish Ministry of Education and Science under Project TEC2006-13883-C04-03.

Cite this article

Ruiz-Reyes, N., Vera-Candeas, P., Muñoz, J.E. et al. New speech/music discrimination approach based on fundamental frequency estimation. Multimed Tools Appl 41, 253–286 (2009). https://doi.org/10.1007/s11042-008-0228-x
