Abstract
Automatic discrimination of speech and music is an important tool in many multimedia applications. This paper presents a robust and effective approach to speech/music discrimination that relies on a set of features derived from fundamental frequency (F0) estimation. The proposed F0-based feature set is compared with some commonly used timbral features to assess its discriminatory power. The classification scheme consists of a classical statistical pattern recognition classifier followed by a fuzzy rule-based system, and it is also compared with other well-proven classification schemes. Experimental results indicate that the proposed speech/music discriminator is robust, making it suitable for a wide variety of multimedia applications.
Additional information
This work was supported in part by the Spanish Ministry of Education and Science under Project TEC2006-13883-C04-03.
About this article
Cite this article
Ruiz-Reyes, N., Vera-Candeas, P., Muñoz, J.E. et al. New speech/music discrimination approach based on fundamental frequency estimation. Multimed Tools Appl 41, 253–286 (2009). https://doi.org/10.1007/s11042-008-0228-x
DOI: https://doi.org/10.1007/s11042-008-0228-x