Abstract
A major challenge in identifying singers from monaural popular music recordings is to remove or alleviate the influence of the accompaniment. The proposed system operates in two stages. In the first stage, we exploit computational auditory scene analysis (CASA) to segregate the singing voice units from the mixture signal. The pitch of the singing voice is first estimated, and pitch-based features are extracted for each unit of the acoustic vector. These features are then used to estimate a binary time-frequency (T-F) mask, where 1 indicates that the corresponding T-F unit is dominated by the singing voice and 0 indicates otherwise. Units dominated by the singing voice are considered reliable, and the remaining units are treated as unreliable or missing, so the acoustic vector is incomplete. In the second stage, two missing feature methods, reconstruction of the acoustic vector and marginalization, are used to identify the singer from the incomplete acoustic vectors. In the reconstruction method, the complete acoustic vector is first reconstructed and then converted to Gammatone frequency cepstral coefficients (GFCCs), which are used to identify the singer. In the marginalization method, the probability that the voice belongs to a given singer is computed from the reliable components only. We find that the reconstruction method outperforms the marginalization method, and that both methods perform well in comparison with another system, especially at signal-to-accompaniment ratios (SARs) of 0 dB and −3 dB.
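The binary mask and the marginalization step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each singer is modeled by a diagonal-covariance Gaussian mixture, in which case marginalizing over the unreliable channels reduces to dropping them from the likelihood. The mask here is computed from known voice and accompaniment energies as an oracle stand-in for the pitch-based mask estimation, and the names `binary_mask`, `marginal_log_likelihood`, and `identify_singer` are hypothetical.

```python
import numpy as np


def binary_mask(voice_energy, accomp_energy, threshold_db=0.0):
    """Return 1 where the voice dominates a T-F unit, 0 otherwise.

    Assumption: both energies are known (oracle mask). The paper instead
    estimates the mask from pitch-based features.
    """
    eps = 1e-12  # avoids division by zero and log of zero
    local_ratio_db = 10.0 * np.log10((voice_energy + eps) / (accomp_energy + eps))
    return (local_ratio_db > threshold_db).astype(int)


def marginal_log_likelihood(frame, mask, weights, means, variances):
    """Log-likelihood of one frame under a diagonal GMM, reliable channels only."""
    reliable = mask.astype(bool)
    if not reliable.any():
        return 0.0  # everything marginalized out: probability 1, log-probability 0
    diff = frame[reliable] - means[:, reliable]  # shape (components, reliable dims)
    var = variances[:, reliable]
    log_comp = -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var, axis=1)
    log_joint = np.log(weights) + log_comp
    peak = log_joint.max()  # log-sum-exp for numerical stability
    return float(peak + np.log(np.sum(np.exp(log_joint - peak))))


def identify_singer(frames, masks, models):
    """Pick the singer whose model gives the highest total marginal log-likelihood.

    `models` maps a singer name to a (weights, means, variances) tuple.
    """
    scores = {
        name: sum(marginal_log_likelihood(f, m, *gmm) for f, m in zip(frames, masks))
        for name, gmm in models.items()
    }
    return max(scores, key=scores.get), scores
```

With two toy single-component models, a frame whose reliable channels match singer "A" is attributed to "A" even when its unreliable channels are heavily corrupted, because those channels never enter the likelihood.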
Cite this article
Hu, Y., Liu, G. Singer identification based on computational auditory scene analysis and missing feature methods. J Intell Inf Syst 42, 333–352 (2014). https://doi.org/10.1007/s10844-013-0271-6
DOI: https://doi.org/10.1007/s10844-013-0271-6