Abstract
Emotional speech recognition (ESR) is an emerging field of research in human-computer interaction. Most studies in this field are performed in clean environments. In real-world conditions, however, various noises and disturbances, such as car noise, background music, and buzz, can degrade the performance of such recognition systems. One of the most common noises encountered in everyday settings is babble noise. Because of its similarity to the desired speech, babble, or cross-talk, is highly challenging for speech-related systems. In this paper, in order to find the most appropriate features for ESR in the presence of babble noise at different signal-to-noise ratios, 286 features are extracted from speech utterances of two emotional speech datasets, one German and one Persian. The best features are then selected using different filter and wrapper methods. Finally, classifiers such as Bayes, KNN, GMM, ANN, and SVM are applied to the selected features in two settings, namely multi-class and binary classification.
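The experimental setup above contaminates clean emotional speech with babble noise at prescribed signal-to-noise ratios. As a minimal sketch of that step, not taken from the paper, the snippet below mixes a clean signal with noise at a target SNR; the synthetic sine tone and Gaussian stand-in for babble are illustrative assumptions:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10*log10(p_clean / (gain^2 * p_noise)) = snr_db for the noise gain.
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

# Illustrative signals: a 220 Hz tone as "speech", white noise as "babble".
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
babble = rng.standard_normal(16000)
noisy = mix_at_snr(clean, babble, snr_db=5.0)

# Verify the achieved SNR of the mixture.
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(achieved, 1))  # → 5.0
```

In a study like this one, the same mixing would be repeated over a range of SNRs (e.g. 0 to 20 dB) before feature extraction, so that classifier robustness can be compared across noise levels.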
Cite this article
Karimi, S., Sedaaghi, M.H. Robust emotional speech classification in the presence of babble noise. Int J Speech Technol 16, 215–227 (2013). https://doi.org/10.1007/s10772-012-9176-y