Abstract
Emotion recognition in speech signals is an active research topic that has attracted considerable attention in engineering applications. This paper presents a new approach to robust emotion recognition in speech signals in noisy environments. Using a weighted sparse representation model based on maximum likelihood estimation, an enhanced sparse representation classifier is proposed for robust emotion recognition in noisy speech. The effectiveness and robustness of the proposed method are investigated on clean and noisy emotional speech. The proposed method is compared with six typical classifiers: the linear discriminant classifier, K-nearest neighbor, the C4.5 decision tree, radial basis function neural networks, support vector machines, and the sparse representation classifier. Experimental results on two publicly available emotional speech databases, the Berlin database and the Polish database, show that the proposed method performs well on robust emotion recognition in noisy speech and outperforms the other methods compared.
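For readers unfamiliar with the baseline that the proposed classifier builds on, the following is a minimal sketch of a generic sparse representation classifier, not the enhanced method of this paper. It codes a test sample as a sparse combination of training samples and assigns the class whose training samples give the smallest reconstruction residual. The function names, the ISTA solver, the regularization value `lam`, and the optional `weights` argument are all assumptions made for illustration; in particular, the paper derives its weights from maximum likelihood estimation, which is not reproduced here.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_code(D, y, lam=0.01, n_iter=500):
    """Approximately solve min_x 0.5 * ||y - D x||_2^2 + lam * ||x||_1 with ISTA."""
    # Step size is 1 / L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x


def src_predict(D, labels, y, weights=None, lam=0.01):
    """Return the label whose training columns best reconstruct y.

    D       : (n_features, n_train) array, one training sample per column.
    labels  : (n_train,) class label of each column.
    weights : optional (n_features,) non-negative feature weights
              (hypothetical hook; the paper's MLE-based weights are not computed).
    """
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    if weights is not None:
        # Weighted least squares: scale each feature by sqrt(weight).
        w = np.sqrt(np.asarray(weights, dtype=float))
        D, y = D * w[:, None], y * w
    # Normalize training columns to unit l2 norm.
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0
    D = D / norms
    x = sparse_code(D, y, lam=lam)
    # Keep only the coefficients of one class at a time and compare residuals.
    residuals = {}
    for c in np.unique(labels):
        x_c = np.where(labels == c, x, 0.0)
        residuals[c] = np.linalg.norm(y - D @ x_c)
    return min(residuals, key=residuals.get)


# Toy data (made up for illustration): two classes, two training samples each.
D_demo = np.array([[1.0, 0.9, 0.0, 0.1],
                   [0.0, 0.1, 1.0, 0.9],
                   [0.1, 0.0, 0.1, 0.0]])
labels_demo = [0, 0, 1, 1]
print(src_predict(D_demo, labels_demo, [0.95, 0.05, 0.05]))
```

The ISTA solver is used here only to keep the sketch dependency-free; any l1-regularized least squares solver could be substituted.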


Acknowledgments
The authors would like to thank the anonymous reviewers and editors for their helpful comments and suggestions for improving this paper. This work is supported by the National Natural Science Foundation of China under Grant Nos. 61203257 and 61272261, and by the Zhejiang Provincial Natural Science Foundation of China under Grant Nos. Z1101048 and Y1111058.
Cite this article
Zhao, X., Zhang, S. & Lei, B. Robust emotion recognition in noisy speech via sparse representation. Neural Comput & Applic 24, 1539–1553 (2014). https://doi.org/10.1007/s00521-013-1377-z
DOI: https://doi.org/10.1007/s00521-013-1377-z