Robust emotion recognition in noisy speech via sparse representation

  • Original Article
  • Neural Computing and Applications

Abstract

Emotion recognition in speech signals is an active research topic that has attracted considerable attention in engineering applications. This paper presents a new approach to robust emotion recognition in speech signals in noisy environments. An enhanced sparse representation classifier, built on a weighted sparse representation model based on maximum likelihood estimation, is proposed for robust emotion recognition in noisy speech. The effectiveness and robustness of the proposed method are investigated on clean and noisy emotional speech. The proposed method is compared with six typical classifiers: linear discriminant classifier, K-nearest neighbor, C4.5 decision tree, radial basis function neural networks, support vector machines, and the sparse representation classifier. Experimental results on two publicly available emotional speech databases, the Berlin database and the Polish database, demonstrate the promising performance of the proposed method on robust emotion recognition in noisy speech, where it outperforms the other methods tested.
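The sparse representation classifier named in the abstract as a baseline can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the solver (iterative soft-thresholding, ISTA), the regularization weight `lam`, the function names `ista_lasso` and `src_predict`, and the toy data are all hypothetical choices made here. The weighted, maximum-likelihood-based variant proposed in the paper is not shown, because its details are not given in this preview.

```python
import numpy as np

def ista_lasso(D, y, lam=0.01, n_iter=500):
    """Approximately minimise 0.5*||y - D x||^2 + lam*||x||_1 with ISTA.
    (Assumed solver; the paper may use a different l1 optimiser.)"""
    # Step size 1/L, where L is the squared largest singular value of D.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x - step * (D.T @ (D @ x - y))                        # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return x

def src_predict(D, labels, y, lam=0.01):
    """Classify y by the class whose training samples reconstruct it best.
    D holds one training feature vector per column; labels gives its class."""
    D = D / np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm columns
    x = ista_lasso(D, y, lam)
    classes = np.unique(labels)
    # Residual of y when only the coefficients of one class are kept.
    residuals = [np.linalg.norm(y - D[:, labels == c] @ x[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Toy dictionary (hypothetical data): two training vectors per class.
D_demo = np.array([[1.0, 0.9, 0.0, 0.0],
                   [0.0, 0.1, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.9],
                   [0.0, 0.0, 0.0, 0.1]])
labels_demo = np.array(["a", "a", "b", "b"])
print(src_predict(D_demo, labels_demo, np.array([0.95, 0.05, 0.0, 0.0])))
```

In this sketch a test vector is coded as a sparse combination of all training vectors, and the predicted class is the one whose own coefficients leave the smallest reconstruction residual. In a speech setting, the columns of `D` would be acoustic feature vectors extracted from labeled training utterances.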


Acknowledgments

The authors would like to thank the anonymous reviewers and editors for their helpful comments and suggestions for improving this paper. This work is supported by the National Natural Science Foundation of China under Grant Nos. 61203257 and 61272261, and by the Zhejiang Provincial Natural Science Foundation of China under Grant Nos. Z1101048 and Y1111058.

Author information

Corresponding author

Correspondence to Shiqing Zhang.

About this article

Cite this article

Zhao, X., Zhang, S. & Lei, B. Robust emotion recognition in noisy speech via sparse representation. Neural Comput & Applic 24, 1539–1553 (2014). https://doi.org/10.1007/s00521-013-1377-z

  • DOI: https://doi.org/10.1007/s00521-013-1377-z
