Spoken emotion recognition via locality-constrained kernel sparse representation

  • Original Article
Neural Computing and Applications

Abstract

Spoken emotion recognition is an active research topic that has attracted extensive attention in signal processing, pattern recognition, and artificial intelligence. In this paper, a new emotion classification method based on kernel sparse representation, named locality-constrained kernel sparse representation-based classification (LC-KSRC), is proposed for spoken emotion recognition. Because LC-KSRC integrates both sparsity and data locality in the kernel feature space, it is able to learn more discriminative sparse representation coefficients. The proposed method is compared with six representative emotion classification methods: the linear discriminant classifier, K-nearest-neighbor, radial basis function neural networks, support vector machines, sparse representation-based classification, and kernel sparse representation-based classification. Experimental results on two publicly available emotional speech databases, the Berlin database and the Polish database, show that the proposed method performs well on spoken emotion recognition tasks and outperforms the six other methods.
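The abstract does not state the LC-KSRC objective, so the following is only a minimal sketch of how sparsity and data locality might be combined in a kernel feature space, not the authors' formulation. It assumes a Gaussian (RBF) kernel, a locality-weighted l1 penalty in which each training sample is penalized by its kernel-space distance to the query, and a plain iterative soft-thresholding (ISTA) solver. The names `rbf_kernel` and `lc_ksrc_predict`, the parameters `gamma`, `lam` and `n_iter`, and the toy data are all hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def lc_ksrc_predict(X_train, y_train, x, gamma=0.5, lam=0.01, n_iter=500):
    """Sketch: classify x from a locality-weighted sparse code in kernel space.

    Assumed objective (not taken from the paper):
        0.5 * ||phi(x) - Phi a||^2 + lam * sum_i w_i * |a_i|
    where w_i is the kernel-space distance between x and training sample i.
    """
    K = rbf_kernel(X_train, X_train, gamma)            # K[i, j] = k(x_i, x_j)
    k = rbf_kernel(X_train, x[None, :], gamma)[:, 0]   # k[i] = k(x_i, x)
    kxx = 1.0                                          # k(x, x) = 1 for an RBF kernel

    # Kernel trick for distances: ||phi(x) - phi(x_i)||^2 = k(x,x) - 2k(x,x_i) + k(x_i,x_i)
    w = np.sqrt(np.maximum(kxx - 2.0 * k + np.diag(K), 0.0))

    # ISTA: gradient step on the quadratic term, then weighted soft-thresholding.
    step = 1.0 / np.linalg.eigvalsh(K)[-1]             # 1 / largest eigenvalue of K
    a = np.zeros(len(y_train))
    for _ in range(n_iter):
        z = a - step * (K @ a - k)
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam * w, 0.0)

    # Assign the class whose coefficients give the smallest kernel-space residual.
    best_label, best_res = None, np.inf
    for c in np.unique(y_train):
        ac = np.where(y_train == c, a, 0.0)            # keep only class-c coefficients
        res = kxx - 2.0 * (k @ ac) + ac @ K @ ac
        if res < best_res:
            best_label, best_res = c, res
    return best_label, a

# Toy usage with two well-separated clusters (illustrative data only).
X = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
label, coeffs = lc_ksrc_predict(X, y, np.array([0.1, 0.0]))
print(label, np.round(coeffs, 3))
```

In this sketch the locality weights make it cheaper to use training samples that lie close to the query in the kernel feature space, so the nonzero coefficients tend to concentrate on nearby samples. A real system would operate on acoustic feature vectors extracted from speech rather than on two-dimensional toy points, and would tune `gamma` and `lam` by cross-validation.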

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61203257 and 61272261.

Author information

Corresponding author

Correspondence to Shiqing Zhang.

About this article

Cite this article

Zhao, X., Zhang, S. Spoken emotion recognition via locality-constrained kernel sparse representation. Neural Comput & Applic 26, 735–744 (2015). https://doi.org/10.1007/s00521-014-1755-1

  • DOI: https://doi.org/10.1007/s00521-014-1755-1
