Abstract
Spoken emotion recognition is an active research topic that has attracted extensive attention in signal processing, pattern recognition, and artificial intelligence. This paper proposes a new emotion classification method based on kernel sparse representation, named locality-constrained kernel sparse representation-based classification (LC-KSRC), for spoken emotion recognition. Because LC-KSRC integrates both sparsity and data locality in the kernel feature space, it can learn more discriminative sparse representation coefficients. The proposed method is compared with six representative emotion classification methods: linear discriminant classifier, K-nearest-neighbor, radial basis function neural networks, support vector machines, sparse representation-based classification, and kernel sparse representation-based classification. Experimental results on two publicly available emotional speech databases, the Berlin database and the Polish database, show that the proposed method outperforms the six compared methods on spoken emotion recognition tasks.
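The idea of combining a kernel-space representation with a locality constraint can be illustrated with a small sketch. The abstract does not state the exact LC-KSRC objective, so the code below is not the authors' algorithm: it assumes a Gaussian (RBF) kernel and swaps in a squared, distance-weighted penalty (in the style of locality-constrained linear coding) because that variant has a closed-form solution. The function names, the parameters `gamma` and `lam`, and the toy data are all hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def locality_kernel_classify(X_train, y_train, x_test, gamma=0.5, lam=0.1):
    """Classify x_test by class-wise reconstruction residual in kernel space.

    Assumed objective (a simplification, not taken from the paper):
        min_x ||phi(y) - Phi x||^2 + lam * sum_i d_i^2 * x_i^2
    where d_i is the kernel-space distance between the test sample y and
    training sample i. Setting the gradient to zero gives
        x = (K + lam * diag(d^2))^{-1} k
    """
    n = X_train.shape[0]
    K = rbf_kernel(X_train, X_train, gamma)                    # n x n Gram matrix
    k = rbf_kernel(X_train, x_test[None, :], gamma).ravel()    # k(x_i, y)
    k_yy = 1.0                                                 # RBF kernel: k(y, y) = 1

    # Squared distances in the kernel feature space, via the kernel trick.
    d2 = np.diag(K) + k_yy - 2.0 * k

    # Closed-form coefficients; a tiny jitter keeps the system well conditioned.
    coef = np.linalg.solve(K + lam * np.diag(d2) + 1e-8 * np.eye(n), k)

    # Residual of reconstructing phi(y) from each class's samples alone.
    classes = np.unique(y_train)
    residuals = []
    for c in classes:
        idx = np.where(y_train == c)[0]
        x_c = coef[idx]
        r = k_yy - 2.0 * x_c @ k[idx] + x_c @ K[np.ix_(idx, idx)] @ x_c
        residuals.append(r)
    return classes[int(np.argmin(residuals))]

# Toy data: two well-separated clusters standing in for two emotion classes.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2],
                    [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = locality_kernel_classify(X_train, y_train, np.array([0.1, 0.0]))
pred_b = locality_kernel_classify(X_train, y_train, np.array([5.0, 4.9]))
```

In this sketch, training samples far from the test sample in kernel space receive a large penalty and therefore small coefficients, so the representation is dominated by nearby samples. An l1-based sparsity term, as in sparse representation-based classification, would require an iterative solver instead of the single linear solve used here.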
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant Nos. 61203257 and 61272261.
Cite this article
Zhao, X., Zhang, S. Spoken emotion recognition via locality-constrained kernel sparse representation. Neural Comput & Applic 26, 735–744 (2015). https://doi.org/10.1007/s00521-014-1755-1
DOI: https://doi.org/10.1007/s00521-014-1755-1