Abstract
To improve the performance of spoken emotion recognition effectively, nonlinear dimensionality reduction is needed for speech data lying on a nonlinear manifold embedded in a high-dimensional acoustic space. This paper proposes a new supervised manifold learning algorithm for nonlinear dimensionality reduction, called the modified supervised locally linear embedding algorithm (MSLLE), for spoken emotion recognition. MSLLE enlarges the interclass distance while shrinking the intraclass distance, with the aim of improving the discriminating power and generalization ability of the low-dimensional embedded data representations. MSLLE is compared with three unsupervised dimensionality reduction methods, namely principal component analysis (PCA), locally linear embedding (LLE) and isometric mapping (Isomap), and with five supervised dimensionality reduction methods, namely linear discriminant analysis (LDA), supervised locally linear embedding (SLLE), local Fisher discriminant analysis (LFDA), neighborhood component analysis (NCA) and maximally collapsing metric learning (MCML), on spoken emotion recognition tasks. Experimental results on two emotional speech databases, the spontaneous Chinese database and the acted Berlin database, confirm the validity and promising performance of the proposed method.
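The idea of enlarging interclass distances while shrinking intraclass distances before neighbor selection can be illustrated with a minimal sketch. This is not the MSLLE formula, which the abstract does not give. The interclass term follows the usual SLLE-style adjustment (adding a fraction `alpha` of the maximum distance to pairs from different classes). The intraclass scaling by `(1 - beta)`, the function names and the parameters `alpha`, `beta` and `k` are all assumptions made only for illustration.

```python
import numpy as np


def adjusted_distances(X, y, alpha=0.5, beta=0.5):
    """Label-aware pairwise distances (illustrative rule, not the paper's).

    Pairs from different classes are pushed apart by alpha * max(d);
    pairs from the same class are pulled together by the factor (1 - beta).
    Both rules are assumptions chosen for this sketch.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Plain Euclidean distance matrix via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    same = y[:, None] == y[None, :]
    return np.where(same, (1.0 - beta) * d, d + alpha * d.max())


def nearest_neighbors(dist, k):
    """Indices of the k nearest neighbors of each point, excluding itself."""
    dist = np.array(dist, dtype=float)  # copy so the caller's matrix is untouched
    np.fill_diagonal(dist, np.inf)      # a point is never its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


# Tiny hypothetical data set: two classes, four 2-D points.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y = [0, 0, 1, 1]
neighbors = nearest_neighbors(adjusted_distances(X, y), k=1)
```

With label-aware distances, a point tends to pick neighbors from its own class even when a point of another class is closer in the original space, which is the effect a supervised LLE variant relies on when building its local reconstruction weights.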
Acknowledgements
The authors would like to thank the anonymous reviewers and editors for their helpful comments and suggestions for improving this paper. This work is supported by the Zhejiang Provincial Natural Science Foundation of China under Grants No. Z1101048 and No. Y1111058.
Cite this article
Zhang, S., Zhao, X. Dimensionality reduction-based spoken emotion recognition. Multimed Tools Appl 63, 615–646 (2013). https://doi.org/10.1007/s11042-011-0887-x
DOI: https://doi.org/10.1007/s11042-011-0887-x