Abstract
This paper addresses speech emotion analysis in the context of growing awareness of the wide application potential of affective computing. Unlike most work in the literature, which relies mainly on classical frequency- and energy-based features together with a single global classifier for emotion recognition, we propose new harmonic and Zipf-based features for better characterization of speech emotion along the valence dimension, and a multi-stage classification scheme driven by a dimensional emotion model for better discrimination between emotional classes. Evaluated on the Berlin dataset with 68 features and six emotion states, our approach proves effective, achieving a 68.60% classification rate, which rises to 71.52% when gender classification is applied first. On the DES dataset with five emotion states, our approach achieves an 81% recognition rate, whereas the best performance reported in the literature on the same dataset is, to our knowledge, 76.15%.
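To make the multi-stage idea concrete, the sketch below shows one way a classifier driven by a dimensional emotion model could be organised: a first stage separates high- from low-arousal utterances, and a second stage resolves the individual emotion within each arousal group. This is a minimal sketch only; the arousal groupings, the SVM back-end and the feature matrices are illustrative assumptions, not the exact features or classifiers used in the paper.

import numpy as np
from sklearn.svm import SVC

# Hypothetical grouping of the six Berlin emotion states along the arousal
# axis of a dimensional (arousal/valence) emotion model.
HIGH_AROUSAL = {"anger", "happiness", "fear"}


class TwoStageEmotionClassifier:
    """Stage 1 separates high- from low-arousal speech; stage 2 resolves
    the individual emotion within each arousal group."""

    def __init__(self):
        self.arousal_clf = SVC(kernel="rbf")  # stage 1: arousal split
        self.high_clf = SVC(kernel="rbf")     # stage 2a: emotions within high arousal
        self.low_clf = SVC(kernel="rbf")      # stage 2b: emotions within low arousal

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        high = np.array([label in HIGH_AROUSAL for label in y])
        self.arousal_clf.fit(X, high)
        self.high_clf.fit(X[high], y[high])
        self.low_clf.fit(X[~high], y[~high])
        return self

    def predict(self, X):
        X = np.asarray(X)
        high = self.arousal_clf.predict(X).astype(bool)
        out = np.empty(len(X), dtype=object)
        if high.any():
            out[high] = self.high_clf.predict(X[high])
        if (~high).any():
            out[~high] = self.low_clf.predict(X[~high])
        return out


# Usage with a hypothetical 68-dimensional feature matrix:
#   clf = TwoStageEmotionClassifier().fit(train_features, train_labels)
#   predictions = clf.predict(test_features)

Splitting the decision this way lets each second-stage classifier focus on the valence distinctions that are hardest for a single global classifier, which is the motivation behind the harmonic and Zipf-based features described in the abstract.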
Acknowledgment
This work was supported by a scholarship awarded by the French government from 2004 to 2007, and partly by the PRA project Apollo under number SI04-02 and PICS grant number 3597 from CNRS.
Cite this article
Xiao, Z., Dellandrea, E., Dou, W. et al. Multi-stage classification of emotional speech motivated by a dimensional emotion model. Multimed Tools Appl 46, 119–145 (2010). https://doi.org/10.1007/s11042-009-0319-3