On the recognition of emotional vocal expressions: motivations for a holistic approach

  • Review
  • Published in Cognitive Processing

Abstract

Human beings seem to recognize emotions from speech very well, and information and communication technology aims to implement machines and agents that can do the same. To recognize affective states automatically from speech signals, however, two main technological problems must be solved. The first concerns the identification of effective and efficient processing algorithms capable of capturing emotional acoustic features from speech; the second concerns the design of computational models able to classify a given set of emotional states with an accuracy approaching that of human listeners. This paper surveys these topics and provides some insights toward a holistic approach to the automatic analysis, recognition and synthesis of affective states.
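The two problems outlined above, extracting emotional acoustic features and classifying them into affective states, are commonly chained into a single pipeline. The sketch below is a minimal illustration of such a pipeline, not the approach advocated in this paper: it assumes the librosa and scikit-learn Python libraries are available, and `files` (paths to speech clips) and `labels` (their emotion tags) are hypothetical placeholders for a labelled emotional-speech corpus.

```python
# Illustrative sketch only, not the method proposed in this paper: a minimal
# emotion-from-speech pipeline assuming the librosa and scikit-learn libraries.
# `files` (wav paths) and `labels` (emotion tags) are hypothetical placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def acoustic_features(path, n_mfcc=13):
    """Summarize one speech clip as the mean and std of its MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def train_and_evaluate(files, labels):
    """Train a simple SVM classifier and report held-out accuracy."""
    X = np.stack([acoustic_features(f) for f in files])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
                                              random_state=0)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```

MFCC means and standard deviations fed to an SVM are only one of many possible feature/classifier pairings; the survey itself discusses richer acoustic feature sets and classification models and compares their performance against human listeners.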


Notes

  1. Sony AIBO Europe, Sony entertainment. www.sonydigital-link.com/AIBO/.


Acknowledgments

This work has been supported by the European projects COST 2102 “Cross Modal Analysis of Verbal and Nonverbal Communication” (cost2102.cs.stir.ac.uk/) and COST TD0904 “TIMELY: Time in MEntaL activity” (www.timely-cost.eu/). The authors thank three anonymous reviewers, Isabella Poggi and Maria Teresa Riviello for their useful comments and suggestions. Miss Tina Marcella Nappi is acknowledged for her editorial help.

Author information


Correspondence to Anna Esposito.

Additional information

This article is part of the Supplement Issue on ‘Social Signals. From Theory to Applications’, guest-edited by Isabella Poggi, Francesca D’Errico, and Alessandro Vinciarelli.

Cite this article

Esposito, A., Esposito, A.M. On the recognition of emotional vocal expressions: motivations for a holistic approach. Cogn Process 13 (Suppl 2), 541–550 (2012). https://doi.org/10.1007/s10339-012-0516-2
