On the recognition of emotional vocal expressions: motivations for a holistic approach

  • Review
  • Published in Cognitive Processing

Abstract

Human beings seem to recognize emotions from speech very well, and information and communication technology aims to implement machines and agents that can do the same. To recognize affective states automatically from speech signals, however, two main technological problems must be solved. The first concerns the identification of effective and efficient processing algorithms capable of capturing emotional acoustic features from speech; the second concerns the design of computational models able to classify a given set of emotional states with an accuracy approaching that of human listeners. This paper surveys these topics and provides some insights toward a holistic approach to the automatic analysis, recognition and synthesis of affective states.
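The two problems outlined above, extracting emotional acoustic features and classifying them into affective states, are commonly chained into a single pipeline. The sketch below is a minimal illustration of such a pipeline, not the approach advocated in this paper: it assumes the librosa and scikit-learn Python libraries are available, and `files` (paths to speech clips) and `labels` (their emotion tags) are hypothetical placeholders for a labelled emotional-speech corpus.

```python
# Illustrative sketch only, not the method proposed in this paper: a minimal
# emotion-from-speech pipeline assuming the librosa and scikit-learn libraries.
# `files` (wav paths) and `labels` (emotion tags) are hypothetical placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def acoustic_features(path, n_mfcc=13):
    """Summarize one speech clip as the mean and std of its MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def train_and_evaluate(files, labels):
    """Train a simple SVM classifier and report held-out accuracy."""
    X = np.stack([acoustic_features(f) for f in files])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
                                              random_state=0)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```

MFCC means and standard deviations fed to an SVM are only one of many possible feature/classifier pairings; the survey itself discusses richer acoustic feature sets and classification models and compares their performance against human listeners.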


Notes

  1. Sony AIBO Europe, Sony entertainment. www.sonydigital-link.com/AIBO/.


Acknowledgments

This work has been supported by the European projects COST 2102 “Cross Modal Analysis of Verbal and Nonverbal Communication” (cost2102.cs.stir.ac.uk/) and COST TD0904 “TIMELY: Time in MEntaL activity” (www.timely-cost.eu/). The authors thank three anonymous reviewers, Isabella Poggi and Maria Teresa Riviello for their useful comments and suggestions. Miss Tina Marcella Nappi is acknowledged for her editorial help.

Author information


Correspondence to Anna Esposito.

Additional information

This article is part of the Supplement Issue on ‘Social Signals. From Theory to Applications’, guest-edited by Isabella Poggi, Francesca D’Errico, and Alessandro Vinciarelli.

Cite this article

Esposito, A., Esposito, A.M. On the recognition of emotional vocal expressions: motivations for a holistic approach. Cogn Process 13 (Suppl 2), 541–550 (2012). https://doi.org/10.1007/s10339-012-0516-2
