2005 Special Issue
Emotion recognition in human–computer interaction
Introduction
As computers and computer-based applications become more and more sophisticated and increasingly involved in our everyday life, whether at a professional, a personal or a social level, it becomes ever more important that we are able to interact with them in a natural way, similar to the way we interact with other human agents. The most crucial feature of human interaction that grants naturalism to the process is our ability to infer the emotional states of others based on covert and/or overt signals of those emotional states. This allows us to adjust our responses and behavioural patterns accordingly, thus ensuring convergence and optimisation of the interactive process. This paper is based on the theoretical foundations of, and work carried out within, the collaborative EC project called ERMIS (for emotionally rich man–machine intelligent system), in which we have been involved recently. The aim of ERMIS is the development of a hybrid system capable of recognising people's emotions based on information from their faces and speech, both from the point of view of their prosodic and lexical content. In particular, we will develop a neural network architecture and a simulation demonstrating its recognition of emotions in speech and face stimuli. This work will lead to open questions indicating further lines of enquiry.
The literature on emotions is rich and spans several disciplines, often with no obvious overlap or consolidating outlook. Our view of emotions has thus been shaped by the philosophy of Rene Descartes, the biological concepts of Charles Darwin and the psychological theories of William James, to mention only a few of the major figures of the human sciences. Such theoretical concepts should be used as guidelines in putting together an automatic emotion recognition system (such as ERMIS), provided that they are shown to be consistent with more recent knowledge on emotions, such as that stemming from the modern neurosciences. Indeed, recent technological advances have allowed us to probe the human brain, and particularly the circuitry involved in recognising emotions, yielding a more detailed understanding of the function and structure of emotion recognition in the brain. At the same time, technological advances have significantly improved the signal processing techniques applied to the analysis of the physical correlates of emotions (such as facial and vocal features), thus allowing efficient multi-modal emotion recognition interfaces to be built.
The possible applications of an interface capable of assessing human emotional states are numerous. One of the uses of such an interface is to enhance human judgement of emotion in situations where objectivity and accuracy are required. Lie detection is an obvious example of such situations, although improving on human performance would require a very effective emotion recognition system. Another example is clinical studies of schizophrenia, and particularly the diagnosis of flattened affect, which so far relies on the psychiatrist's subjective judgement of a subject's emotionality based on various physiological cues. An automatic emotion-sensitive system could augment these judgements, so minimising the dependence of the diagnostic procedure on an individual psychiatrist's perception of emotionality. More generally along those lines, automatic emotion detection and classification can be used in a wide range of psychological and neuro-physiological studies of human emotional expression that so far rely on subjects' self-report of their emotional state, which often proves problematic. In a professional environment, enriching a teleconference session with real-time information on the emotional state of the participants could compensate for the reduced naturalism of the medium, again assisting humans in their emotional discriminatory capacity.
Another use of an emotion-sensitive system could be to embed it in an automatic tutoring application. An emotion-sensitive automatic tutor can interactively adjust the content of the tutorial and the speed at which it is delivered based on whether the user finds it boring and dreary, exciting and thrilling, or even unapproachable and daunting. The system could recommend a break when signs of weariness are detected. Similarly, emotion-sensitivity can be added to automatic customer services, call centres or personal assistants, for example, to help detect frustration and avoid further irritation, with the option of passing the interaction over to a human, or even terminating it altogether. One could also imagine an emotion-responsive car that can alert the driver when it detects signs of stress or anger that could impair their driving abilities.
The most obvious commercial application of emotion-sensitive systems is the game and entertainment industry with either interactive games that offer the sensation of naturalistic human-like interaction, or pets, dolls and so on that are sensitive to the owner's mood and can respond accordingly. Finally, owing to the shared basis of human emotion recognition and emotional expression, understanding and developing automatic systems for emotion recognition can assist in generating faces and/or voices endowed with convincingly human-like emotional qualities. This can in turn lead to a fully interactive system or agent that can perceive emotion and respond emotionally. This would thereby take human–machine interaction a step closer to human–human interaction.
In the sections that follow we will briefly review some of the prominent theories of emotions and the issues that arise from them. We will then turn to the more modern theoretical advances and experimental evidence and discuss issues that arise separately on the side of the sender and on the side of the receiver. After that we will explore the nature of the emotional features from the various modalities and discuss the available data for training and testing. Finally, we will present an artificial neural network architecture for fusing emotional information from the various modalities under attentional modulation and present the results obtained in the ERMIS framework through this neural network.
Section snippets
The psychological tradition
In our effort to construct an automatic emotion recogniser, it is important to examine the ideas proposed on the nature of emotions insofar as they shape the way emotional states are described. These ideas can guide us in determining what an emotional state is and what the relevant features are which distinguish this state from others. It is also crucial to delineate the nature of the mapping of these relevant features to the state's internal representation so that effective models of this
Training and testing material
An automatic emotion recognition system that employs learning architectures (e.g. neural networks), such as the one developed for ERMIS, requires sufficient training and testing material. This material should contain two streams: an input stream and an output stream. The input stream would comprise the extracted relevant features from the various modalities (prosody, faces, words, etc.) and the output would comprise the emotional class or category or more generally the emotional representation
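The pairing of an input stream (per-modality feature vectors) with an output stream (a target emotional representation) can be sketched as follows. This is a minimal illustrative data structure, not the ERMIS specification; the field names, feature choices and the use of an activation–evaluation target are assumptions for the example:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSample:
    # Input stream: one feature vector per modality (names are illustrative)
    prosody: List[float]   # e.g. pitch and energy statistics from the speech signal
    face: List[float]      # e.g. facial animation parameters for a frame
    words: List[float]     # e.g. lexical affect scores for the words spoken
    # Output stream: target emotional representation, here a point in
    # a continuous activation-evaluation space
    activation: float
    evaluation: float


# A single (toy) training pair: features in, emotional target out
sample = TrainingSample(
    prosody=[0.42, 1.7, 0.03],
    face=[0.1, -0.2, 0.05],
    words=[0.6],
    activation=0.3,
    evaluation=-0.1,
)
```

A learning architecture would then be trained to map the three input vectors onto the two output coordinates, one such pair per labelled time window.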
The architecture
One of the most important effects of emotions is their ability to capture attention, whether it is 'bottom-up' attention directed to stimuli or events that have been automatically registered as emotional, or 'top-down' attention re-engaged to a stimulus or event that has been evaluated as important to current needs and goals after a cognitive appraisal mediated by a complex emotional–cognitive network. This emotion–attention interaction has been extensively discussed in the previous
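One simple way to picture attentional modulation over modalities is a gating scheme in which each modality's features are weighted by an attention coefficient before fusion. The sketch below is a hedged illustration of that idea, not the actual ERMIS/ANNA architecture; the softmax gating and the salience scores are assumptions introduced for the example:

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def fuse_with_attention(modality_features, saliences):
    """Scale each modality's feature vector by an attention weight derived
    from a per-modality salience score, then concatenate the results."""
    weights = softmax(saliences)
    fused = []
    for feats, w in zip(modality_features, weights):
        fused.extend(f * w for f in feats)
    return fused


# Toy feature vectors for voice, face and words, plus salience scores
# (in a trained system the saliences would come from a learned gating net)
features = [[0.2, 0.5], [1.0, -0.3], [0.7]]
saliences = [2.0, 0.5, 1.0]
fused = fuse_with_attention(features, saliences)
```

In this picture, 'bottom-up' capture corresponds to a modality driving its own salience score up, while 'top-down' re-engagement corresponds to the scores being set by a separate appraisal process.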
General features of the analysis
A selection of SALAS sessions was analysed by the respective ERMIS partners, who extracted the relevant features from the voice, face and word streams. These sessions were also evaluated for their emotional content by four subjects using the FEELTRACE program. The resulting streams of input and output data were in turn analysed using ANNA. The results are shown in Table 1, which gives the full set of ASSESS–FAPs–DAL training results, as well as the testing results.
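With four raters producing continuous emotional evaluations, the individual traces must be combined into a single target stream before training. A minimal sketch of one plausible scheme, per-time-point averaging, is given below; it assumes the raters' (activation, evaluation) traces are already aligned to common time points, and is illustrative rather than the procedure actually used in ERMIS:

```python
def average_traces(traces):
    """Average several raters' continuous (activation, evaluation) traces,
    assuming all traces share the same sequence of time points."""
    n = len(traces)
    length = len(traces[0])
    avg = []
    for t in range(length):
        act = sum(tr[t][0] for tr in traces) / n
        ev = sum(tr[t][1] for tr in traces) / n
        avg.append((act, ev))
    return avg


# Two toy rater traces, each with two time points (activation, evaluation)
rater_traces = [
    [(0.2, -0.1), (0.4, 0.0)],
    [(0.0, -0.3), (0.2, -0.2)],
]
target = average_traces(rater_traces)  # per-time-point consensus labels
```

More elaborate schemes (rater weighting, time-warping of misaligned traces) are possible, but the averaged trace already provides a usable output stream for a learning architecture.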
To explain what has been
Conclusions
In this paper we have introduced the framework of the EC project ERMIS. The aim of this project was to build an automatic emotion recognition system able to exploit multimodal emotional markers such as those embedded in the voice, face and words spoken. We discussed the numerous potential applications of such a system for industry as well as in academia. We then turned to the psychological literature to help lay the theoretical foundation of our system and make use of insights from the various
Acknowledgements
We would like to acknowledge help from all of the partners in ERMIS, especially Roddie Cowie and Ellie Douglas-Cowie from QUB, as well as our colleagues from NTUA led by Stefanos Kollias. We would also like to acknowledge financial help from the EC under project ERMIS, under which this work was carried out.