
Pattern Recognition

Volume 44, Issue 3, March 2011, Pages 572-587

Survey on speech emotion recognition: Features, classification schemes, and databases

https://doi.org/10.1016/j.patcog.2010.09.020

Abstract

Recently, increasing attention has been directed to the study of the emotional content of speech signals, and many systems have accordingly been proposed to identify the emotional content of a spoken utterance. This paper surveys speech emotion classification, addressing three important aspects of the design of a speech emotion recognition system: the choice of suitable features for speech representation, the design of an appropriate classification scheme, and the proper preparation of an emotional speech database for evaluating system performance. Conclusions about the performance and limitations of current speech emotion recognition systems are discussed in the last section of this survey, which also suggests possible ways of improving them.

Introduction

The speech signal is the fastest and most natural method of communication between humans. This fact has motivated researchers to think of speech as a fast and efficient method of interaction between human and machine, which requires that the machine have sufficient intelligence to recognize human voices. Since the late fifties, there has been tremendous research on speech recognition, the process of converting human speech into a sequence of words. Despite the great progress made in speech recognition, however, we are still far from having a natural interaction between man and machine because the machine does not understand the emotional state of the speaker. This gap has given rise to a relatively recent research field, speech emotion recognition, defined as extracting the emotional state of a speaker from his or her speech. It is believed that speech emotion recognition can be used to extract useful semantics from speech and hence improve the performance of speech recognition systems [93].

Speech emotion recognition is particularly useful for applications that require natural man–machine interaction, such as web movies and computer tutorial applications, where the response of the system to the user depends on the detected emotion [116]. It is also useful for in-car systems, where information about the mental state of the driver may be provided to the system to help ensure his or her safety [116]. It can also be employed as a diagnostic tool for therapists [41], and it may be useful in automatic translation systems, in which the emotional state of the speaker plays an important role in communication between parties. In aircraft cockpits, it has been found that speech recognition systems trained on stressed speech achieve better performance than those trained on normal speech [49]. Speech emotion recognition has also been used in call center applications and mobile communication [86]. In such applications, the main objective of employing speech emotion recognition is to adapt the system response upon detecting frustration or annoyance in the speaker's voice.

The task of speech emotion recognition is very challenging for several reasons. First, it is not clear which speech features are most powerful in distinguishing between emotions. The acoustic variability introduced by different sentences, speakers, speaking styles, and speaking rates adds another obstacle, because these properties directly affect most of the commonly extracted speech features, such as the pitch and energy contours [7]. Moreover, more than one emotion may be perceived in the same utterance, with each emotion corresponding to a different portion of the spoken utterance, and it is very difficult to determine the boundaries between these portions. Another challenging issue is that how a certain emotion is expressed generally depends on the speaker and his or her culture and environment. Most work has focused on monolingual emotion classification under the assumption that there are no cultural differences among speakers, although multilingual classification has been investigated [53]. A further problem is that a person may undergo a certain emotional state, such as sadness, for days, weeks, or even months, while other emotions are transient and last no more than a few minutes. As a consequence, it is not clear which emotion the automatic emotion recognizer will detect: the long-term emotion or the transient one.

Emotion does not have a commonly agreed theoretical definition [62]. However, people know emotions when they feel them, and researchers have therefore been able to study and define different aspects of emotions. It is widely thought that emotion can be characterized in two dimensions: activation and valence [40]. Activation refers to the amount of energy required to express a certain emotion. In physiological studies of the emotion production mechanism, Williams and Stevens [136] found that the sympathetic nervous system is aroused by the emotions of joy, anger, and fear, inducing an increased heart rate, higher blood pressure, changes in the depth of respiratory movements, greater sub-glottal pressure, dryness of the mouth, and occasional muscle tremor. The resulting speech is correspondingly loud, fast, and enunciated with strong high-frequency energy, a higher average pitch, and a wider pitch range. With the arousal of the parasympathetic nervous system, on the other hand, as with sadness, heart rate and blood pressure decrease and salivation increases, producing speech that is slow, low-pitched, and with little high-frequency energy. Thus, acoustic features such as the pitch, timing, voice quality, and articulation of the speech signal correlate strongly with the underlying emotion [20]. However, emotions cannot be distinguished using activation alone. For example, anger and happiness both correspond to high activation, yet they convey different affect; this difference is characterized by the valence dimension. Unfortunately, there is no agreement among researchers on how, or even whether, acoustic features correlate with this dimension [79]. Therefore, while classification between high-activation (also called high-arousal) emotions and low-activation emotions can be achieved at high accuracy, classification between different emotions remains challenging.
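
To make these acoustic correlates concrete, the sketch below extracts the two contours most often cited in this context, fundamental frequency (F0, i.e. pitch) and short-time energy, and summarizes them per utterance. This is an illustration rather than the survey's method; it assumes the librosa library is available, and "utterance.wav" is a hypothetical input file.

```python
# Minimal sketch: pitch (F0) and energy contours as activation correlates.
# Assumes librosa is installed; "utterance.wav" is a hypothetical input.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

# F0 contour via the pYIN tracker; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Short-time energy contour over 32 ms frames with 50% overlap.
frame_len, hop = 512, 256
energy = np.array([np.sum(y[i:i + frame_len] ** 2)
                   for i in range(0, len(y) - frame_len, hop)])

# High-arousal emotions (anger, joy, fear) tend to show a higher mean F0,
# a wider F0 range, and more energy than low-arousal ones such as sadness.
print("mean F0 (Hz):", np.nanmean(f0))
print("F0 range (Hz):", np.nanmax(f0) - np.nanmin(f0))
print("mean frame energy:", energy.mean())
```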

An important issue in speech emotion recognition is the need to determine a set of important emotions to be classified by an automatic emotion recognizer. Linguists have defined inventories of the emotional states most frequently encountered in our lives. A typical set, given by Schubiger [111] and O'Connor and Arnold [95], contains 300 emotional states; classifying such a large number of emotions is very difficult. Many researchers therefore subscribe to the 'palette theory', which states that any emotion can be decomposed into primary emotions, much as any color is a combination of a few basic colors. The primary emotions are anger, disgust, fear, joy, sadness, and surprise [29]. These are the most obvious and distinct emotions in our lives, and they are called the archetypal emotions [29].

In this paper, we present a comprehensive review of speech emotion recognition systems, targeting pattern recognition researchers who do not necessarily have a deep background in speech analysis. We survey three important aspects of speech emotion recognition: (1) important design criteria of emotional speech corpora, (2) the impact of speech features on classification performance, and (3) the classification systems employed in speech emotion recognition. Though there are many reviews of speech emotion recognition, such as [129], [5], [12], our survey covers the speech features and classification techniques in greater depth. We survey different types of features and consider the benefits of combining the available acoustic information with other sources of information, such as linguistic, discourse, and video information. We also cover, in some detail, the classification techniques commonly used in speech emotion recognition, and we include numerous speech emotion recognition systems implemented in other research papers in order to give insight into the performance of existing recognizers. The reader should, however, interpret the recognition rates of those systems carefully, since different emotional speech corpora and experimental setups were used for each of them.

The paper is divided into five sections. In Section 2, important issues in the design of an emotional speech database are discussed. Section 3 reviews in detail speech feature extraction methods. Classification techniques applied in speech emotion recognition are addressed in Section 4. Finally, important conclusions are drawn in Section 5.

Section snippets

Emotional speech databases

An important issue to be considered in the evaluation of an emotional speech recognizer is the degree of naturalness of the database used to assess its performance. Incorrect conclusions may be drawn if a low-quality database is used. Moreover, the design of the database is critically important to the classification task being considered. For example, the emotions being classified may be infant-directed, e.g. soothing and prohibition [15], [120], or adult-directed, e.g. joy and anger [22]…

Features for speech emotion recognition

An important issue in the design of a speech emotion recognition system is the extraction of suitable features that efficiently characterize different emotions. Since pattern recognition techniques are rarely independent of the problem domain, it is believed that a proper selection of features significantly affects the classification performance.

Four issues must be considered in feature extraction. The first is the region of analysis over which features are computed. While some researchers…
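
Although the snippet above breaks off, the region-of-analysis question it raises can be illustrated concretely: local features are computed per short frame, while global features summarize the whole utterance. Below is a minimal sketch under the same assumptions as before (librosa available, hypothetical input file):

```python
# Local (per-frame) MFCC features versus a global (per-utterance) statistic
# vector derived from them. Assumes librosa; "utterance.wav" is hypothetical.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

# Local: one 13-dimensional MFCC vector per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)

# Global: summarize each coefficient's trajectory over the utterance,
# e.g. by its mean and standard deviation -> one fixed-length vector.
global_features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(global_features.shape)  # (26,)
```

A fixed-length global vector suits static classifiers such as SVMs, whereas the frame sequence itself suits dynamic models such as HMMs.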

Classification schemes

A speech emotion recognition system consists of two stages: (1) a front-end processing unit that extracts the appropriate features from the available (speech) data, and (2) a classifier that decides the underlying emotion of the speech utterance. In fact, most current research in speech emotion recognition has focused on the first stage, since feature extraction represents the interface between the problem domain and the classification techniques. On the other hand, traditional classifiers have been used in almost…
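
As an illustration of this two-stage architecture (again a sketch, not the survey's implementation), the fixed-length utterance vectors from the previous example can be fed to a conventional classifier; scikit-learn's SVM is used here, and the data are random placeholders:

```python
# Two-stage pipeline: utterance-level feature vectors in, emotion label out.
# Random placeholder data stand in for a real emotional speech corpus.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 26))  # e.g. the 26-dim MFCC statistics above
y = rng.choice(["anger", "joy", "sadness", "neutral"], size=200)

# Standardize features, then train a radial-basis-function SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)
print(clf.predict(X[:3]))  # predicted emotion labels
```

Any of the traditional classifiers discussed in the survey (GMMs, HMMs, neural networks, SVMs) could take the place of the SVM in the second stage.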

Conclusions

In this paper, a survey of current research on speech emotion recognition has been given. Three important issues have been studied: the features used to characterize different emotions, the classification techniques used in previous research, and the important design criteria of emotional speech databases. Several conclusions can be drawn from this study.

The first one is that while high classification accuracies have been obtained for classification between…

References (146)

  • H. Akaike, A new look at the statistical model identification, IEEE Trans. Autom. Control (1974).
  • N. Amir, S. Ron, N. Laor, Analysis of an emotional speech corpus in Hebrew based on objective criteria, in:...
  • J. Ang, R. Dhillon, A. Krupski, E. Shriberg, A. Stolcke, Prosody-based automatic detection of annoyance and frustration...
  • B.S. Atal, Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, J. Acoust. Soc. Am. (1974).
  • M.M.H. El Ayadi, M.S. Kamel, F. Karray, Speech emotion recognition using Gaussian mixture vector autoregressive models,...
  • R. Banse et al., Acoustic profiles in vocal emotion expression, J. Pers. Soc. Psychol. (1996).
  • A. Batliner, K. Fischer, R. Huber, J. Spilker, E. Nöth, Desperately seeking emotions: actors, wizards and human beings,...
  • S. Beeke et al., Prosody as a compensatory strategy in the conversations of people with agrammatism, Clin. Linguist. Phonetics (2009).
  • C.M. Bishop, Neural Networks for Pattern Recognition (1995).
  • M. Borchert, A. Dusterhoft, Emotions in speech—experiments with prosody and quality features in speech for use in...
  • L. Bosch, Emotions, speech and the ASR framework, Speech Commun. (2003).
  • S. Bou-Ghazale et al., A comparative study of traditional and newly proposed features for recognition of speech under stress, IEEE Trans. Speech Audio Process. (2000).
  • C. Breazeal et al., Recognition of affective communicative intent in robot-directed speech, Autonomous Robots (2002).
  • L. Breiman, Bagging predictors, Mach. Learn. (1996).
  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining Knowl. Discovery (1998).
  • F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss, A database of German emotional speech, in: Proceedings...
  • C. Busso et al., Analysis of emotionally salient aspects of fundamental frequency for emotion detection, IEEE Trans. Audio Speech Language Process. (2009).
  • J. Cahn, The generation of affect in synthesized speech, J. Am. Voice Input/Output Soc. (1990).
  • D. Cairns et al., Nonlinear analysis and detection of speech under stressed conditions, J. Acoust. Soc. Am. (1994).
  • W. Campbell, Databases of emotional speech, in: Proceedings of the ISCA (International Speech Communication and...
  • C. Chen, M. You, M. Song, J. Bu, J. Liu, An enhanced speech emotion recognition system based on discourse information,...
  • L. Chen, T. Huang, T. Miyasato, R. Nakatsu, Multimodal human emotion/expression recognition, in: Proceedings of the...
  • Z. Chuang, C. Wu, Emotion recognition using acoustic features and textual content, in: Multimedia and Expo, 2004, IEEE...
  • R. Cohen, A computational theory of the function of clue words in argument understanding, in: ACL-22: Proceedings of...
  • R. Cowie, E. Douglas-Cowie, Automatic statistical analysis of the signal and prosodic signs of emotion in speech, in:...
  • R. Cowie et al., Emotion recognition in human–computer interaction, IEEE Signal Process. Mag. (2001).
  • N. Cristianini et al., An Introduction to Support Vector Machines (2000).
  • J.R. Davitz, The Communication of Emotional Meaning (1964).
  • A. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. (1977).
  • L. Devillers, L. Lamel, Emotion detection in task-oriented dialogs, in: Proceedings of the International Conference on...
  • R. Duda et al., Pattern Classification (2001).
  • D. Edwards, Emotion discourse, Culture Psychol. (1999).
  • P. Ekman, Emotion in the Human Face (1982).
  • M. Abu El-Yazeed et al., On the determination of optimal model order for GMM-based text-independent speaker identification, EURASIP J. Appl. Signal Process. (2004).
  • I. Engberg, A. Hansen, Documentation of the Danish emotional speech database DES...
  • Y. Ephraim et al., Hidden Markov processes, IEEE Trans. Inf. Theory (2002).
  • R. Fernandez, A computational model for the automatic recognition of affect in speech, Ph.D. Thesis, Massachusetts...
  • D.J. France et al., Acoustical properties of speech as indicators of depression and suicidal risk, IEEE Trans. Biomedical Eng. (2000).
  • L. Fu, X. Mao, L. Chen, Speaker independent emotion recognition based on SVM/HMMs fusion system, in: International...
  • H. Go, K. Kwak, D. Lee, M. Chun, Emotion recognition from the facial image and speech signal, in: Proceedings of the...

    Moataz M.H. El Ayadi received his B.Sc. degree (Hons) in Electronics and Communication Engineering, Cairo University, in 2000, M.Sc. degree in Engineering Mathematics and Physics, Cairo University, in 2004, and Ph.D. degree in Electrical and Computer Engineering, University of Waterloo, in 2008.

    He worked as a postdoctoral research fellow in the Electrical and Computer Engineering Department, University of Toronto, from January 2009 to March 2010. Since April 2010, he has been an assistant professor in the Engineering Mathematics and Physics Department, Cairo University.

    His research interests include statistical pattern recognition and speech processing. His master's work was in enhancing the performance of text-independent speaker identification systems that use Gaussian mixture models as the core statistical classifier. The main contribution was the development of a new model order selection technique based on a goodness-of-fit statistical test. He followed the same line of research in his Ph.D.

    Mohamed S. Kamel received the B.Sc. (Hons) degree in Electrical Engineering from Alexandria University, the M.A.Sc. degree from McMaster University, and the Ph.D. degree from the University of Toronto.

    He joined the University of Waterloo, Canada, in 1985, where he is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory in the Department of Electrical and Computer Engineering, and holds a University Research Chair. Professor Kamel held a Canada Research Chair in Cooperative Intelligent Systems from 2001 to 2008.

    Dr. Kamel's research interests are in Computational Intelligence, Pattern Recognition, Machine Learning and Cooperative Intelligent Systems. He has authored and co-authored over 390 papers in journals and conference proceedings, 11 edited volumes, two patents and numerous technical and industrial project reports. Under his supervision, 81 Ph.D. and M.A.Sc. students have completed their degrees.

    He is the Editor-in-Chief of the International Journal of Robotics and Automation and an Associate Editor of the IEEE Transactions on Systems, Man, and Cybernetics, Part A, Pattern Recognition Letters, Cognitive Neurodynamics, and Pattern Recognition. He is also a member of the editorial advisory board of the International Journal of Image and Graphics and of the Intelligent Automation and Soft Computing journal. He has also served as Associate Editor of Simulation, the journal of The Society for Computer Simulation.

    Based on his work at NCR, he received the NCR Inventor Award. He is also a recipient of the Systems Research Foundation Award for outstanding presentation in 1985 and the ISRAM best paper award in 1992. In 1994 he was awarded the IEEE Computer Society Press outstanding referee award. He was also a coauthor of the best paper at the 2000 IEEE Canadian Conference on Electrical and Computer Engineering. Dr. Kamel is twice a recipient of the University of Waterloo Outstanding Performance Award and a recipient of the Faculty of Engineering Distinguished Performance Award. He is a member of ACM and PEO, a Fellow of IEEE, a Fellow of the Engineering Institute of Canada (EIC), a Fellow of the Canadian Academy of Engineering (CAE), and was selected as a Fellow of the International Association for Pattern Recognition (IAPR) in 2008. He has served as a consultant for General Motors, NCR, IBM, Northern Telecom, and Spar Aerospace. He is a co-founder of Virtek Vision Inc. of Waterloo and chair of its Technology Advisory Group, and he served as a member of its board from 1992 to 2008 and as VP of research and development from 1987 to 1992.

    Fakhreddine Karray (S'89, M'90, SM'01) received the Ing. Dipl. degree in Electrical Engineering from the University of Tunis, Tunisia, in 1984, and the Ph.D. degree from the University of Illinois at Urbana-Champaign, USA, in 1989. He is Professor of Electrical and Computer Engineering at the University of Waterloo and Associate Director of the Pattern Analysis and Machine Intelligence Laboratory. Dr. Karray's current research interests are in the areas of autonomous systems and intelligent man–machine interfacing design. He has authored more than 200 articles in journals and conference proceedings. He is the co-author of 13 patents and of a recent textbook on soft computing, Soft Computing and Intelligent Systems Design, Addison Wesley, 2004. He serves as an Associate Editor of the IEEE Transactions on Mechatronics, the IEEE Transactions on Systems, Man, and Cybernetics (B), the International Journal of Robotics and Automation, and the Journal of Control and Intelligent Systems, and as Associate Editor of the IEEE Control Systems Society's conference proceedings. He has served as chair or co-chair of more than eight international conferences and was the General Co-Chair of the IEEE Conference on Logistics and Automation, China, 2008. Dr. Karray is the KW Chapter Chair of the IEEE Control Systems Society and the IEEE Computational Intelligence Society. He is a co-founder of Intelligent Mechatronics Systems Inc. and of Voice Enabling Systems Technology Inc.
