Speech Communication

Volume 41, Issue 4, November 2003, Pages 603-623

Speech emotion recognition using hidden Markov models

https://doi.org/10.1016/S0167-6393(03)00099-2

Abstract

In emotion classification of speech signals, the popular features employed are statistics of fundamental frequency, energy contour, duration of silence and voice quality. However, the performance of systems employing these features degrades substantially when more than two categories of emotion are to be classified. In this paper, a text-independent method of emotion classification of speech is proposed. The proposed method uses short-time log frequency power coefficients (LFPC) to represent the speech signal and a discrete hidden Markov model (HMM) as the classifier. The emotions are classified into six categories, labelled with the archetypal emotions of Anger, Disgust, Fear, Joy, Sadness and Surprise. A database consisting of 60 emotional utterances from each of twelve speakers is constructed and used to train and test the proposed system. The performance of the LFPC feature parameters is compared with that of the linear prediction cepstral coefficients (LPCC) and mel-frequency cepstral coefficients (MFCC) commonly used in speech recognition systems. Results show that the proposed system yields an average accuracy of 78% and a best accuracy of 96% in the classification of the six emotions, well above the chance level of about 17% for a set of six categories. Results also reveal that LFPC is a better choice of feature parameters for emotion classification than the traditional feature parameters.

Introduction

There are many motivations for identifying the emotional state of speakers. In human–machine interaction, the machine can be made to produce more appropriate responses if the emotional state of the person can be accurately identified. Most state-of-the-art automatic speech recognition systems resort to natural language understanding to improve the accuracy of recognition of the spoken words. Such language understanding can be further improved if the emotional state of the speaker can be extracted, and this in turn will enhance the accuracy of the system. In general, translation is required to carry out communication across different languages. Current automatic translation algorithms focus mainly on the semantic part of the speech. It would provide the communicating parties with additional useful information if the emotional state of the speaker could also be identified and presented, especially in non-face-to-face situations. Other applications of automatic emotion recognition systems include tutoring, alerting and entertainment (Cowie et al., 2001).

Before delving into the details of automatic emotion recognition, it is appropriate to have some understanding of the psychological, biological, and linguistic aspects of emotion (Cowie et al., 2001; Cornelius, 1996; Oatley and Johnson-Laird, 1995; Plutchik, 1994; Scherer, 1986a; Scherer, 1984; Oatley and Jenkins, 1996; Arnold, 1960; Lazarus, 1991; Fox, 1992; Darwin, 1965; Ekman and Friesen, 1975; Schubiger, 1958; O’Connor and Arnold, 1973; Williams and Stevens, 1981; Cowan, 1936; Fairbanks and Pronovost, 1939; Lynch, 1934; Frick, 1985; Murray and Arnott, 1993; Crystal, 1969; Crystal, 1975; Fonagy, 1978a, Fonagy, 1978b; Fonagy and Magdics, 1963; Davitz, 1964; Williams and Stevens, 1969; Van Bezooijen, 1984; Kotlyar and Mozorov, 1976; Muller, 1960; Oster and Risberg, 1986; McGilloway et al., 1995; Trojan, 1952; Havrdova and Moravek, 1979; Huttar, 1968; Coleman and Williams, 1979; Kaiser, 1962; Scherer, 1986b; Utsuki and Okamura, 1976; Sulc, 1977; Johnson et al., 1986). From the psychological perspective, of particular interest is the cause and effect of emotion (Cornelius, 1996; Oatley and Johnson-Laird, 1995; Plutchik, 1994; Scherer, 1986a; Scherer, 1984; Oatley and Jenkins, 1996; Arnold, 1960; Lazarus, 1991). The activation–evaluation space (Cowie et al., 2001) provides a simple approach to understanding and classifying emotions. In a nutshell, it considers the stimulus that excites the emotion, the cognitive ability of the agent to appraise the nature of the stimulus, and the agent's subsequent mental and physical responses to the stimulus. The mental response takes the form of an emotional state. The physical response takes the form of fight or flight or, as described by Fox (1992), approach or withdrawal. From a biological perspective, Darwin (1965) regarded emotional and physical responses as distinctive action patterns selected by evolution because of their survival value. Thus, emotional arousal affects the heart rate, skin resistivity, temperature, pupillary diameter and muscle activity as the agent prepares for fight or flight. As a result, the emotional state is also manifested in spoken words and facial expressions (Ekman and Friesen, 1975).

Emotional states have a definite temporal structure (Oatley and Jenkins, 1996). For example, people with emotional disorders such as manic depression or pathological anxiety may remain in those emotional states for months or years; one may be in a bad ‘mood’ for weeks or months; and emotions such as Anger and Joy may be transient in nature, lasting no longer than a few minutes. Thus, emotion has both a broad-sense and a narrow-sense effect. The broad sense reflects the underlying long-term emotion, while the narrow sense refers to the short-term excitation of the mind that prompts people to action. In automatic recognition of emotion, a machine does not distinguish whether the emotional state is due to a long-term or a short-term effect, so long as it is reflected in the speech or facial expression.

The output of an automatic emotion recognizer will naturally consist of labels of emotion, so the choice of a suitable set of labels is important. Linguists have a large vocabulary of terms for describing emotional states; Schubiger (1958) and O’Connor and Arnold (1973) used some 300 labels between them in their studies. The ‘palette theory’ (Cowie et al., 2001) suggests that basic categories be identified to serve as primaries, with mixing used to produce other emotions in the same way that primary colours are mixed to produce all other colours. The ‘primary’ emotions that are often used include Joy, Sadness, Fear, Anger, Surprise and Disgust, and they are often referred to as archetypal emotions. Although these archetypal emotions cover a rather small part of emotional life, they nevertheless represent the popularly known emotions and are recommended for testing the capabilities of an automatic recognizer. Cognitive theory would argue against equating emotion recognition with assigning category labels; it would instead seek to recognize the way a person perceives the world, or key aspects of it. It is perhaps true that category labels are not a sufficient representation of emotional state, but they remain a practical way to indicate the output of an automatic emotion recognition system.

It is to be noted that the emotional state of a speaker can be identified from the facial expression (Ekman, 1973; Davis and College, 1975; Scherer and Ekman, 1984), speech (McGilloway et al., 2000; Dellaert et al., 1996; Nicholson et al., 1999), and perhaps brainwaves and other biological features of the speaker. Ultimately, a combination of these features may be the way to achieve high accuracy of recognition. In this paper, the focus is on emotion recognition from speech.

The remainder of this paper is structured as follows. In Section 2, a review of the features relevant to emotion of speech is presented and in Section 3, some of the speech emotion recognizers are discussed. This is followed by a description of the corpus of emotional speech and presentation of results of subjective assessment of the emotional content of the speech. Details of the proposed system are presented in Section 5. Experiments to assess the performance of the proposed system are described in Section 6 together with analysis of the results of the experiments. The concluding remarks are presented in Section 7.

Section snippets

Characteristics of emotional speech

There are two broad types of information in speech. The semantic part of the speech carries linguistic information, insofar as the utterances are made according to the rules of pronunciation of the language. Paralinguistic information, on the other hand, refers to implicit messages such as the emotional state of the speaker. For speech emotion recognition, the identification of the paralinguistic features that represent the emotional state of the speaker is an important first step.

From the

Review of emotional speech classifiers

Although there are a number of systems proposed for emotion recognition based on facial expressions, only a few systems based on speech input are reported in the literature.

ASSESS (McGilloway et al., 2000) is a system that makes use of a few landmarks (peaks and troughs in the profiles of fundamental frequency and intensity, and the boundaries of pauses and fricative bursts) to identify four archetypal emotions, viz. Fear, Anger, Sadness and Joy. Using discriminant analysis to separate samples that
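The ASSESS pipeline itself is not reproduced here, but the following is a minimal sketch of the general landmark-plus-discriminant-analysis idea, assuming a fundamental-frequency contour has already been extracted. The summary statistics chosen and the use of scikit-learn's LinearDiscriminantAnalysis are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: landmark-style features from an F0 contour,
# classified with discriminant analysis. Not the ASSESS implementation.
import numpy as np
from scipy.signal import find_peaks
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def landmark_features(f0):
    """Summarize peaks and troughs of a voiced F0 contour (Hz) as a fixed-length vector."""
    peaks, _ = find_peaks(f0)      # local maxima of the contour
    troughs, _ = find_peaks(-f0)   # local minima of the contour
    peak_vals = f0[peaks] if peaks.size else f0[[np.argmax(f0)]]
    trough_vals = f0[troughs] if troughs.size else f0[[np.argmin(f0)]]
    return np.array([peak_vals.mean(), peak_vals.std(),
                     trough_vals.mean(), trough_vals.std(),
                     peaks.size, troughs.size])

# X: one feature vector per utterance; y: emotion labels such as "Fear", "Anger"
# clf = LinearDiscriminantAnalysis().fit(X, y)
# predicted = clf.predict(X_new)
```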

Emotion corpus

An emotion database is specifically designed and set up for text-independent emotion classification studies. The database includes short utterances covering the six archetypal emotions, namely Anger, Disgust, Fear, Joy, Sadness and Surprise. Non-professional speakers are selected to avoid exaggerated expression. A total of six native Burmese language speakers (three males and three females) and six native Mandarin language speakers (three males and three females) are employed to generate 720 utterances

Overview of the system

As mentioned in Section 2 (Characteristics of emotional speech) and Section 3 (Review of emotional speech classifiers), the classification of emotion based on certain specific features is not clearly defined. Thus, instead of resorting to the measurement of specific features of the speech signal, such as the fundamental frequency contour, to identify the type of emotion, a novel acoustic feature that can distinguish several emotion categories is proposed in this paper.
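The paper's exact filter-bank layout and frame settings for the LFPC feature are not reproduced in this snippet; the sketch below only illustrates the general idea of short-time log frequency power coefficients, with an assumed set of 12 logarithmically spaced bands, a 25 ms frame and a 10 ms hop.

```python
# Minimal sketch of short-time log frequency power coefficient (LFPC) extraction.
# Band count, band edges, frame length and hop size are assumptions for
# illustration; they are not the paper's exact analysis parameters.
import numpy as np

def lfpc(signal, fs, n_bands=12, frame_len=0.025, hop=0.010, f_min=100.0):
    """Return an (n_frames, n_bands) array of log band energies."""
    n_fft = int(round(frame_len * fs))
    step = int(round(hop * fs))
    window = np.hamming(n_fft)
    # Logarithmically spaced band edges from f_min up to the Nyquist frequency.
    edges = np.logspace(np.log10(f_min), np.log10(fs / 2.0), n_bands + 1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, step):
        power = np.abs(np.fft.rfft(signal[start:start + n_fft] * window)) ** 2
        bands = [power[(freqs >= lo) & (freqs < hi)].sum()
                 for lo, hi in zip(edges[:-1], edges[1:])]
        frames.append(np.log(np.asarray(bands) + 1e-10))  # log energy per band
    return np.array(frames)
```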

The block diagram of the proposed system is shown in

Experiments and analysis of results

Experiments were conducted to evaluate the performance of the proposed system. The variation of LFPC with time for utterances associated with different emotions is presented in Appendix A. In addition to the proposed system, experiments using LPCC and MFCC as feature vectors were also conducted for the purpose of comparison.
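For comparison experiments of this kind, MFCC baselines can be obtained with standard tooling. The snippet below is a hedged example using the librosa library; the coefficient count, frame settings and file name are assumptions and need not match the configuration used in the paper.

```python
# Example of extracting MFCC feature vectors for a comparison baseline.
# Uses librosa; the number of coefficients and default frame settings are
# illustrative and may differ from the paper's configuration.
import librosa

y, sr = librosa.load("utterance.wav", sr=None)       # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)   # shape: (12, n_frames)
frames = mfcc.T                                       # one 12-dim vector per frame
```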

First, the performance of the system in classifying all the six basic emotions individually was assessed. The recognition rates of classification using utterances reserved for

Conclusions

In this paper, a system for classification of emotional state of utterances is proposed. The system makes use of short time LFPC for feature representation and a 4-state ergodic HMM as the recognizer.
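As a minimal sketch of how a discrete ergodic HMM classifier of this kind scores an utterance, the code below implements the forward algorithm in the log domain and picks the best-scoring per-emotion model. It is not the authors' implementation: the vector-quantization step that maps LFPC frames to codebook indices, and parameter estimation (e.g. via Baum-Welch), are assumed to have been done already.

```python
# Minimal sketch, not the authors' implementation: score a sequence of
# vector-quantized feature frames (codebook indices) against one discrete
# ergodic HMM per emotion and return the best-scoring label.
import numpy as np

def log_likelihood(obs, log_pi, log_A, log_B):
    """Forward algorithm in the log domain for a discrete HMM.
    obs: sequence of codebook indices; log_pi: (N,) initial state log-probs;
    log_A: (N, N) transition log-probs; log_B: (N, M) emission log-probs."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # log-sum-exp over previous states i for each current state j
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

def classify(obs, models):
    """models: dict mapping an emotion label to (log_pi, log_A, log_B),
    e.g. one 4-state ergodic model per archetypal emotion."""
    scores = {label: log_likelihood(obs, *params) for label, params in models.items()}
    return max(scores, key=scores.get)
```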

Short-time LFPC represents the energy distribution of the signal in different log frequency bands. Spectral analysis shows that the distribution of energy depends on the emotion type, and this serves as a good indication of emotion type. The coefficients also provide important information on the

References (62)

  • Arnold, M.B., 1960. Emotion and Personality. Physiological Aspects, Vol. 2. Columbia University Press, New...
  • Atal, B.S., 1974. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Amer.
  • Becchetti, C., et al., 1998. Speech Recognition Theory and C++ Implementation.
  • Cahn, J.E., 1990. The generation of affect in synthesized speech. J. Amer. Voice I/O Soc.
  • Cairns, D.A., et al., 1994. Nonlinear analysis and classification of speech under stressed conditions. J. Acoust. Soc. Amer.
  • Coleman, R., Williams, R., 1979. Identification of emotional states using perceptual and acoustic analyses. In:...
  • Cornelius, R., 1996. The Science of Emotion.
  • Cowan, M., 1936. Pitch and Intensity Characteristics of Stage of Speech. Arch. Speech, suppl. to Dec....
  • Cowie, R., et al., 2001. Emotion recognition in human–computer interaction. IEEE Sig. Proc. Mag.
  • Crystal, D., 1969. Prosodic Systems and Intonation in English.
  • Crystal, D., 1975. The English Tone of Voice.
  • Darwin, C., 1965. The Expression of Emotions in Man and Animals. John Murray, Ed., 1872. Reprinted by University...
  • Davis, M., et al., 1975. Recognition of Facial Expressions.
  • Davis, S.B., et al., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.
  • De Silva, L.C., et al., 1998. Use of multimodal information in facial emotion recognition. IEICE Trans. Inf. Syst.
  • Dellaert, F., Polzin, T., Waibel, A., 1996. Recognizing Emotion in Speech. Fourth International Conference on Spoken...
  • Deller, J.R., et al., 1993. Discrete-Time Processing of Speech Signals.
  • Ekman, P., 1973. Darwin and Facial Expressions.
  • Ekman, P., et al., 1975. Unmasking the Face.
  • Elias, N.J., 1975. New Statistical Methods for Assigning Device Tolerances. Proc. IEEE Int. Symp. Ccts. Sys., 1975,...
  • Equitz, W.H., 1989. A new vector quantization clustering algorithm. IEEE Trans. Acoust. Speech Signal Process.
  • Fairbanks, G., Pronovost, W., 1939. An experimental study of the pitch characteristics of the voice during the expression of emotion. Speech Monograph.
  • Fonagy, I., 1978a. A new method of investigating the perception of prosodic features. Language and Speech.
  • Fonagy, I., 1978b. In: Sundberg, J. (Ed.), Emotions, Voice and Music in Language and Speech, Vol. 21, pp....
  • Fonagy, I., Magdics, K., 1963. Emotional patterns in intonation and music. Z. Phonet. Sprachwiss. Kommunikationsforsch.
  • Fox, N.A., 1992. If it’s not left it’s right. Amer. Psychol.
  • Frick, R., 1985. Communicating Emotion: The Role of Prosodic Features. Psychol. Bull.
  • Furui, S., 1989. Digital Speech Processing, Synthesis and Recognition.
  • Havrdova, Z., Moravek, M., 1979. Changes of the voice expression during suggestively influenced states of experiencing. Activitas Nervosa Superior.
  • Huttar, G.L., 1968. Relations between prosodic variables and emotions in normal American English utterances. J. Speech Hearing Res.