Shape-based modeling of the fundamental frequency contour for emotion detection in speech
Introduction
Emotional understanding is a crucial skill in human communication. It plays an important role not only in interpersonal interactions, but also in many cognitive activities such as rational decision making, perception and learning (Picard, 1997). For this reason, modeling and recognizing emotions is essential in the design and implementation of human-machine interfaces (HMIs) that are more in tune with the user's needs. Systems that are aware of the user's emotional state will open new avenues in security and defense (e.g., threat detection), health informatics (e.g., depression, autism), and education (e.g., tutoring systems) (Burleson and Picard, 2004; Langenecker et al., 2005). Given the important role of speech in the expression of emotions, an increasing number of publications have reported progress in automatic emotion recognition and detection using acoustic features. Comprehensive reviews are given by Cowie et al. (2001), Zeng et al. (2009), Schuller et al. (2011a), Koolagudi and Rao (2012), and El Ayadi et al. (2011).
The dominant approach in emotion recognition from speech consists of estimating global statistics or functionals at the sentence level from low-level descriptors such as F0, energy and Mel-frequency cepstral coefficients (MFCCs) (Schuller et al., 2011a). Among prosody-based features, gross pitch statistics such as mean, maximum, minimum and range are considered the most emotionally prominent parameters (Busso et al., 2009). One limitation of global statistics is the implicit assumption that every frame in the sentence is equally important. Studies have shown that emotional information is not uniformly distributed in time (Lee et al., 2004; Busso and Narayanan, 2007). For example, the intonation of happy speech tends to rise at the end of the sentence (Wang et al., 2005). Since the statistics are computed at the global level, it is not possible to identify locally salient segments or focal points within the sentence. Furthermore, features describing global statistics do not capture local variations (e.g., in F0 contours), which could provide useful information for emotion detection. In this context, this paper proposes a novel shape-based approach to detect emotionally salient temporal segments in speech using functional data analysis (FDA). The detection of localized emotional segments can shift current approaches in affective computing: instead of recognizing the emotional content of pre-segmented sentences, the problem can be formulated as a detection task, which is appealing from an application perspective (e.g., continuous assessment of unsegmented recordings). An emotion recognition system can also be made more robust by weighting each frame according to its emotional saliency. From a speech production viewpoint, the approach can shed light on the underlying interplay between the lexical and affective layers of human communication across various acoustic features (Busso and Narayanan, 2007).
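As a concrete illustration of the sentence-level statistics described above, the sketch below pools the voiced frames of a per-frame F0 contour and computes the gross pitch statistics (mean, maximum, minimum, range). The function name and the voicing threshold are illustrative, not taken from any specific system.

```python
import numpy as np

def f0_global_stats(f0, voiced_floor=50.0):
    """Sentence-level F0 statistics over voiced frames only.

    f0: array of per-frame F0 estimates in Hz. Frames at or below
    `voiced_floor` (e.g., F0 = 0 for unvoiced frames) are excluded,
    mirroring the common practice of pooling voiced frames.
    """
    voiced = f0[f0 > voiced_floor]
    return {
        "mean": float(np.mean(voiced)),
        "max": float(np.max(voiced)),
        "min": float(np.min(voiced)),
        "range": float(np.max(voiced) - np.min(voiced)),
        "std": float(np.std(voiced)),
    }

# Toy contour: rising intonation with unvoiced gaps (F0 = 0)
f0 = np.array([0.0, 120.0, 130.0, 0.0, 150.0, 180.0, 0.0])
stats = f0_global_stats(f0)
```

Note that the resulting feature vector says nothing about *where* in the sentence the maximum occurred, which is exactly the limitation the shape-based approach addresses.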
This study focuses on detecting emotionally salient temporal segments in the fundamental frequency. Patterson and Ladd (1999) argued that the range (i.e., the difference between the maximum and the minimum of the F0 contour in a sentence or utterance) gives no information about the distribution of F0, so valuable emotional information is neglected. Likewise, according to Lieberman and Michaels (1962), small variations in F0 can be perceptually relevant in the identification of emotions. Several studies have attempted to model the shape of the F0 contour. Paeschke and Sendlmeier (2000) analyzed the rising and falling movements of F0 within accents in affective speech. The study incorporated metrics related to accent peaks within a sentence, and the authors found that these metrics show statistically significant differences between emotional classes. Paeschke (2004) modeled the global trend of F0 in emotional speech as the slope of a linear regression, concluding that the global trend can be useful for describing emotions such as boredom and sadness. Rotaru and Litman (2005) employed linear and quadratic regression coefficients, together with the regression error, as features to represent pitch curves. Yang and Campbell (2001) argued that the concavity and convexity of the F0 contour reflect the underlying expressive state. The Tone and Break Indices system (ToBI) is a scheme for labeling prosody that has been widely used for transcribing intonation (Silverman et al., 1992). Liscombe et al. (2003) analyzed affective speech with acoustic features by using ToBI labels to identify the type of nuclear pitch accent, the contour type and the phrase boundaries. Although ToBI provides an interesting framework for describing F0 contours, precise manual labeling is required to generate prosodic transcriptions. Taylor (2000) introduced the Tilt Intonation Model, which represents intonation as a linear sequence of events (e.g., pitch accents or boundaries), each described by a set of parameters. However, an automatic event segmentation algorithm is required to apply this scheme, so it cannot easily be used for emotion recognition or detection tasks.
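The regression-based representation used by Rotaru and Litman (2005) can be illustrated with a short sketch: fit a low-order polynomial to the pitch curve over normalized time and keep the coefficients plus the residual error as features. This is a minimal sketch of the idea, not the authors' implementation.

```python
import numpy as np

def pitch_regression_features(f0, degree=2):
    """Represent a pitch curve by polynomial regression coefficients
    plus the root-mean-square regression error (linear for degree=1,
    quadratic for degree=2)."""
    t = np.linspace(0.0, 1.0, len(f0))  # normalized time axis
    coeffs = np.polyfit(t, f0, degree)  # highest-order coefficient first
    residual = f0 - np.polyval(coeffs, t)
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    return coeffs, rmse

# A purely linear rising contour is captured exactly by a degree-1 fit
f0 = 100.0 + 40.0 * np.linspace(0.0, 1.0, 20)
coeffs, rmse = pitch_regression_features(f0, degree=1)
# coeffs ~ [40.0, 100.0] (slope, intercept); rmse ~ 0
```

The slope coefficient corresponds to the "global trend" feature of Paeschke (2004), while the quadratic term relates to the concavity/convexity cue of Yang and Campbell (2001).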
Despite current efforts to characterize affective speech by modeling the F0 contour, this remains an open problem. The contributions of this paper are: (a) a novel framework to detect emotional modulation based on reference templates that model the F0 contours of neutral speech; (b) an insightful and thorough analysis of neutral references as a method to detect emotion in speech; (c) the generation of reference F0 contour templates with functional data analysis (FDA); and (d) a study of the shortest segmentation unit that can be used for emotion detection. Extensive experiments demonstrate the discriminative power of the FDA-based approach for detecting emotional speech. The results on the SEMAINE database reveal that the approach captures localized emotional information conveyed in short speech segments (0.5 s). These properties make the proposed approach attractive from both research and application perspectives.
Section snippets
Emotional databases
The analysis and results presented in Sections 3 and 4 require recordings with controlled, lexicon-dependent conditions (e.g., recordings of sentences with the same lexical content conveying different emotional states). Therefore, for these sections, the study considers two emotional databases recorded by actors (Table 1). Even though acted emotions differ from real-life emotional manifestations, they provide a good first approximation, especially when controlled conditions are required, as
Proposed method
Building neutral reference models to contrast emotional speech is an appealing method. The scheme significantly reduces the dependency on emotional (acted or spontaneous) speech databases, which in turn are much more difficult to obtain than ordinary corpora. This section describes the proposed neutral reference models built with FDA.
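The neutral reference models are built with FDA, which represents each F0 contour as a smooth function and summarizes the neutral corpus with functional PCA. As a rough, discretized sketch of this idea — assuming contours have already been time-normalized and resampled to a common grid, and with illustrative function names not taken from the paper — a neutral basis and the projection of a new contour onto it can be written as:

```python
import numpy as np

def build_neutral_fpca(contours, n_components=2):
    """Discretized functional PCA over time-normalized neutral F0 contours.

    contours: (n_sentences, n_points) array, one resampled contour per row.
    Returns the mean neutral curve and the principal component curves."""
    X = np.asarray(contours, dtype=float)
    mean_curve = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are orthonormal component curves
    _, _, Vt = np.linalg.svd(X - mean_curve, full_matrices=False)
    return mean_curve, Vt[:n_components]

def project(contour, mean_curve, components):
    """Projection scores of a contour onto the neutral basis; these
    scores serve as features for the downstream detector."""
    return (np.asarray(contour, dtype=float) - mean_curve) @ components.T

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
# Synthetic neutral contours: gentle declination plus a small random tilt
neutral = np.stack([120.0 - 10.0 * t + rng.normal(0, 1) * t
                    for _ in range(30)])
mean_curve, comps = build_neutral_fpca(neutral, n_components=2)
scores = project(120.0 - 10.0 * t + 5.0 * t, mean_curve, comps)
```

The intuition is that emotional speech deviates from the neutral templates, so its projection scores fall outside the range spanned by neutral contours; the actual method smooths contours with a functional basis before the PCA step, which this discretized sketch omits.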
Discriminant analysis
To assess the discriminative power of the functional PCA projections, this section evaluates the approach using lexicon-dependent models (i.e., one functional PCA model for each utterance – Section 4.1) and a lexicon-independent model (i.e., a single functional PCA model for all sentences – Section 4.2). It also evaluates the performance of the approach at the sub-sentence level (Section 4.3). The evaluation considers the EMA and EMO-DB databases separately.
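The discriminant analysis treats the functional PCA projections as a low-dimensional feature vector for a neutral-versus-emotional decision. As a deliberately simplified illustration — the paper's actual classifier may differ — a nearest-centroid detector over two-dimensional projection scores might look like:

```python
import numpy as np

def train_centroids(scores, labels):
    """Nearest-centroid detector on functional PCA scores
    (0 = neutral, 1 = emotional); a simple stand-in classifier."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    return {c: scores[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(score, centroids):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(score - centroids[c]))

# Toy projection scores: neutral sentences cluster near the origin,
# emotional sentences deviate along the leading components
train = np.array([[0.1, 0.0], [0.2, -0.1], [2.0, 1.0], [1.8, 0.9]])
y = np.array([0, 0, 1, 1])
cents = train_centroids(train, y)
pred = predict(np.array([1.9, 1.1]), cents)  # -> 1 (emotional)
```

Any standard discriminative classifier (LDA, SVM, logistic regression) could replace the centroid rule; the point is only that the projection scores, not the raw contour, are the input features.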
The approach is evaluated for
Validation of the approach in a non-acted corpus
The proposed approach is validated with the spontaneous SEMAINE database (McKeown et al., 2010) (see Table 1 and Section 2.1). Instead of assigning an emotional label to each sentence, the subjective evaluations correspond to continuous assessments of the emotional content in real time using Feeltrace (50 values per second). Therefore, this database is ideal for evaluating whether the proposed approach can detect localized emotional information conveyed within the sentence. Previous studies have
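Because Feeltrace annotations arrive continuously at 50 values per second, evaluating detection on short segments requires collapsing the trace into per-segment labels. A hypothetical sketch of one such mapping — the window length matches the 0.5 s segments discussed above, but the averaging and threshold are assumptions, not the paper's exact protocol:

```python
import numpy as np

def segment_labels(trace, rate_hz=50.0, seg_s=0.5, threshold=0.0):
    """Collapse a continuous Feeltrace-style annotation (rate_hz values
    per second) into per-segment binary labels by averaging the trace
    over non-overlapping seg_s windows and thresholding the mean."""
    n = int(round(rate_hz * seg_s))          # samples per segment (25 here)
    usable = len(trace) - len(trace) % n     # drop the incomplete tail
    means = np.asarray(trace[:usable], dtype=float).reshape(-1, n).mean(axis=1)
    return (means > threshold).astype(int)

# 2 s of trace: 1 s below threshold, then 1 s above
trace = np.concatenate([np.full(50, -0.4), np.full(50, 0.6)])
labels = segment_labels(trace)  # -> [0, 0, 1, 1]
```

With labels at this granularity, each 0.5 s segment becomes an independent detection trial, which is how localized emotional information within a sentence can be scored.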
Conclusions
This paper proposed a novel method to detect emotional modulation in F0 contours by using neutral reference models with functional PCA basis functions. The projections into this basis define the features that are used to train an emotion detection system. The approach was evaluated under different conditions. First, we built lexicon-dependent conditions (i.e., one basis per sentence), which achieved accuracies as high as 75.8% in binary emotion classification tasks. This performance is 6.2%
Acknowledgment
This work was funded by the Government of Chile under grants Fondecyt 1100195 and Mecesup FSM0601, and by the US National Science Foundation under grants IIS-1217104 and IIS-1329659.
References (47)
- et al., Automatic intonation assessment for computer aided language learning, Speech Communication (2010)
- et al., Survey on speech emotion recognition: features, classification schemes, and databases, Pattern Recognition (2011)
- et al., Primitives-based evaluation and estimation of emotions in speech, Speech Communication (2007)
- et al., Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge, Speech Communication (2011)
- et al., Segmenting into adequate units for automatic recognition of emotion-related episodes: a speech-based approach, Advances in Human-Computer Interaction (2010)
- et al., Praat, a system for doing phonetics by computer, Technical Report 132 (1996)
- et al., A database of German emotional speech
- et al., Affective agents: sustaining motivation to learn through failure and a state of "stuck"
- et al., Toward effective automatic recognition systems of emotion in speech
- et al., Analysis of emotionally salient aspects of fundamental frequency for emotion detection, IEEE Transactions on Audio, Speech and Language Processing (2009)
- Joint analysis of the emotional fingerprint in the face and speech: a single subject study
- Speech emotion recognition system based on L1 regularized linear regression and decision fusion
- 'FEELTRACE': an instrument for recording perceived emotion in real time
- Emotion recognition in human-computer interaction, IEEE Signal Processing Magazine
- openEAR – introducing the Munich open-source emotion and affect recognition toolkit
- On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues, Journal on Multimodal User Interfaces
- Functional data analysis as a tool for analyzing speech dynamics: a case study on the French word c'était
- Sentence level emotion recognition based on decisions from subsentence segments
- Investigating the use of formant based features for detection of affective dimensions in speech
- Emotion recognition from speech: a review, International Journal of Speech Technology
- Use of support vector learning for chunk identification
- Face emotion perception and executive functioning deficits in depression, Journal of Clinical and Experimental Neuropsychology
- Emotion recognition based on phoneme classes