Abstract
Affective dimensions (e.g. valence, arousal) are continuous real variables, bounded on [−1, +1]. They give insights into people's emotional states. The literature shows that regressing these variables is a complex problem due to their variability. We propose here a two-step process. First, an ensemble of ordinal classifiers predicts the optimal range within [−1, +1] and a discrete estimate of the variable. Then, a regressor trained locally on this range and its neighbors provides a finer continuous estimate. Experiments on audio data from the AVEC’2014 and AV+EC’2015 challenges show that this cascading process compares favorably with state-of-the-art and challenger results.
1 Introduction
Nowadays, vocal emotion recognition has multiple applications in domains as diverse as medicine, telecommunications, and transport [1]. In telecommunications, for example, it would become possible to prioritize calls from individuals in imminent danger over less urgent ones. More generally, emotion recognition improves human/machine interfaces, which explains the rapid growth of research in this field, driven by progress in machine learning.
Human interactions rely on multiple channels: body language, facial expressions, etc. A vocal message carries a lot of information that we interpret implicitly. This information can be expressed or perceived verbally, but also non-verbally, through the tone, volume, or speed of the voice. The automatic analysis of such information gives insights into the speaker's emotional state.
The conceptualization of emotions is still a hot topic in psychology. Opinions do not converge towards a unique model. Three main approaches can be distinguished [9]: (1) the basic emotions (anger, disgust, fear, happiness, sadness, surprise) described by Ekman [6], (2) the circumplex model of affect, and (3) the appraisal theory. In the second model, the affective state is generally described by at least two dimensions: the valence, which determines the positivity of the emotion, and the arousal, which determines its activity [18, 23]. These two values, bounded on [−1, +1], describe the emotional state of an individual much more precisely than the basic emotions. However, it has been shown that other dimensions are necessary to report this state more accurately during an interaction [8].
The choice of one model or the other constrains the kind of machine learning algorithms used to estimate the emotional state. With basic emotions, the variable to be predicted is qualitative and nominal, so classification methods must be used. Affective dimensions, on the contrary, are quantitative, continuous, bounded variables, so a regression predictor is needed. To take advantage of the best of both worlds, we propose in this study a method that combines classification and regression. To predict a continuous, bounded variable, we first quantize the affect variable into bounded ranges. For example, a five-range valence quantization would give the boundaries {−1, −0.5, −0.2, +0.2, +0.5, +1}, which could be interpreted as “very negative”, “negative”, “neutral”, “positive”, and “very positive”. Then, we proceed in three steps:
- Train an ensemble of classifiers to estimate whether the affect variable associated with an observation is higher than a given boundary;
- Combine the ensemble decisions to predict the optimal range;
- Regress the affect variable locally on this range.
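The quantization step underlying this process can be sketched as follows. The five-range boundaries are the example given in the text; the helper name `range_index` is hypothetical.

```python
import numpy as np

# Example 5-range valence quantization from the text; only -1 and +1 are
# imposed by the domain, the inner boundaries are a design choice.
boundaries = np.array([-1.0, -0.5, -0.2, 0.2, 0.5, 1.0])
labels = ["very negative", "negative", "neutral", "positive", "very positive"]

def range_index(y, boundaries):
    """Index i of the range [b_i, b_{i+1}[ containing y (last range closed)."""
    i = int(np.searchsorted(boundaries, y, side="right")) - 1
    return min(max(i, 0), len(boundaries) - 2)

print(labels[range_index(0.15, boundaries)])  # -> neutral
```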
The proposed method is therefore a cascade of ordinal classifiers and local regressors (COCLR). As the state of the art below shows, similar proposals have been made. In this paper, however, we perform a thorough study of the key parameter of this method: the number of ranges to be separated by the ensemble of ordinal classifiers. We show experimentally that:
- on small, numerous ranges, ordinal classification performs well;
- on large ranges, the COCLR cascade performs better;
- on challenging databases (AVEC’2014 [23] and AV+EC’2015 [17], described in Sect. 4), the COCLR cascade compares favorably with challengers' and winners' proposals, at an acceptable development and computational cost.
This paper is organized as follows. Section 2 focuses on the state of the art in affect prediction on audio data. In Sect. 3, we present the COCLR flowchart. In Sect. 4, we introduce the datasets used to train and evaluate our system and the different pre-processing steps applied. Then, in Sect. 5, we present and discuss our results. Finally, Sect. 6 offers some conclusions.
2 State of the Art
The Audio-Visual Emotion recognition Challenges (AVEC), held every year since 2011, make it possible to assess proposed systems on common datasets. The main objective of these challenges is to ensure a fair comparison between research teams by using the same data. In particular, the unlabeled test set is released to registered participants a few days before the challenge deadline. Moreover, the organizers provide competitors with a set of audio and video descriptors extracted by approved methods.
Prosodic features such as pitch, intensity, speech rate, and voice quality are important for identifying the different types of emotions. Low-level acoustic descriptors like energy, spectrum, cepstral coefficients, formants, etc. enable an accurate description of the signal [23]. Furthermore, it has recently been shown that the features learned by the first layers of deep convolutional networks are quite similar to some acoustic descriptors [22].
2.1 Emotion Classification and Prediction
Emotion classification is typically done with classical methods like support vector machines (SVM) [2], Gaussian mixture models (GMM) [20], or random forests (RF) [15]. For regression tasks, numerous models have been proposed: support vector regressors (SVR) [5], deep belief networks (DBN) [13], bidirectional long short-term memory networks (BLSTM) [14], etc. As all these models have their own pros and cons, recent works focus on model combinations to improve overall accuracy. Thus, in [11], the authors propose to associate a BLSTM and an SVR to benefit from the BLSTM's treatment of the past/present context and the SVR's generalization ability.
The AV+EC’2015 challenge winners proposed in [12] a hierarchy of BLSTMs. They deal with four information channels: audio, video (described by frame-by-frame geometric features and temporal appearance features), electrocardiogram, and electrodermal activity. They combine the predictions of single-modal deep BLSTMs with a multimodal deep BLSTM that performs the final affect prediction.
2.2 Ordinal Classification and Hierarchical Prediction
The standard approach to ordinal classification converts the class value into a numeric quantity, applies a regression learner to the transformed data, and translates the output back into a discrete class value in a post-processing step [7]. Here, we work directly on the numerical values of the affect variables but quantize them into several ranges. Recently, a discrete classification of continuous affective variables through generative adversarial networks (GAN), with five ranges, has been proposed [3].
The idea of combining regressors and classifiers has already been applied to age estimation from images. In [10], a first “global” regression is done with an SVR on all ages; it is then refined by locally adjusting the regressed age value using an SVM. In [21], the authors propose another hierarchy on the same problem. They define three age ranges (namely “child”, “teen”, and “adult”). An image is classified by combining the results of a pool of classifiers (SVC, FLD, PLS, NN, and naïve Bayes) with a majority rule. A second stage then uses the appropriate relevance vector machine regression model (trained on one age range) to estimate the age.
The idea of such a hierarchy is not new, but its application to affect data has not been proposed yet. Moreover, we show in the following experiments that the number of boundaries impacts the performance of the whole hierarchy.
3 Cascade of Ordinal Classifiers and Local Regressors
The cascade of ordinal classifiers and local regressors proposed here is a hybrid combination of classification and regression systems. Let us note X the observation (feature vector), y the affective variable to be predicted (valence or arousal), and \( \widehat{y} \) the prediction. The variable y is continuous and defined on the bounded interval [−1, +1]. It is therefore possible to segment this interval into a set of n smaller sub-intervals, called “ranges” in the following, delimited by the boundaries bi and bi+1 with i ∈ {1, …, n + 1}. For example, n = 2 defines two ranges, [−1, 0[ (“negative”) and [0, +1] (“positive”), and three boundaries bi ∈ {−1, 0, +1}. Each boundary bi (except −1 and +1) defines a binary classification problem: given the observation X, is the prediction \( \widehat{y} \) lower or higher than bi? By combining the outputs of the (n − 1) binary classifiers, we get an ordinal classification: given the observation X, the prediction \( \widehat{y} \) is probably (necessarily, in case of perfect classification) located within the range [bi, bi+1[. Once this range is obtained, a local regression is run on it, along with its direct neighbors, to predict y. Figure 1 illustrates the full cascade. The structure of this system is modular and compatible with any kind of classification and regression algorithm. Moreover, it is generic and may be adapted to subjects other than affective dimension prediction.
3.1 Ordinal Classification
The regression of an affect value y on an observation X can be bounded by the minimum and maximum values this variable might take. The interval on which y is defined, I = [min(y), max(y)], can be divided into n ranges.
The first stage of the cascade is an ensemble of (n − 1) binary classifiers. Each classifier decides whether, given the observation X, the variable to be predicted is higher than the lower boundary bi of a range. Training samples are labeled −1 if their y value is lower than bi and +1 otherwise. Given the sorted nature of the boundaries bi, we build here an ensemble of ordinal classifiers [7].
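A minimal sketch of this first stage, using scikit-learn random forests (the classifier family retained in Sect. 5.2); the function name and hyper-parameters are illustrative, not the authors' exact settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_ordinal_ensemble(X, y, boundaries, **rf_kwargs):
    """Train one binary classifier per inner boundary b_i:
    samples are labeled +1 if y >= b_i and -1 otherwise."""
    ensemble = []
    for b in boundaries[1:-1]:                    # skip the outer bounds
        targets = np.where(y >= b, 1, -1)
        ensemble.append(RandomForestClassifier(**rf_kwargs).fit(X, targets))
    return ensemble

# Synthetic illustration: n = 4 ranges -> 3 binary classifiers
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = np.tanh(X[:, 0])                              # fake affect values in ]-1, 1[
boundaries = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
ensemble = train_ordinal_ensemble(X, y, boundaries,
                                  n_estimators=50, random_state=0)
print(len(ensemble))  # -> 3
```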
We combine the decisions of these classifiers to compute the lower and upper bounds of the optimal range [bi, bi+1[. Consider an observation X with y = 0.15. Suppose the number of ranges is n = 6, with linearly distributed boundaries [−1.0, −0.5, −0.25, 0, 0.25, 0.5, 1.0]. In case of perfect classification, the output vector of the ensemble of (n − 1) = 5 classifiers on the inner boundaries would be {1, 1, 1, −1, −1}, where −1 means “y is lower than bi” and +1 means “y is higher than bi”. Then bi is the boundary associated with the last classifier with a positive output and bi+1 with the first classifier with a negative output. By combining the local decisions of these binary classifiers, we get the (optimal) range [bi, bi+1[; in the example, this range is [0, 0.25[. This range Ci will be used in the second stage to locally predict y. However, indecision between two classifiers can happen [16]; it will be handled by the second stage of the cascade.
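One simple combination rule, consistent with the perfect-classification case above, is to count the positive outputs. This is a sketch: the paper's rule picks the last positive classifier, which coincides with counting when the output vector is monotone.

```python
import numpy as np

boundaries = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]   # n = 6 ranges

def predict_range(decisions):
    """Index of the predicted range [b_i, b_{i+1}[: the number of inner
    boundaries the sample is predicted to exceed."""
    return int(np.sum(np.asarray(decisions) == 1))

decisions = [1, 1, 1, -1, -1]             # perfect outputs for y = 0.15
i = predict_range(decisions)
print(boundaries[i], boundaries[i + 1])   # -> 0.0 0.25
```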
The performance measure of the ordinal classifiers, the accuracy, is directly linked to the definition of the ranges. The choice of the number of ranges n is a key parameter of our system and can be seen as a hyper-parameter. The n ranges and their boundaries bi can be defined in several ways. If they are linearly distributed, they define a kind of affective scale, as in [3]. But the boundaries bi could also be chosen to prevent strong imbalances between classes. In case of highly imbalanced classes, the application of a data augmentation method is strongly recommended [4].
At this point, we can evaluate the accuracy (range detection rate) of the classifier combination. The combination can also be used to compute a discrete estimate of y, by taking the center of the predicted range as the value ŷ. Finally, we can estimate the correlation of ŷ with the ground truth y.
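The discrete estimate and its correlation with the ground truth can be computed as below; the predicted range indices and ground-truth values here are made up purely for illustration.

```python
import numpy as np

boundaries = np.array([-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0])
centers = (boundaries[:-1] + boundaries[1:]) / 2   # one center per range

# Hypothetical range predictions and ground truth, for illustration only
pred_idx = np.array([3, 2, 4, 3, 1])
y_true   = np.array([0.15, -0.10, 0.30, 0.05, -0.40])

y_hat = centers[pred_idx]                  # discrete estimate of y
pc = np.corrcoef(y_hat, y_true)[0, 1]      # Pearson's correlation
print(pc)
```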
3.2 Local Regression
The aim of the second stage of the cascade is to compute the continuous value of y. Each range i is associated with a regressor Ri that locally regresses y on [bi, bi+1]; each regressor is thus specialized in a specific range. However, as explained previously, indecisions between nearby classes during the ordinal classification may induce an improper prediction of the range. As a result, the wrong regressor can be activated, causing a drop in correlation. The analysis of the first-stage results, illustrated by the confusion matrix (Fig. 2), indicates that prediction mistakes fall close to, or adjacent to, the optimal range to which y belongs. Thus, we can expand the regression range to [bi−1, bi+2], when these boundaries exist.
Widening the local regression ranges helps solve the indecision issue between nearby boundaries. Moreover, it frees us from having to strongly optimize the first stage: using a perfect classifier instead of one that reaches 90% accuracy on the first stage would not have a significant impact on the result of the whole cascade.
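The second stage with widened training ranges can be sketched as follows, again with scikit-learn random forests as in Sect. 5.2 (hyper-parameters and data are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_local_regressors(X, y, boundaries, **rf_kwargs):
    """One regressor per range i, trained on the widened interval
    [b_{i-1}, b_{i+2}] (clipped at the outer bounds), so that a one-range
    classification mistake still activates a competent regressor."""
    n = len(boundaries) - 1
    regressors = []
    for i in range(n):
        lo = boundaries[max(i - 1, 0)]
        hi = boundaries[min(i + 2, n)]
        mask = (y >= lo) & (y <= hi)
        regressors.append(RandomForestRegressor(**rf_kwargs).fit(X[mask], y[mask]))
    return regressors

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = np.tanh(X[:, 0])                          # fake affect values in ]-1, 1[
boundaries = np.array([-1.0, -0.3, 0.0, 0.3, 1.0])
regressors = train_local_regressors(X, y, boundaries,
                                    n_estimators=30, random_state=0)
print(len(regressors))  # -> 4
```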
4 Databases
4.1 AVEC’2014
The AVEC’2014 database is a collection of audio and video recordings of human/machine interactions [23]. It is composed of 150 recordings from 84 German subjects, each recording containing the reactions of a single person. The age of the subjects varies between 18 and 63 years. To create this dataset, some of the subjects were recorded several times, with a two-week break between recording sessions. The distribution of the recordings is as follows: 18 subjects were recorded three times, 31 were recorded twice, and the remaining 34 were recorded only once. In these recordings, the subjects had to perform two tasks:
- NORTHWIND – the participants read aloud an excerpt of “Die Sonne und der Wind” (The North Wind and the Sun);
- FREEFORM – the participants answer open questions such as: “What is your favorite meal?”, “Tell us a story from your childhood.”
The recordings are split into three partitions (training, validation, and test sets), in which the 150 Freeform-Northwind pairs are equally distributed. The low-level descriptors are described in Table 1.
4.2 AV+EC’2015/RECOLA
The second dataset we used to measure the performance of our system comes from the AV+EC’2015 affect recognition challenge [17], which relies on the RECOLA database. RECOLA comprises 9.5 h of audio, video, and physiological recordings (ECG, EDA) from 46 French-speaking participants of different origins (Italian, German, and French) and genders. The AV+EC’2015 challenge relies on a subset of 27 fully labeled recordings. In our case, we only used the audio recordings and only worked on the valence, which is known to be the most complex dimension to predict.
The training, development, and testing partitions contain 9 recordings each, preserving the diversity of origins and genders of the subjects. The audio features used are described in the AV+EC’2015 presentation paper [17].
4.3 Data Augmentation
Studying the valence on a bounded interval allows the identification of several intensity levels of the felt emotion: depending on its value, it can be qualified as very negative, negative, neutral, positive, or very positive. However, in the AVEC’2014 and AV+EC’2015/RECOLA databases, these intensity levels are not equally represented. Figure 3 shows a clear class imbalance, favoring observations corresponding to a neutral valence (between −0.1 and 0.1). Since some systems poorly handle strong class imbalances [19], we augmented the data using the Synthetic Minority Over-sampling Technique (SMOTE) [4].
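The SMOTE idea can be conveyed with a minimal numpy sketch that interpolates each minority sample with one of its nearest neighbors; the experiments use the original algorithm of [4], and this simplified version only illustrates the mechanism.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Synthesize n_new minority-class samples by interpolating each base
    sample with one of its k nearest neighbors (simplified SMOTE [4])."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                     # exclude self-neighbors
    nn = np.argsort(d, axis=1)[:, :k]               # k nearest neighbors
    base = rng.integers(0, len(X_min), size=n_new)  # random base samples
    nbr = nn[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.random((n_new, 1))                    # interpolation factor
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

X_min = np.random.default_rng(2).normal(size=(20, 3))
synthetic = smote_oversample(X_min, 15, k=3, seed=0)
print(synthetic.shape)  # -> (15, 3)
```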
5 Experimental Results
5.1 Performance Metrics
The cascade performances are directly linked to those of its two stages. The performance of the ensemble of ordinal classifiers is measured by the accuracy, i.e. the ratio of examples for which the range has been correctly predicted. We use the confusion matrix to analyze the behavior of this system more precisely.
The performance of the ensemble of local regressors is measured using Pearson's correlation (PC), the gold-standard metric of the AVEC’2014 challenge [23] on which we base our study. However, as these data are not normally distributed, we also measure the performance of our system with Spearman's correlation (SC) and the concordance correlation coefficient (CCC).
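For reference, the concordance correlation coefficient can be computed directly from its definition (Pearson's and Spearman's correlations are readily available in numpy/scipy):

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mx, my = np.mean(x), np.mean(y)
    cov = np.mean((x - mx) * (y - my))
    return 2 * cov / (np.var(x) + np.var(y) + (mx - my) ** 2)

a = np.array([0.1, -0.2, 0.4, 0.0])
print(ccc(a, a))  # -> 1.0 (perfect concordance)
```

Unlike plain Pearson correlation, the CCC penalizes scale and location shifts between predictions and ground truth, which is why it complements PC here.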
The experimental results presented in the following are computed on the development/validation sets of the different databases. Due to the temporal nature of audio data, we also analyze the outputs of both stages on complete sequences and apply temporal smoothing to refine the results.
5.2 Preliminary Results
As previously stated, our architecture is modular and accommodates any kind of classification or regression method. In our experiments, we used support vector machines (C-SVM with RBF kernels) and random forests (RF with 300 decision trees (see Footnote 1), attribute bagging on \( \sqrt{n_{features}} \) attributes) as classifiers. Table 2 presents the ordinal classification rates obtained by these two systems on the development sets of AVEC’2014 and AV+EC’2015 for the prediction of valence. We chose this affect variable because it is known to be particularly hard to predict (see the baseline results in row 1). By taking the centers of the predicted ranges as values of ŷ, we computed the correlations of these two systems. These correlations enable a comparison of our classifier ensemble with a unique “global” random forest regressor dealing with the whole interval [−1, +1].
The results obtained on both databases led us to continue with random forests rather than support vector machines: the random forest results are significantly better than the SVM ones, independently of the choice of sub-intervals. For the same reason, we also use random forests to perform the local regression.
5.3 Results on AVEC’2014
Table 3 compares the performances of the different systems on the AVEC’2014 development set for several numbers of ranges n.
First, the interval I was split into 10 ranges: [−1.0, −0.4, −0.3, −0.2, −0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 1.0]. The best-performing system in terms of correlation is, without a doubt, the ordinal classifier ensemble, where the predicted values are the centers of the predicted ranges. It is also relevant to point out that, despite the very high correlation of the local regressors alone, the COCLR system does not seem efficient here.
Then, the interval I was split into 6 ranges: [−1.0, −0.4, −0.2, 0.0, 0.2, 0.4, 1.0].
The best-performing system, as far as the correlation is concerned, is still the ordinal classifier ensemble. However, the performance gap between the COCLR and the ordinal classifier ensemble has tightened. It is also noteworthy that the accuracy of the classification system has risen while the correlation of the local regressors alone has slightly dropped.
Finally, the interval I was split into n = 4 ranges: [−1.0, −0.3, 0.0, 0.3, 1.0]. The previous conclusions on ordinal classifiers and local regressors remain unchanged, but this time the COCLR cascade turned out to be significantly the most efficient system: its correlation is the highest obtained for any choice of intervals. These results highlight the importance of the number of ranges on which the COCLR system stands. It also appears that the correlation of the local regressors alone decreases as the size of the ranges increases, contrary to the accuracy of the classification system.
5.4 Results on AV+EC’2015/RECOLA
As previously, we measured the performances of our system for different sub-intervals. The affect value varies within [−0.3, 1.0], so we discard the classifiers and regressors trained on ]−1.0, −0.3[. In our tests, we used three groups of sub-intervals. The largest, composed of 8 ranges, is [−0.3, −0.2, −0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 1.0]; the second, composed of 5 ranges, is [−0.3, −0.1, 0.0, 0.1, 0.3, 1.0]; the last, composed of 3 ranges, is [−0.3, 0.0, 0.3, 1.0]. Table 4 summarizes these results.
The results on the RECOLA database are similar to those obtained on AVEC’2014: the best-performing system remains the COCLR when a small number of ranges is chosen. The correlation obtained by the cascade of ordinal classifiers and local regressors for the valence on the development set reaches 0.67. As previously, we observe a decline of the local regressors' correlation and a rise of the first-stage accuracy as the size of the sub-intervals increases. Comparisons with the challenge winners' results [12] are encouraging: though our cascade gets lower results (0.675) than their multimodal system (0.725), it gets better results than those obtained on the audio channel only (0.529). The latter are similar to those of the first-stage ordinal classifier (0.521).
Last but not least, our proposal is fast to train (<10 min for 3 ranges) and evaluate (<0.1 ms) on an Intel Core i7 (8 cores, 3.4 GHz) and does not require a large amount of memory (<1 GB for 3 ranges).
5.5 Temporal Smoothing
As previously stated, AVEC’2014 and AV+EC’2015 are based on audio recordings; the observations are therefore temporally linked. Because the systems we trained do not take this characteristic into account, we analyzed our results and the ground truth over time. Figure 4 presents the ground-truth valence and the valence predicted by the ordinal classification system over time, on a sequence of the development set. Notably, the system's errors are mostly punctual: it rarely fails over a sufficiently wide time window. By applying a temporal smoothing operation with a sliding window of size 5, we were able to increase the performance of our system, as shown in Fig. 5.
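Such temporal smoothing can be implemented as a centered moving average; window size 5 follows the text, while the edge handling by border padding is our assumption.

```python
import numpy as np

def smooth(y_pred, w=5):
    """Centered moving average over a sliding window of size w;
    edges are handled by repeating the border values."""
    padded = np.pad(np.asarray(y_pred, float), w // 2, mode="edge")
    return np.convolve(padded, np.ones(w) / w, mode="valid")

# One punctual error in an otherwise stable prediction sequence
y_pred = np.array([0.1] * 4 + [0.9] + [0.1] * 4)
sm = smooth(y_pred)
print(sm)
```

The isolated spike is averaged down towards its neighbors, which is exactly why punctual errors of the first stage are attenuated.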
6 Conclusions
We proposed in this article an original approach for the regression of a continuous, bounded variable, based on a cascade of ordinal classifiers and local regressors, and applied it to the estimation of affective variables such as the valence. The first stage predicts a trend, depending on the chosen intervals: with four intervals, for example, the emotional state of a person is qualified as very negative, negative, positive, or very positive. We observed that this trend is estimated more accurately as the number of intervals increases. The second stage enables a finer prediction of the variable by regressing locally, on the predicted interval and its direct neighbors. It is even more efficient when the number of intervals is low, since this reduces the influence of the first stage on the prediction. Finally, we showed that the performances of this cascade compare favorably with those of the winner of the AV+EC’2015 challenge.
Despite these satisfying results, there is still room for improvement (beyond applying the method to the prediction of arousal and completing the ongoing assessment on the challenge test data). The COCLR is a cascade whose first stage is an ensemble of classifiers; its decision is currently limited by the weakest classifier, and a more suitable combination rule would advantageously impact the global performance. The outputs (binary or probabilistic) of the ordinal classifiers might also enrich the descriptors used by the local regressors.
To conclude, this research introduces a cascading architecture that obtains promising results on challenging datasets. Several hypotheses have been formulated concerning the impact of the different parameters involved, but none of them has been generalized yet. Testing this architecture on other datasets would help us validate these hypotheses and justify the general interest of this proposal.
Notes
1. A sensitivity analysis on the number of decision trees is presented in Fig. 5.
References
Basu, S., Chakraborty, J., Bag, A., Aftabuddin, M.: A review on emotion recognition using speech. In: International Conference on Inventive Communication and Computational Technologies (ICICCT), pp. 109–114 (2017)
Bitouk, D., Verma, R., Nenkova, A.: Class-level spectral features for emotion recognition. Speech Commun. 52(7), 613–625 (2010)
Chang, J., Scherer, S.: Learning representations of emotional speech with deep convolutional generative adversarial networks. In: ICASSP, pp. 2746–2750 (2017)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, P.W.: SMOTE: synthetic minority oversampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Drucker, H., Burges, C.J., Kaufman, L., Smola, A.J., Vapnik, V.: Support vector regression machines. In: Advances in Neural Information Processing Systems, pp. 155–161 (1997)
Ekman, P.: Basic emotions. In: Handbook of Cognition and Emotion, pp. 45–60. Wiley (1999)
Frank, E., Hall, M.: A simple approach to ordinal classification. In: De Raedt, L., Flach, P. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, pp. 145–156. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44795-4_13
Fontaine, J.R., Scherer, K.R., Roesch, E.B., Ellsworth, P.C.: The world of emotions is not two-dimensional. Psychol. Sci. 18(12), 1050–1057 (2007)
Grandjean, D., Sander, D., Scherer, K.R.: Conscious emotional experience emerges as a function of multilevel, appraisal-driven response synchronization. Conscious. Cognit. 17(2), 484–495 (2008)
Guo, G., Fu, Y., Huang, T.S., Dyer, C.R.: Locally adjusted robust regression for human age estimation. In: WACV (2008)
Han, J., Zhang, Z., Ringeval, F., Schuller, B.: Prediction-based learning for continuous emotion recognition in speech. In: ICASSP, pp. 5005–5009 (2017)
He, L., Jiang, D., Yang, L., Pei, E., Wu, P., Sahli, H.: Multimodal affective dimension prediction using deep bidirectional long short-term memory recurrent neural networks. In: AVEC, pp. 73–80 (2015)
Hinton, G.E., Osindero, S., Teh, Y.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
Nicolaou, M.A., Gunes, H., Pantic, M.: Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Trans. Affect. Comput. 2(2), 92–105 (2011)
Noroozi, F., Sapinski, T., Kaminska, D., Anbarjafari, G.: Vocal-based emotion recognition using random forests and decision tree. Int. J. Speech Technol. 20(2), 239–246 (2017)
Qiao, X.: Noncrossing Ordinal Classification. arXiv:1505.03442 (2015)
Ringeval, F., et al.: AV+EC 2015: the first affect recognition challenge bridging across audio, video, and physiological data. In: AVEC, pp. 3–8 (2015)
Russell, J.: A circumplex model of affect. J. Pers. Soc. Psychol. 39(6), 1161–1178 (1980)
Saranya, R., Yamini, C.: Survey on ensemble alpha tree for imbalance classification problem
Sethu, V., Ambikairajah, E., Epps, J.: Empirical mode decomposition based weighted frequency feature for speech-based emotion classification. In: ICASSP, pp. 5017–5020 (2008)
Thukral, P., Mitra, K., Chellappa, R.: A hierarchical approach for human age estimation. In: ICASSP, pp. 1529 –1532 (2012)
Trigeorgis, G., et al.: Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In: ICASSP, pp. 5200–5204 (2016)
Valstar, M.F., et al.: AVEC 2014: 3D dimensional affect and depression recognition challenge. In: AVEC (2014)
© 2018 Springer Nature Switzerland AG

Sazadaly, M., Pinchon, P., Fagot, A., Prevost, L., Maumy-Bertrand, M. (2018). Cascade of Ordinal Classification and Local Regression for Audio-Based Affect Estimation. In: Pancioni, L., Schwenker, F., Trentin, E. (eds) Artificial Neural Networks in Pattern Recognition. ANNPR 2018. Lecture Notes in Computer Science, vol 11081. Springer, Cham. https://doi.org/10.1007/978-3-319-99978-4_21