Pattern Recognition Letters

Volume 128, 1 December 2019, Pages 290-297

Estimation of affective dimensions using CNN-based features of audiovisual data

https://doi.org/10.1016/j.patrec.2019.09.015

Highlights

  • A novel two-stream CNN that uses audiovisual data to estimate affective dimensions.

  • Selection of CNN-based features using the minimum redundancy and maximum relevance rule.

  • Regression of selected CNN-based features outperforms the end-to-end CNN mapping.

  • The proposed model shows robust estimation performance in generalization experiments.

Abstract

Automatic estimation of emotional state has been of great interest as emotion is an important component in user-oriented interactive technologies. This paper investigates the use of feed-forward convolutional neural networks (CNNs) and features extracted from such networks for predicting the dimensions of continuous-level emotional states. In this context, a two-stream CNN architecture, wherein the video and audio data are learned simultaneously, is proposed. End-to-end mapping of audiovisual data to emotional dimensions reveals that the two-stream network performs better than its single-stream counterpart. The representations learned by the CNNs are refined through a minimum redundancy maximum relevance (mRMR) statistical selection method. Then, support vector regression applied to the selected CNN-based features estimates the instantaneous values of the emotional dimensions. The proposed method is trained and tested using the audiovisual conversations of the well-known RECOLA and SEMAINE databases. It is verified experimentally that regression of the CNN-based features outperforms both traditional audiovisual affective features and the end-to-end CNN mapping. Through generalization experiments, it is also observed that the learned representations are robust enough to provide acceptable prediction performance even when the settings of the training and testing datasets differ widely.

Introduction

Humans have an innate ability to track the instantaneous affective state of another person during human-human interaction [21]. Imitating this task, however, is vastly challenging for a machine. In recent years, an increasing demand for the ‘sensitive artificial listener’ has been observed, in order to tailor services to the emotional state of a user. For example, a content provider may wish to show content according to the emotional state of a user so that coherent and cohesive interactions can be carried out with the person using its services. The emotional state of a driver can be tracked to provide safety measures for the driver as well as nearby vehicles and pedestrians. Similar technologies have already been put into use for innovative applications such as emotion-aware advertisement [13]. It is apparent that in the near future, instantaneous estimation of affective states is going to be an integral part of community-driven innovative products [3]. Thus, it is necessary to develop an effective technique for predicting the instantaneous affective states of humans in real life. In this section, the background and related works are first reviewed briefly. Then the scope of the area and the specific contributions of this work are presented.

The interdisciplinary field of affective content analysis traditionally relies on psychological studies, but mostly on computational intelligence for automation. In order to quantify emotional state, psychologists have employed two approaches: categorical and dimensional [28]. In the categorical approach, the model of emotion is divided into six categories of expression, namely happiness, sadness, anger, disgust, fear, and surprise [7]. However, quantification of emotional states into discrete categories may not be sufficient to reflect the complexity of internal feelings, where multiple states may occur simultaneously. In such a context, modeling continuous-valued emotional states as degrees along different emotional dimensions is a viable way of quantifying subtle and context-specific emotions. As a result, affective computation is increasingly moving towards the dimensional approach, which quantifies emotional states in three dimensions: arousal, valence, and dominance. Arousal measures the activation level of the emotion, valence represents the degree of pleasure, and dominance reflects the controlling nature. In most cases, the first two dimensions suffice for representing an emotional state, and dominance can be ignored as it is observable only at the extremes of valence [28].

In order to estimate an affective state, signals of different modalities are acquired from a subject. The most common signals include the video feed of the subject and the audio recordings of the environment (see, for example, [5], [29]). In a laboratory setting, physiological signals such as the electrocardiogram, electrodermal activity, heart rate, and skin conductance can be considered [27]. Traditionally, the collection of physiological data requires an invasive process, and thus the use of such data is limited for a wide range of practical applications. Estimation of emotional states using audiovisual data is therefore more suitable in real life, as it works even without any cooperation from the subject.

Audiovisual information has shown significant success in categorizing discrete-level emotional states. For example, Imran et al. [9] employed differential components of the orthogonal 2D Gaussian-Hermite moments to classify facial expressions. Noroozi et al. [18] employed convolutional neural network (CNN)-based confidence values from video data, geometric features estimated from key frames, and acoustic features such as the mel-frequency cepstral coefficients (MFCCs) and prosodic features to classify discrete-level emotional states. Meng et al. [16] employed temporally aligned CNN-based visual features for recognizing speech-related facial action units.

In order to estimate continuous-level emotional dimensions, features extracted from audiovisual data are usually employed in a regression-type framework to obtain scores of arousal and valence. For example, the Gabor energy of visual signals has been employed in a multi-class support vector machine classifier for detection of continuous-level emotional dimensions [5]. Selection of dominant features reduces the length of the feature vector and thus improves the computational efficiency of a predictor. For example, Nicolle et al. [17] employed a correlation-based feature selection process to choose a powerful set of audiovisual features. The minimum redundancy and maximum relevance (mRMR) pipeline has been applied to features based on audio intensity, timbre, rhythm, visual motion, and frame statistics for emotional rating of film data [4]. The local binary patterns in three orthogonal planes (LBP-TOP) of video data and low-level audio features such as the MFCCs and their derivatives have been used with considerable success for instantaneous estimation of emotional dimensions [19].
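
As an illustration of this selection step, the following is a minimal sketch of the greedy mRMR criterion for continuous features and a continuous target (such as frame-wise arousal or valence scores). The scikit-learn mutual-information estimator used here is an assumption; the cited works may rely on different estimators or discretization schemes.

```python
# Minimal greedy mRMR sketch: at each step, pick the feature with the
# highest relevance to the target minus its mean redundancy with the
# features selected so far.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr_select(X, y, k):
    """Return indices of k columns of X chosen by the mRMR rule."""
    relevance = mutual_info_regression(X, y)            # I(f_i; y) for all i
    selected = [int(np.argmax(relevance))]              # seed: most relevant
    candidates = set(range(X.shape[1])) - set(selected)
    while len(selected) < k and candidates:
        best, best_score = None, -np.inf
        for i in candidates:
            # Redundancy: mean MI between candidate i and selected features.
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, i])[0]
                for j in selected
            ])
            score = relevance[i] - redundancy           # mRMR criterion
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```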

Instead of handcrafted features, CNN-based learned features have been employed to predict the valence dimension from a video clip [1]. To predict emotional dimensions from video clips, the feature learning process has also combined the CNN with a recurrent neural network (RNN) [11]. The frame-to-frame dynamics of emotional dimensions have been characterized by a time-delay NN using facial features [15]. A cascade of two models, viz., support vector regression (SVR) and a bidirectional long short-term memory (LSTM) network, in a hierarchical framework has been applied to estimate emotional dimensions from audiovisual signals [8].

Deep learning algorithms are a class of machine learning algorithms that learn end-to-end representations of data in different layers to provide the expected outputs [12]. These algorithms have shown significant success in the area, and have set state-of-the-art performance in many classification tasks such as object classification and facial emotion recognition [25]. This class of algorithms has also shown promising performance in regression tasks such as image denoising [10].

In the literature, uses of deep learning for categorical or dimensional affective analysis are relatively few. In a survey paper, Wang and Ji [28] pointed to deep learning as a promising direction for automatic video-based affective content analysis. A number of studies that use deep learning for estimation of continuous emotional states have been published since then. Most of these studies employ a CNN along with an RNN or LSTM in the final layers of abstraction to estimate the affective content [11]. A small number of studies use features extracted from the CNN as part of feature-based affective content analysis. In particular, such studies employ CNN-based features alongside other features extracted from the audio or video data to estimate the emotional dimensions [27]. In these methods, the effect of statistical selection of the affective features learned by the CNN has not been thoroughly investigated. It is worth mentioning that in many CNN models, a careful selection of training data may work significantly better than a generalized, large-size training set [24].

In this context, CNN-based features learned from a larger set of audiovisual data can be used in a classical regression technique, with the features carefully selected, to build a computationally efficient predictor that instantaneously estimates the emotional dimensions. In other words, there remains scope for developing new deep learning-based algorithms using a CNN architecture such that frame-by-frame emotional states can be estimated using a sufficiently small number of statistically selected affective features. Both the selection process of the features and the selection of training frames for traditional regression methods can be investigated. In this paper, an architecture based on a two-stream CNN is explored for end-to-end mapping of audio and video data for instantaneous estimation of the emotional dimensions. The performance of the features extracted from the network is also investigated using a statistical selection pipeline and a classical regression technique.
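
To make the intended pipeline concrete, here is a minimal sketch of that regression stage, assuming per-frame CNN activations have already been extracted into a feature matrix and refined by a selection step such as mrmr_select above. The RBF kernel and hyperparameters are illustrative assumptions, not the settings reported in the paper.

```python
# SVR on statistically selected CNN features: one regressor per
# emotional dimension (e.g., one for arousal, one for valence).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def fit_dimension_regressor(features, targets, selected_idx):
    """Train an SVR on the selected columns of the CNN feature matrix."""
    X = features[:, selected_idx]                  # keep selected features only
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
    model.fit(X, targets)                          # frame-wise annotation trace
    return model

# Usage: frame-wise prediction on unseen data.
# arousal_model = fit_dimension_regressor(train_feats, train_arousal, idx)
# scores = arousal_model.predict(test_feats[:, idx])
```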

In this paper, we propose a CNN-based method for instantaneous prediction of continuous-level emotional dimensions from audiovisual data. In particular, our specific contributions are as follows:

  • A small-size two-stream CNN architecture is proposed to map the audiovisual data into two continuous emotional dimensions, namely, valence and arousal.

  • The performance of estimating the emotional dimensions is evaluated when statistically selected CNN-based features are fed to the SVR technique.

  • Through generalization experiments, it is shown that the CNN-based features perform better than conventional features for predicting emotional dimensions.

The rest of the paper is organized as follows. The detailed structure of the proposed CNN model and the estimation process of emotional dimensions are presented in Section 2. The experimental setup and the results obtained are given in Section 3. Finally, concluding remarks are provided in Section 4.

Section snippets

Proposed method

The main target of this paper is to present a CNN architecture for predicting frame-by-frame emotional states from audiovisual data. In other words, an emotional score is obtained from the current frame of a video and the audio samples associated with that frame. Keeping this in mind, a low-complexity two-stream CNN model, one stream for the video and the other for the audio, with a suitable arrangement of classical layers is proposed. Then, the learned features are employed in a feature…
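
As a rough illustration of such a model, the following PyTorch sketch pairs a 2-D convolutional stream over a face crop with a 1-D convolutional stream over the temporally aligned audio window, fusing the two into a joint regression head. All input sizes and layer widths here are assumptions for illustration and do not reproduce the paper's exact architecture.

```python
# Illustrative low-complexity two-stream CNN: video (2-D) and audio (1-D)
# streams fused for joint regression of valence and arousal.
import torch
import torch.nn as nn

class TwoStreamCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Video stream: 2-D convolutions over a 3x64x64 face crop.
        self.video = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),       # -> 32*4*4 = 512
        )
        # Audio stream: 1-D convolutions over the aligned waveform window.
        self.audio = nn.Sequential(
            nn.Conv1d(1, 16, 9, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, 9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(16), nn.Flatten(),      # -> 32*16 = 512
        )
        # Fused representation regressed to the two emotional dimensions.
        self.head = nn.Sequential(
            nn.Linear(512 + 512, 128), nn.ReLU(),
            nn.Linear(128, 2),                           # [valence, arousal]
        )

    def forward(self, frame, audio_window):
        v = self.video(frame)                            # (N, 512)
        a = self.audio(audio_window)                     # (N, 512)
        return self.head(torch.cat([v, a], dim=1))       # (N, 2)

# Usage: scores = TwoStreamCNN()(torch.randn(8, 3, 64, 64),
#                                torch.randn(8, 1, 1600))
```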

Experiments and results

In order to evaluate the performance of the proposed method, two commonly referenced databases are used in the experiments. A multimodal corpus of spontaneous interactions in French, called the REmote COLlaborative and Affective interactions (RECOLA) corpus, introduced by Ringeval et al. [22], is used for training and testing. The second database, called the Sustained Emotionally colored Machine–human Interaction using Nonverbal Expression (SEMAINE) database [14], has…

Conclusions

In this paper, a two-stream CNN model has been proposed for estimating the instantaneous emotional dimensions, namely, valence and arousal, from audiovisual data. First, the proposed model has been trained for end-to-end mapping of the audio and video data to emotional scores. In the second approach, the set of features extracted from the CNN model has been refined using an mRMR-based feature selection process. The selected features are then employed to estimate the continuous-valued…

CRediT authorship contribution statement

Ramesh Basnet: Conceptualization, Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Mohammad Tariqul Islam: Conceptualization, Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Tamanna Howlader: Formal analysis, Writing - original draft, Writing - review & editing. S. M. Mahbubur Rahman: Conceptualization, Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Dimitrios Hatzinakos: Formal…

Declaration of Competing Interest

The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.

Acknowledgment

We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan XP GPU that was used for this research. The authors would like to thank the anonymous reviewers for their valuable comments, which have been useful in improving the quality of the paper.

References (29)

  • H. Drucker et al., Support vector regression machines, Proc. Advances in Neural Information Processing Systems, Denver, CO (1997).

  • P. Ekman, Emotional and conversational nonverbal signals, Language, Knowledge, and Representation (2005).

  • J. Han et al., Strength modelling for real-world automatic continuous affect recognition from audiovisual signals, Image Vis. Comput. (2017).

  • Y. LeCun et al., Deep learning, Nature (2015).
Handled by Associate editor Song Wang.