Estimation of affective dimensions using CNN-based features of audiovisual data
Introduction
Humans have an innate ability to track the instantaneous affective state of another person during human-human interaction [21]. Imitating this ability, however, is vastly challenging for a machine. In recent years, there has been increasing demand for the ‘sensitive artificial listener’ in order to tailor services to the emotional state of a user. For example, a content provider may wish to show content according to the emotional state of a user so that coherent and cohesive interactions can be maintained with the person using its services. The emotional state of a driver can be tracked to provide safety measures for the driver as well as for nearby vehicles and pedestrians. Such technologies have already been put to use in innovative applications such as emotion-aware advertisement [13]. It is apparent that in the near future, instantaneous estimation of affective states will become an integral part of community-driven innovative products [3]. Thus, it is necessary to develop effective techniques for predicting the instantaneous affective states of humans in real life. In this section, the background and related works are briefly reviewed first. Then the scope of the area and the specific contributions of this work are presented.
The interdisciplinary field of affective content analysis traditionally draws on psychological studies, but relies mostly on computational intelligence for automation. In order to quantify emotional states, psychologists have employed two approaches: categorical and dimensional [28]. In the categorical approach, emotion is divided into six categories of expression, namely happiness, sadness, anger, disgust, fear, and surprise [7]. However, quantification of emotional states into discrete categories may not be sufficient to reflect the complexity of internal feelings, where multiple states may occur simultaneously. In such a context, modeling a continuous-valued emotional state as a degree along different emotional dimensions is a viable idea for quantifying subtle and context-specific emotions. As a result, affective computing is increasingly moving towards the dimensional approach, which quantifies emotional states in three dimensions: arousal, valence, and dominance. Arousal measures the activation level of the emotion, valence represents the degree of pleasure, and dominance shows the controlling nature. In most cases, the first two dimensions suffice for representing an emotional state, and dominance can be ignored because it is observable only at the extremes of valence [28].
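The relation between the two representations can be illustrated with a small sketch: discrete categories occupy approximate positions in the valence-arousal plane, while the dimensional approach works with the continuous coordinates directly. The coordinates below are rough illustrative placements, not values taken from this paper or any cited study.

```python
# Illustrative sketch: discrete emotion categories placed as points in the
# valence-arousal plane. The coordinates are illustrative assumptions only.
EMOTION_COORDS = {
    #            (valence, arousal), each in [-1, 1]
    "happiness": ( 0.8,  0.5),
    "sadness":   (-0.7, -0.4),
    "anger":     (-0.6,  0.8),
    "disgust":   (-0.7,  0.2),
    "fear":      (-0.6,  0.7),
    "surprise":  ( 0.1,  0.9),
}

def nearest_category(valence: float, arousal: float) -> str:
    """Map a continuous (valence, arousal) estimate back to the closest
    discrete category by Euclidean distance in the plane."""
    return min(
        EMOTION_COORDS,
        key=lambda c: (EMOTION_COORDS[c][0] - valence) ** 2
                      + (EMOTION_COORDS[c][1] - arousal) ** 2,
    )
```

The continuous estimate carries strictly more information than the category label: many distinct (valence, arousal) pairs collapse onto the same discrete class, which is why the dimensional approach suits subtle, context-specific emotions.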
In order to estimate an affective state, signals of different modalities are extracted from a subject. The most common signals include the video feed of the subject and audio recordings of the environment (see, for example, [5], [29]). In a laboratory setting, physiological signals such as the electrocardiogram, electrodermal activity, heart rate, and skin conductance can be considered [27]. Traditionally, however, the collection of physiological data is an intrusive process, which limits the use of such data in a wide range of practical applications. Thus, estimation of emotional states using audiovisual data is more suitable in real life, even without any cooperation from the subject.
Audiovisual information has been shown to have significant success in categorizing discrete-level emotional states. For example, Imran et al. [9] have employed differential components of the orthogonal 2D Gaussian-Hermite moments to classify facial expressions. Noroozi et al. [18] have employed convolutional neural network (CNN)-based confidence values from video data, geometric features estimated from key frames, and acoustic features such as the mel-frequency cepstral coefficients (MFCC) and prosodic features to classify discrete-level emotional states. Meng et al. [16] employed temporally aligned CNN-based visual features for recognizing speech-related facial action units.
In order to estimate continuous-level emotional dimensions, features extracted from audiovisual data are usually employed in a regression-type framework to obtain the scores of arousal and valence. For example, the Gabor energy of visual signals has been employed in a multi-class support vector machine classifier for detection of continuous-level emotional dimensions [5]. Selection of dominant features reduces the length of the feature vector and thus improves the computational efficiency of a predictor. For example, Nicolle et al. [17] employed a correlation-based feature selection process to choose a powerful set of audiovisual features. The minimum redundancy maximum relevance (mRMR) pipeline has been applied to features based on audio intensity, timbre, rhythm, visual motion, and frame statistics for emotional rating of film data [4]. The local binary patterns in three orthogonal planes (LBP-TOP) of video data, together with low-level audio features such as the MFCCs and their derivatives, have been used with considerable success for instantaneous estimation of emotional dimensions [19].
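The greedy logic behind mRMR-style selection can be sketched in a few lines. The sketch below uses absolute Pearson correlation as a simple stand-in for the mutual information scores that mRMR proper would use; at each step it adds the feature that best trades off relevance to the target against redundancy with the features already chosen. This is an illustrative approximation, not the exact procedure of the cited works.

```python
import numpy as np

def mrmr_select(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Greedy mRMR-style selection using |Pearson correlation| as a proxy
    for mutual information: pick, at each step, the feature maximizing
    (relevance to target) - (mean redundancy with already-selected ones)."""
    n_features = X.shape[1]
    # Relevance: |corr(feature_j, y)| for every candidate feature j.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    selected = [int(np.argmax(relevance))]   # start from the most relevant
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean |corr| with the features chosen so far.
            redundancy = np.mean(
                [abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

On data containing a duplicated feature, the redundancy term correctly steers the second pick away from the duplicate toward an equally relevant but non-redundant feature, which is exactly the behavior that shortens the feature vector without losing predictive power.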
Instead of handcrafted features, CNN-based learned features have been employed to predict the valence dimension from a video clip [1]. In order to predict emotional dimensions from video clips, the feature learning process has combined the CNN with the recurrent neural network (RNN) [11]. The frame-to-frame dynamics of emotional dimensions have been characterized by a time-delay NN using facial features [15]. A cascade of two models, viz., the support vector regression (SVR) and the bidirectional long short-term memory (LSTM) network, in a hierarchical framework has been applied to estimate emotional dimensions using audiovisual signals [8].
Deep learning algorithms are a class of machine learning algorithms that learn end-to-end representations of data in different layers to provide the expected outputs [12]. These algorithms have shown significant success and have set state-of-the-art performance in many classification tasks such as object classification and facial emotion recognition [25]. They have also shown promising performance in regression tasks such as image denoising [10].
In the literature, uses of deep learning for categorical or dimensional affective analysis are relatively few. In a survey paper, Wang and Ji [28] expressed interest in using deep learning to automatically characterize video-based affective content. A number of studies that use deep learning for estimation of continuous emotional states have been published since then. Most of these studies employ a CNN along with an RNN or LSTM in the final layers of abstraction to estimate the affective content [11]. A small number of studies use features extracted from a CNN as part of feature-based affective content analysis. In particular, such studies employ CNN-based features alongside other features extracted from the audio or video data to estimate the emotional dimensions [27]. In these methods, the effect of statistical selection of the affective features learned by the CNN has not been thoroughly investigated. It is worth mentioning that in many CNN models, a special selection of training data may work significantly better than a generalized large-size training set [24].
In this context, CNN-based features learned from a larger set of audiovisual data can be used in a classical regression technique by carefully selecting the features, so as to build a computationally efficient predictor that instantaneously estimates the emotional dimensions. In other words, there remains scope for developing new deep learning-based algorithms using a CNN architecture such that frame-by-frame emotional states can be estimated using a sufficiently low number of statistically selected affective features. Both the selection process of the features and the selection of training frames for traditional regression methods can be investigated. In this paper, an architecture based on a two-stream CNN is explored for end-to-end mapping of audio and video data for instantaneous estimation of the emotional dimensions. The performance of the features extracted from the network is also investigated using a statistical selection pipeline and a classical regression technique.
In this paper, we propose a CNN-based method for instantaneous prediction of continuous-level emotional dimensions from audiovisual data. In particular, our specific contributions are as follows:
- A small-size two-stream CNN architecture is proposed to map the audiovisual data into two types of continuous emotional dimensions, namely valence and arousal.
- The performance of estimating emotional dimensions is evaluated under statistical selection of the CNN-based features applied to the SVR technique.
- Through generalization experiments, it is shown that the CNN-based features perform better than conventional features for predicting emotional dimensions.
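The second contribution, feeding statistically selected CNN features into an SVR, can be sketched as a standard pipeline. The sketch below uses synthetic stand-in features and scikit-learn's mutual-information scorer and SVR; in the actual method the inputs would be the features taken from the trained two-stream CNN, and the selection stage would follow the paper's mRMR procedure rather than this simplified top-k scoring.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for per-frame CNN features and arousal labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 64))          # 300 frames, 64-D learned features
y = np.tanh(X[:, 0] - 0.5 * X[:, 1])    # pretend arousal depends on two dims

model = make_pipeline(
    StandardScaler(),
    # Statistical selection: keep the 8 features with highest estimated
    # mutual information with the target (random_state fixed for stability).
    SelectKBest(lambda X, y: mutual_info_regression(X, y, random_state=0), k=8),
    SVR(kernel="rbf", C=1.0),           # classical regressor on selected features
)
model.fit(X[:200], y[:200])             # train on the first 200 frames
pred = model.predict(X[200:])           # frame-wise predictions on the rest
```

Because the regressor sees only a short selected feature vector, prediction per frame stays cheap, which is the computational-efficiency argument made above.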
The rest of the paper is organized as follows. The detailed structure of the proposed CNN model and the estimation process of emotional dimensions are presented in Section 2. The experimental setup and the results obtained are given in Section 3. Finally, concluding remarks are provided in Section 4.
Proposed method
The main target of this paper is to present a CNN architecture for prediction of frame-by-frame emotional states from audiovisual data. In other words, an emotional score is obtained from the current frame of a video and the audio samples associated with this frame. Keeping this in mind, a low-complexity two-stream CNN model, one stream for the video and the other for the audio, with suitable arrangements of classical layers is proposed. Then, the learned features are employed in a feature
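A two-stream model of this kind can be sketched as follows. The layer counts, channel widths, and input sizes below are illustrative assumptions, not the paper's exact architecture; the point is the structure: one convolutional stream per modality, concatenation of the two learned feature vectors, and a small head regressing the two emotional dimensions.

```python
import torch
import torch.nn as nn

class TwoStreamAffectNet(nn.Module):
    """Sketch of a low-complexity two-stream CNN: one stream for a video
    frame, one for an audio feature patch (e.g. a log-mel spectrogram)
    aligned with that frame. All sizes are illustrative guesses."""

    def __init__(self):
        super().__init__()
        self.video = nn.Sequential(          # input: 3 x 64 x 64 face crop
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),   # -> 32*4*4 = 512
        )
        self.audio = nn.Sequential(          # input: 1 x 64 x 32 audio patch
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),   # -> 16*4*4 = 256
        )
        self.head = nn.Sequential(           # fused features -> (valence, arousal)
            nn.Linear(512 + 256, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Tanh(),     # outputs bounded to [-1, 1]
        )

    def forward(self, frame, spec):
        fused = torch.cat([self.video(frame), self.audio(spec)], dim=1)
        return self.head(fused)
```

The activations of `fused` (or of an intermediate layer) are also the natural place from which to extract the learned features that the selection-plus-regression pipeline then consumes.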
Experiments and results
In order to evaluate the performance of the proposed method, commonly-referred two databases are used in the experiments. A multimodal corpus of spontaneous interactions in French called the REmote COLlaborative and Affective interactions (RECOLA) that has been introduced by Ringeval et al. [22] is used for the purpose of training and testing of the experiments. The second database called the Sustained Emotionally colored Machine–human Interaction using Nonverbal Expression (SEMAINE) [14] has
Conclusions
In this paper, a two-stream CNN model has been proposed for estimating the instantaneous emotional dimensions, namely, valence and arousal from audiovisual data. First, the proposed model has been trained for end-to-end mapping the emotional scores from the audio and video data. In the second approach, the set of features extracted from the CNN model have been refined using an mRMR-based feature selection process. The selected features are then employed to estimate the continuous-valued
CRediT authorship contribution statement
Ramesh Basnet: Conceptualization, Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Mohammad Tariqul Islam: Conceptualization, Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Tamanna Howlader: Formal analysis, Writing - original draft, Writing - review & editing. S. M. Mahbubur Rahman: Conceptualization, Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Dimitrios Hatzinakos: Formal
Declaration of Competing Interest
The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.
Acknowledgment
We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan Xp GPU that was used for this research. The authors would like to thank the anonymous reviewers for their valuable comments, which have been useful in improving the quality of the paper.
References (29)
- Differential components of discriminative 2D Gaussian-Hermite moments for recognition of facial expressions, Pattern Recognition (2016)
- Mixed Gaussian-impulse noise reduction using convolutional neural network, Signal Process. (2018)
- How deep neural networks can improve emotion recognition on video data, Proc. IEEE Int. Conf. Image Processing, Phoenix, AZ (2016)
- The SEMAINE corpus of emotionally coloured character interactions, Proc. IEEE Int. Conf. Multimedia and Expo, Singapore (2010)
- Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions, Proc. IEEE Int. Conf. and Workshops on Automatic Face and Gesture Recognition, Shanghai, China (2013)
- Statistical selection of CNN-based audiovisual features for instantaneous estimation of human emotional states, Proc. Int. Conf. New Trends in Computing Sciences, Amman, Jordan (2017)
- Algorithms for hyper-parameter optimization, Proc. Advances in Neural Information Processing Systems, Granada, Spain (2011)
- Affective computing and sentiment analysis, IEEE Intell. Syst. (2016)
- Mutual Information-Based Emotion Recognition (2013)
- Continuous emotion recognition using Gabor energy filters, Lecture Notes in Computer Science: Affective Computing and Intelligent Interaction (2011)
- Support vector regression machines, Proc. Advances in Neural Information Processing Systems, Denver, CO
- Emotional and conversational nonverbal signals, Language, Knowledge, and Representation
- Strength modelling for real-world automatic continuous affect recognition from audiovisual signals, Image Vis. Comput.
- Deep learning, Nature
Handled by Associate editor Song Wang.