
Speech Communication

Volume 127, March 2021, Pages 73-81

Learning deep multimodal affective features for spontaneous speech emotion recognition

https://doi.org/10.1016/j.specom.2020.12.009

Highlights

  • This paper proposes a new method of spontaneous speech emotion recognition by using deep multimodal audio feature learning based on multiple deep convolutional neural networks (multi-CNNs).

  • The proposed method first generates three different audio inputs for the multi-CNNs so as to learn deep multimodal segment-level features from the original 1D audio signal in three ways: 1) a 1D CNN for raw waveform modeling, 2) a 2D CNN for time-frequency Mel-spectrogram modeling, and 3) a 3D CNN for temporal-spatial dynamic modeling. Average pooling is then applied to the segment-level classification results of the 1D, 2D, and 3D CNNs to produce utterance-level classification results. Finally, a score-level fusion strategy is adopted to integrate the different utterance-level classification results for final emotion classification.

  • The learned deep multimodal audio features are shown to be complementary, so combining them in a multi-CNN fusion network achieves significantly improved emotion classification performance.

  • Experiments on two challenging spontaneous emotional speech datasets, the AFEW5.0 and BAUM-1s databases, demonstrate the promising performance of the proposed method.

Abstract

Spontaneous speech emotion recognition has recently become an active and challenging research subject. This paper proposes a new spontaneous speech emotion recognition method based on deep multimodal audio feature learning with multiple deep convolutional neural networks (multi-CNNs). The proposed method first generates three different audio inputs for the multi-CNNs so as to learn deep multimodal segment-level features from the original 1D audio signal in three ways: 1) a 1D CNN for raw waveform modeling, 2) a 2D CNN for time-frequency Mel-spectrogram modeling, and 3) a 3D CNN for temporal-spatial dynamic modeling. Average pooling is then applied to the segment-level classification results of the 1D, 2D, and 3D CNNs to produce utterance-level classification results. Finally, a score-level fusion strategy is adopted to integrate the different utterance-level classification results for final emotion classification. The learned deep multimodal audio features are shown to be complementary, so combining them in a multi-CNN fusion network achieves significantly improved emotion classification performance. Experiments on two challenging spontaneous emotional speech datasets, the AFEW5.0 and BAUM-1s databases, demonstrate the promising performance of the proposed method.

Introduction

In recent years, spontaneous speech emotion recognition (SER) has become an active and challenging research subject in pattern recognition, speech signal processing, and artificial intelligence. This is because spontaneous SER has important applications in human-computer interaction (Li et al., 2018; Zhang et al., 2013). In particular, SER systems aim to provide affective interaction with computers through direct speech rather than traditional input devices, thereby enabling smart affective services for call centers, healthcare, surveillance, and affective computing.

In the SER community, a variety of previous works (Akçay and Oğuz, 2020; Anagnostopoulos et al., 2015; El Ayadi et al., 2011; Liu et al., 2018; Schuller, 2018; Wang et al., 2020) concentrate on acted SER tasks based on data collected from acted emotion expression. The main reason is that acted emotions are easily portrayed in laboratory-controlled environments and usually yield good SER performance. However, acted emotions are often exaggerated, so they cannot effectively represent the characteristics of emotional speech in real-world scenarios. Identifying spontaneous emotions in the wild is therefore more difficult and challenging than recognizing conventional acted emotions.

Speech feature extraction, a crucial step in a spontaneous SER system, aims to derive effective feature representations related to speech emotion expression. The best-known affective speech features (Demircan and Kahramanli, 2017; Gharavian et al., 2012; Song, 2019; Zhang et al., 2018a; Zhao and Zhang, 2015; Zixing et al., 2015) are low-level descriptors (LLDs). Typical early LLDs include prosodic features (pitch, intensity), voice quality features (formants), and spectral features such as Mel-frequency cepstral coefficients (MFCCs), linear predictor coefficients (LPC), and linear predictor cepstral coefficients (LPCC). More recently, various extended feature sets built on LLDs, including INTERSPEECH-2010 (Kayaoglu and Eroglu Erdem, 2015), ComParE (Schuller et al., 2013), AVEC-2013 (Valstar et al., 2013), and GeMAPS (Eyben et al., 2016), have also been developed for SER. Nevertheless, all of these LLDs and their variants are low-level hand-designed features. Owing to the gap between such hand-designed features and subjective emotions, they are not effective enough to represent the emotional characteristics of speech (Akçay and Oğuz, 2020; Anagnostopoulos et al., 2015; El Ayadi et al., 2011; Liu et al., 2018; Schuller, 2018; Wang et al., 2020). Accordingly, it is desirable to develop advanced feature learning approaches that automatically obtain high-level affective feature representations characterizing speakers' emotions.
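For illustration, the following minimal Python sketch (using librosa, with illustrative frame sizes and feature choices rather than the exact configuration of any standard LLD set) shows how a few such LLDs and simple utterance-level functionals might be computed:

```python
# Sketch: extracting a few common frame-level LLDs with librosa.
# The file path, frame/hop lengths, and feature choices are illustrative.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)        # hypothetical file path

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # (13, n_frames)
rms = librosa.feature.rms(y=y, frame_length=400, hop_length=160)   # intensity proxy
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                 frame_length=1024, hop_length=160)      # pitch contour

def functionals(x):
    """Simple utterance-level functionals: mean and std of each LLD over time."""
    x = np.atleast_2d(x)
    return np.concatenate([x.mean(axis=1), x.std(axis=1)])

utterance_vector = np.concatenate([functionals(mfcc), functionals(rms), functionals(f0)])
print(utterance_vector.shape)
```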

To address the above issues, recently emerged deep learning techniques (Hinton and Salakhutdinov, 2006; LeCun et al., 2015), which have attracted extensive attention in the SER community, may offer possible solutions. Owing to their deep architectures, deep learning methods usually have advantages over traditional methods, including the ability to automatically discover complicated structures and features without manual feature extraction. So far, various deep learning techniques, such as deep neural networks (DNNs) (Hinton and Salakhutdinov, 2006), deep convolutional neural networks (CNNs) (Krizhevsky et al., 2012), and long short-term memory based recurrent neural networks (LSTM-RNNs) (Graves, 2012), have been used for high-level feature learning in SER. DNNs are feed-forward networks containing one or more hidden layers between inputs and outputs, and they have shown promising performance for SER. For example, in (Han et al., 2014), MFCCs are fed into a DNN to learn high-level features, and an extreme learning machine (ELM) is then used for speech emotion classification. In (Wang and Tashev, 2017), a DNN encodes all frames in an utterance into a fixed-length vector by pooling the activations of the last hidden layer over time, and a kernel ELM is then trained on the encoded vectors for utterance-level emotion classification. However, because such DNNs take hand-designed features as inputs, they cannot effectively obtain discriminative features for SER.
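As an illustration of the frame-pooling idea in (Wang and Tashev, 2017), the following PyTorch sketch pools the last hidden activations of a frame-level DNN over time into a fixed-length utterance vector; the layer sizes are illustrative assumptions, and a plain linear classifier stands in for the kernel ELM used in the cited work:

```python
import torch
import torch.nn as nn

class FrameDNNEncoder(nn.Module):
    """Frame-level DNN whose last hidden layer is pooled over time
    into one fixed-length vector per utterance (layer sizes illustrative)."""
    def __init__(self, n_lld=39, hidden=256, n_classes=7):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(n_lld, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Stand-in for the kernel ELM used in the cited work.
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, frames):                  # frames: (n_frames, n_lld)
        h = self.frame_net(frames)              # (n_frames, hidden)
        utt_vec = h.mean(dim=0)                 # average-pool over time
        return self.classifier(utt_vec)         # utterance-level logits

logits = FrameDNNEncoder()(torch.randn(120, 39))   # e.g. 120 frames of 39-dim LLDs
```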

CNNs comprise multiple convolutional and pooling layers and are thus capable of capturing mid-level feature representations from input data. Benefiting from the great success of CNNs in computer vision tasks (Krizhevsky et al., 2012), a 2D time-frequency representation derived from an audio spectrogram is usually fed into CNNs for SER. In particular, in (Mao et al., 2014) the authors adopt audio spectrograms as inputs to a hybrid deep model, which combines a sparse auto-encoder with a one-layer CNN, to learn salient features for SER. In (Badshah et al., 2019), segment-level spectrograms are fed into a CNN consisting of five convolutional layers and three pooling layers to capture discriminative features for SER. In (Zhang et al., 2018c), an image-like spectrogram is used as input to a deep AlexNet-style CNN (Krizhevsky et al., 2012) to extract high-level segment-level feature representations for SER. In recent years, combining CNNs with LSTM-RNNs (i.e., CNN+LSTM/RNNs) has become a new research trend in the SER community. In (Zhao et al., 2019b, 2018), using segment-level spectrograms, the authors integrate an attention-based bidirectional LSTM with a spatial CNN that has a fully convolutional network (FCN)-like structure for deep spectrum feature extraction on SER tasks. In (Zhang et al., 2019a), a multiscale deep convolutional LSTM framework based on segment-level spectrograms is presented for SER. In (Zhao et al., 2019a), the authors build compact convolutional RNNs for SER via binarization, quantizing the network weights from their original full-precision values into binary values.
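For concreteness, a minimal 2D CNN operating on a log-Mel spectrogram segment might look as follows; this is a PyTorch sketch with illustrative depth and filter counts, not the AlexNet-style architecture of the cited works:

```python
import torch
import torch.nn as nn

class SegmentCNN2D(nn.Module):
    """Small 2D CNN over a log-Mel spectrogram segment (1 x n_mels x n_frames).
    Layer sizes are illustrative only."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling -> (128, 1, 1)
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):                       # x: (batch, 1, n_mels, n_frames)
        z = self.features(x).flatten(1)
        return self.classifier(z)               # segment-level logits

probs = SegmentCNN2D()(torch.randn(8, 1, 64, 64)).softmax(dim=1)
```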

Note that the above 2D CNN-based methods, including plain CNNs and CNN+LSTM/RNNs, can capture energy modulation patterns across time and frequency from 2D time-frequency spectrograms of audio signals, and hence achieve good performance on SER tasks. However, because they take individual 2D time-frequency spectrograms as inputs, they fail to capture the changes across the 2D time-frequency representations of consecutive frames in an utterance, and thus cannot obtain sufficiently discriminative features for SER. Although LSTM-RNNs can be used for temporal modeling of audio signals, they tend to overemphasize the temporal information.

To tackle this issue, recently developed 3D CNNs (Dong et al., 2020; Tran et al., 2015), originally used for video processing, may provide a solution, since they can simultaneously learn temporal and spatial feature representations through 3D convolution and pooling operations. Motivated by the 3D motion structure of videos, we generate appropriate 3D signals from the original 1D audio signals as inputs to 3D CNNs. Extracting such video-like 3D signals aims to emphasize the different spectral characteristics of neighboring regions in an utterance along the temporal and spatial dimensions.
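A possible construction of such a video-like 3D input, together with a small 3D CNN, is sketched below; the stacking procedure, segment length, and layer sizes are assumptions for illustration rather than the exact design used in this paper:

```python
import torch
import torch.nn as nn

def stack_segments(log_mel, seg_frames=64, depth=8):
    """Cut a log-Mel spectrogram (n_mels, n_frames) into consecutive segments and
    stack `depth` of them into one video-like cube (1, depth, n_mels, seg_frames).
    An illustrative construction, not the paper's exact procedure."""
    n_mels, n_frames = log_mel.shape
    segs = [log_mel[:, i * seg_frames:(i + 1) * seg_frames]
            for i in range(min(depth, n_frames // seg_frames))]
    cube = torch.stack(segs, dim=0)             # (depth, n_mels, seg_frames)
    return cube.unsqueeze(0)                    # add channel dimension

class SegmentCNN3D(nn.Module):
    """Small 3D CNN over stacked spectrogram segments (layer sizes illustrative)."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                       # x: (batch, 1, depth, n_mels, frames)
        return self.classifier(self.features(x).flatten(1))

cube = stack_segments(torch.randn(64, 640))     # 64 Mel bands, 640 frames
logits = SegmentCNN3D()(cube.unsqueeze(0))      # add batch dimension
```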

Additionally, different 1D CNN models have also been leveraged in recent years for feature learning on SER tasks. In (Fayek et al., 2017), the authors investigate the performance of several 1D CNN structures with one or two convolutional layers for learning feature representations from the original 1D raw audio waveforms on SER tasks. Nevertheless, such 1D CNN models with only one or two convolutional layers are relatively shallow, so the learned 1D CNN features may not be discriminative enough for SER. To address this issue, the recently developed deep sample-level 1D CNNs (Kim et al., 2018; Lee et al., 2018), in which the filters in the bottom layer span only a few samples, may provide a solution. To date, sample-level 1D CNNs have been successfully employed to learn feature representations from the original 1D raw audio waveforms on music classification and generation tasks. Motivated by the VGG networks (Simonyan and Zisserman, 2015) in image classification, which are built from deep stacks of small convolutional filters, sample-level CNNs adopt very small filters in time for all convolutional layers and show performance comparable to that obtained with 2D Mel-spectrograms in music classification and generation (Kim et al., 2018; Lee et al., 2018).
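A minimal sample-level 1D CNN in this spirit, with small temporal filters applied directly to the raw waveform, could be sketched as follows; depth and channel counts are illustrative assumptions, not the configuration of the cited works:

```python
import torch
import torch.nn as nn

class SampleLevelCNN1D(nn.Module):
    """Sample-level 1D CNN on raw waveforms: small filters (size 3) throughout,
    in the spirit of sample-level CNNs for music tagging. Sizes are illustrative."""
    def __init__(self, n_classes=7):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv1d(cin, cout, kernel_size=3, padding=1),
                                 nn.BatchNorm1d(cout), nn.ReLU(), nn.MaxPool1d(3))
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, stride=3),   # strided bottom layer
            nn.ReLU(),
            block(64, 64), block(64, 128), block(128, 128), block(128, 256),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, wav):                     # wav: (batch, 1, n_samples)
        return self.classifier(self.features(wav).flatten(1))

logits = SampleLevelCNN1D()(torch.randn(4, 1, 16000))   # 1-second segments at 16 kHz
```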

Note that the feature representations learned by 1D and 3D CNNs may capture acoustic characteristics quite different from those of 2D CNNs based on time-frequency representations. In particular, the raw 1D audio waveforms used as inputs to 1D CNNs address log-scale amplitude compression and phase invariance (Kim et al., 2018; Lee et al., 2018). The extracted 2D Mel-spectrogram segments used as inputs to 2D CNNs capture energy modulation patterns across time and frequency (Zhang et al., 2018c). The extracted 3D video-like Mel-spectrogram segments used as inputs to 3D CNNs emphasize the different spectral characteristics of neighboring regions in an utterance along the temporal and spatial dimensions. This suggests that the deep multimodal features learned by the 1D, 2D, and 3D CNNs may be complementary, so they can be integrated in a multi-CNN fusion network to further improve speech emotion classification performance.
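Score-level fusion of the three branches can be illustrated with a short numpy sketch; the equal weights used here are placeholders, and in practice the fusion weights could be tuned on a validation set:

```python
import numpy as np

def fuse_scores(p_1d, p_2d, p_3d, weights=(1/3, 1/3, 1/3)):
    """Score-level fusion of utterance-level class probabilities from the
    three CNN branches. Equal weights are a placeholder choice."""
    scores = np.stack([p_1d, p_2d, p_3d])               # (3, n_classes)
    fused = np.average(scores, axis=0, weights=weights)  # weighted average of scores
    return int(np.argmax(fused)), fused

# Hypothetical utterance-level probabilities over 7 emotion classes.
p1 = np.array([0.10, 0.05, 0.40, 0.10, 0.15, 0.10, 0.10])
p2 = np.array([0.05, 0.10, 0.35, 0.20, 0.10, 0.10, 0.10])
p3 = np.array([0.10, 0.10, 0.30, 0.15, 0.15, 0.10, 0.10])
label, fused = fuse_scores(p1, p2, p3)
```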

Motivated by the complementarity among the deep multimodal features learned by the 1D, 2D, and 3D CNNs, we propose a new spontaneous SER method that learns deep multimodal audio features with a multi-CNN fusion network to improve performance on spontaneous SER tasks. Fig. 1 shows the overall architecture of our proposed method. In particular, we generate three appropriate audio inputs corresponding to three different CNN architectures, so as to learn deep multimodal features from the original 1D audio signals in three ways: 1) a 1D CNN for raw waveform modeling, 2) a 2D CNN for time-frequency Mel-spectrogram modeling, and 3) a 3D CNN for temporal-spatial dynamic modeling.

The main contributions of this paper are three-fold:

  • (1)

    To the best of our knowledge, this is the first attempt to present a new spontaneous SER method that integrates deep multimodal audio feature learning with 1D, 2D, and 3D CNNs. Multiple high-level feature representations from the three proposed CNNs are extracted as deep features, followed by average pooling and a score-level fusion network for final emotion classification.

  • (2)

    We generate three different audio inputs for the multi-CNNs from the original 1D raw audio waveform so as to learn deep multimodal features. Specifically, inspired by the 3D motion structure of videos, we generate appropriate 3D audio signals from the original 1D audio waveform as inputs to the 3D CNN for spatial-temporal feature learning, analogous to video processing with 3D CNNs in computer vision.

  • (3)

    We conduct extensive experiments on two challenging spontaneous emotional speech datasets, i.e., the AFEW5.0 (Dhall et al., 2015) and BAUM-1s (Zhalehpour et al., 2017) databases. Experimental results show the validity of our proposed method on spontaneous SER tasks.

The rest of this paper is structured as follows. Section 2 introduces the proposed method in detail. Section 3 presents experimental results and analysis. Finally, Section 4 presents conclusions and future work.

Section snippets

Proposed method

Fig. 1 presents the overall architecture of our proposed deep multimodal feature learning with multi-CNNs for spontaneous speech emotion recognition. The proposed method comprises three steps: (1) generating appropriate multimodal audio representations, (2) learning multimodal audio features with multi-CNNs, and (3) fusing multimodal results at the score level. In the following, we describe these three steps in detail.
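As an illustration of step (1), the following sketch derives the three CNN inputs from a single waveform under assumed segment lengths and Mel parameters (the actual parameters of the pipeline may differ): raw-waveform segments for the 1D CNN, log-Mel segments for the 2D CNN, and stacked log-Mel segment cubes for the 3D CNN:

```python
import numpy as np
import librosa

def make_multimodal_inputs(wav_path, sr=16000, seg_sec=1.0, n_mels=64, depth=8):
    """Derive the three CNN inputs from one waveform (illustrative parameters):
    - raw_segs:  (n_segs, seg_samples)                  for the 1D CNN
    - mel_segs:  (n_segs, n_mels, seg_frames)           for the 2D CNN
    - mel_cubes: (n_cubes, depth, n_mels, seg_frames)   for the 3D CNN
    Assumes the utterance is at least depth * seg_sec seconds long."""
    y, _ = librosa.load(wav_path, sr=sr)                # hypothetical file path
    seg_samples = int(seg_sec * sr)
    n_segs = len(y) // seg_samples
    raw_segs = y[:n_segs * seg_samples].reshape(n_segs, seg_samples)

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         n_fft=400, hop_length=160)
    log_mel = librosa.power_to_db(mel)                  # log-Mel spectrogram
    seg_frames = seg_samples // 160
    n_mel_segs = log_mel.shape[1] // seg_frames
    mel_segs = np.stack([log_mel[:, i * seg_frames:(i + 1) * seg_frames]
                         for i in range(n_mel_segs)])

    n_cubes = n_mel_segs // depth
    mel_cubes = mel_segs[:n_cubes * depth].reshape(n_cubes, depth, n_mels, seg_frames)
    return raw_segs, mel_segs, mel_cubes
```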

Experiments

To verify the effectiveness of our proposed method on spontaneous SER tasks, two challenging spontaneous emotional speech datasets, i.e., AFEW5.0 (Dhall et al., 2015) and BAUM-1s (Zhalehpour et al., 2017), are employed for the spontaneous SER experiments. We do not use acted emotional speech datasets, because this work focuses on spontaneous SER rather than conventional acted SER.

Conclusions and future work

Considering the complementarity among the multimodal feature representations learned by the 1D, 2D, and 3D CNNs, this paper proposes a new spontaneous SER method based on deep multimodal audio feature learning with multi-CNNs. The key steps of the proposed method are to generate appropriate inputs for the 1D, 2D, and 3D CNNs from the original 1D audio waveforms and to design suitable CNN architectures for multimodal feature learning. Experimental results on the AFEW5.0 and BAUM-1s databases demonstrate the promising performance of the proposed method on spontaneous SER tasks.

Author contributions

Shiqing Zhang: writing, original draft preparation. Xin Tao and Yuelong Chuang: experimental tests. Xiaoming Zhao: supervision, review, and editing.

Declaration of Competing Interest

The authors declare that they have no competing interests.

Acknowledgments

This work was supported by the Zhejiang Provincial Natural Science Foundation of China under Grant Nos. LZ20F020002 and LQ21F020002, and by the National Natural Science Foundation of China (NSFC) under Grant No. 61976149.

References (51)

  • J. Cai et al., Feature-level and model-level audiovisual fusion for emotion recognition in the wild.

  • J. Cai et al., Island loss for learning discriminative features in facial expression recognition.

  • S. Demircan et al., Application of fuzzy C-means clustering algorithm to spectral features for emotion classification from speech, Neural Comput. Appl. (2017).

  • A. Dhall et al., Video and image based emotion recognition challenges in the wild: EmotiW.

  • S. Dong et al., IoT-based 3D convolution for video salient object detection, Neural Comput. Appl. (2020).

  • S. Ebrahimi Kahou et al., Recurrent neural networks for emotion recognition in video.

  • F. Eyben et al., The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing, IEEE Trans. Affect. Comput. (2016).

  • D. Gharavian et al., Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network, Neural Comput. Appl. (2012).

  • A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks (2012).

  • K. Han et al., Speech emotion recognition using deep neural network and extreme learning machine, Interspeech (2014).

  • K. He et al., Deep residual learning for image recognition.

  • G.E. Hinton et al., Reducing the dimensionality of data with neural networks, Science (2006).

  • C.-W. Huang et al., Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition.

  • M. Kayaoglu et al., Affect recognition using key frame selection based on minimum sparse reconstruction.

  • T. Kim et al., Sample-level CNN architectures for music auto-tagging using raw waveforms.