Emotion recognition using heterogeneous convolutional neural networks combined with multimodal factorized bilinear pooling

https://doi.org/10.1016/j.bspc.2022.103877

Highlights

  • Use normalized mutual information to measure the relationships between EEG signal features and corresponding emotions.

  • Employ the HCNNs to extract the convolutional features and use the MFB method to fuse these features.

  • Adopt an ensemble strategy to verify the performance of the HCNNs model on the fusion features.

Abstract

Multimodal emotion recognition is one of the challenging topics in the field of knowledge-based systems, and many methods have been studied with some success. Nevertheless, multimodal emotion recognition requires effective fusion representations across modalities, and existing methods still fall short on this challenging task. In view of this, this paper proposes a new deep learning model for emotion recognition based on heterogeneous convolutional neural networks (HCNNs) and multimodal factorized bilinear pooling (MFB). In the proposed model, we first select the channels of electroencephalogram (EEG) signals to reduce the interference caused by redundant channels. Second, the HCNNs extract the convolutional features of each modality, and the MFB method then fuses the deep convolutional features of the different modalities. Finally, an ensemble strategy is used to evaluate the proposed model and to explore the influence of the various frequency bands on the results. The proposed method allows all elements of each modality's feature representation to interact with one another, capturing the complex internal relationships within and across the component modalities. The experimental results show that the best average result of our proposed method reaches an accuracy of 91.84% on the DEAP dataset and 90.17% on the MAHNOB-HCI dataset, which demonstrates that the proposed method can improve the performance of multimodal emotion recognition and significantly outperforms state-of-the-art methods.

Introduction

Emotion recognition technology can sense a person's emotional state by analyzing and processing the signals collected by sensors. To realize a harmonious and efficient human–computer interaction environment and make computers more intelligent, many researchers have conducted extensive research on various forms of emotional information [1], [2], [3]. These studies have shown that emotional fluctuations lead to changes in behavior, mentality, and physiology. Non-physiological signals such as facial expressions and body postures can express emotions independently. However, in some circumstances, non-physiological signals such as facial expressions, pronunciation, and intonation cannot accurately reflect a person's emotions [4]. In contrast, physiological signals such as the electroencephalogram (EEG) and peripheral physiological signals (PPS), including electromyography (EMG) and electrooculogram (EOG) signals, are not influenced by subjective human factors but are governed by the autonomic nervous system, and can therefore objectively and truthfully reflect the emotional state elicited by a stimulus. Consequently, many emotion recognition methods based on EEG and PPS signals have appeared. Researchers have begun to combine EEG signals with other modal signals for emotion recognition [5], [6], [7], which is more practical and reliable than traditional emotion recognition methods.

Emotion, as a subjective feeling, is difficult to represent with quantitative models [8]. In many studies, a two-dimensional space composed of valence and arousal is used to model emotion: the valence dimension ranges from unpleasant to pleasant, while the arousal dimension ranges from inactive to active. Although EEG signals can reveal the rules of brain activity and reflect the various emotional states induced by the external environment, emotion recognition using all EEG channels suffers from data redundancy and complex hardware, which makes it unsuitable for daily wearable monitoring devices [9]. Therefore, selecting appropriate channels and alleviating the overfitting caused by unrelated channels is crucial for improving recognition performance. The literature on emotion recognition has also investigated brain asymmetry, and the results show that the EEG regions that are useful differ according to the aspect of emotion being considered [10].
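As a concrete illustration of this dimensional model, the minimal sketch below maps continuous valence and arousal self-reports to binary high/low classes by thresholding at the scale midpoint. The 1–9 scale and the threshold of 5 follow the common convention for DEAP-style ratings and are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def binarize_ratings(valence, arousal, threshold=5.0):
    """Map continuous valence/arousal ratings to binary high/low labels.

    threshold=5.0 assumes a 1-9 self-assessment scale (as in DEAP);
    this midpoint split is an illustrative convention, not the paper's rule.
    """
    valence = np.asarray(valence, dtype=float)
    arousal = np.asarray(arousal, dtype=float)
    y_valence = (valence > threshold).astype(int)  # 1 = pleasant, 0 = unpleasant
    y_arousal = (arousal > threshold).astype(int)  # 1 = active,   0 = inactive
    return y_valence, y_arousal

# Example: three trials with (valence, arousal) self-reports
v, a = binarize_ratings([2.5, 7.0, 5.5], [8.0, 3.0, 6.5])
print(v, a)  # [0 1 1] [1 0 1]
```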

The main steps of multimodal emotion recognition and affective computing are feature extraction and multimodal fusion [11]. The goal of feature extraction is to identify the important elements of the input signal, construct feature vectors from these elements, and use the feature vectors to classify the corresponding emotions, thereby simplifying the subsequent classification task [12]. Recently, deep learning has gradually been applied to feature extraction [13]. With this approach, a neural network can process raw data without manual preprocessing: the raw data are decomposed into successive levels of abstraction, and relevant features are extracted automatically, avoiding hand-crafted feature extraction techniques.
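To make the idea of learning features directly from raw signals concrete, the following is a minimal sketch of a 1-D convolutional feature extractor in PyTorch. The layer widths, kernel sizes, and the assumption of a 32-channel input sampled at 128 Hz are illustrative and do not reproduce the architectures used in this paper.

```python
import torch
import torch.nn as nn

class Conv1DFeatureExtractor(nn.Module):
    """Minimal 1-D CNN that turns a raw multichannel signal into a feature vector.

    Layer widths and the 32-channel input are illustrative assumptions,
    not the HCNN architecture reported in the paper.
    """
    def __init__(self, in_channels=32, feature_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
        )
        self.proj = nn.Linear(128, feature_dim)

    def forward(self, x):                 # x: (batch, channels, time)
        h = self.encoder(x).squeeze(-1)   # (batch, 128)
        return self.proj(h)               # (batch, feature_dim)

# Example: a batch of 8 one-second segments sampled at 128 Hz
features = Conv1DFeatureExtractor()(torch.randn(8, 32, 128))
print(features.shape)  # torch.Size([8, 128])
```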

Multimodal fusion integrates the statistical properties of different modalities to benefit from the complementarity between them, which can significantly improve recognition performance and provide robustness when feature extraction for one modality fails [14]. Multimodal fusion typically relies on early fusion or late fusion. Although these techniques show satisfactory performance in multimodal emotion recognition, they are based on shallow fusion strategies and have limited capability for jointly modeling and handling multiple input features. Recently, researchers have investigated new methods that directly integrate and optimize all available modal information with deep learning models, fusing multiple modalities into a single representation to increase the reliability of emotion recognition [1].
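The difference between the two shallow strategies can be sketched in a few lines. In the snippet below, the feature dimensions, the random-projection "classifiers", and the probability-averaging rule for late fusion are placeholders chosen for illustration, not components of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)
eeg_feat = rng.normal(size=(8, 64))   # per-trial EEG feature vectors
pps_feat = rng.normal(size=(8, 16))   # per-trial peripheral-signal features

# Early (feature-level) fusion: concatenate features before a single classifier.
early_input = np.concatenate([eeg_feat, pps_feat], axis=1)  # (8, 80)

# Late (decision-level) fusion: each modality gets its own classifier,
# and only the predicted class probabilities are combined (here by averaging).
def stand_in_classifier(x, n_classes=2):
    # A random linear projection plus softmax, standing in for a trained model.
    logits = x @ rng.normal(size=(x.shape[1], n_classes))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

late_probs = (stand_in_classifier(eeg_feat) + stand_in_classifier(pps_feat)) / 2
print(early_input.shape, late_probs.shape)  # (8, 80) (8, 2)
```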

Based on the previous studies discussed above, we propose a novel emotion recognition model and evaluate it as follows: (i) We present a novel neural network fusion method that uses heterogeneous convolutional neural networks combined with multimodal factorized bilinear pooling (HC-MFB) [15]. It first constructs different network structures tailored to the characteristics of each modal signal to extract discriminative deep features, and then applies multimodal factorized bilinear pooling (MFB) to jointly model the correlations and associations between the feature vectors of the individual modalities. (ii) We experimentally investigate the recognition performance obtained by combining EEG signal bands and adding eye movement signals, further verifying the performance of the proposed model. (iii) The experimental results indicate that the proposed HC-MFB multimodal emotion recognition model significantly increases recognition accuracy compared with other deep learning methods and shows clear application prospects for handling different types of multimodal signals.
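For readers unfamiliar with MFB, the following is a minimal sketch of the operation as it is generally described in the literature: two modality-specific feature vectors are projected into a shared (k × o)-dimensional space, multiplied element-wise, sum-pooled over the factor dimension k, and then power- and L2-normalized. The factor size, output dimension, and input dimensions below are illustrative assumptions, not the settings used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusion(nn.Module):
    """Sketch of multimodal factorized bilinear pooling (MFB).

    Dimensions (factor_k, out_dim) are illustrative, not the paper's settings.
    """
    def __init__(self, dim_x, dim_y, factor_k=5, out_dim=256):
        super().__init__()
        self.k, self.o = factor_k, out_dim
        self.proj_x = nn.Linear(dim_x, factor_k * out_dim)
        self.proj_y = nn.Linear(dim_y, factor_k * out_dim)

    def forward(self, x, y):                     # x: (B, dim_x), y: (B, dim_y)
        joint = self.proj_x(x) * self.proj_y(y)  # element-wise product, (B, k*o)
        joint = joint.view(-1, self.o, self.k).sum(dim=2)  # sum-pool over k
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)  # power norm
        return F.normalize(joint, dim=1)         # L2 normalization

# Example: fuse a 128-d EEG feature with a 64-d peripheral-signal feature
fused = MFBFusion(dim_x=128, dim_y=64)(torch.randn(8, 128), torch.randn(8, 64))
print(fused.shape)  # torch.Size([8, 256])
```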

The main contributions of this paper are as follows:

  • 1)

    In processing EEG signals, we use normalized mutual information (NMI) to measure the relationship between EEG channel features and the corresponding emotions, which resolves the information interference caused by irrelevant channels and increases the robustness of the model (a sketch of this selection step is given after this list).

  • 2)

    The trained heterogeneous convolutional neural networks (HCNNs) are employed to automatically extract the convolutional features of different modalities, and the MFB method is used to fuse the multimodal convolutional features output by HCNNs.

  • 3)

    This paper adopts an ensemble strategy to verify the performance of the HCNNs model on the fused features, covering the EEG and peripheral signals of the DEAP and MAHNOB-HCI datasets, and additionally uses the eye movement signals of the MAHNOB-HCI dataset to further explore the effectiveness of the model.
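As referenced in contribution 1), NMI-based channel selection can be sketched as ranking channels by the normalized mutual information between a per-channel feature and the emotion labels. The sketch below uses scikit-learn's normalized_mutual_info_score; the discretization into 8 bins, the top-k cutoff, and the data shapes are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def select_channels(channel_features, labels, top_k=10):
    """Rank EEG channels by normalized mutual information with emotion labels.

    channel_features: (trials, channels) array of per-channel feature values.
    The binning and top_k are illustrative choices, not taken from the paper.
    """
    n_channels = channel_features.shape[1]
    scores = np.empty(n_channels)
    for c in range(n_channels):
        # Discretize the continuous channel feature before computing NMI.
        edges = np.histogram_bin_edges(channel_features[:, c], bins=8)
        binned = np.digitize(channel_features[:, c], edges)
        scores[c] = normalized_mutual_info_score(labels, binned)
    return np.argsort(scores)[::-1][:top_k], scores

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 32))        # e.g., 40 trials, 32 EEG channels
y = rng.integers(0, 2, size=40)      # binary emotion labels
selected, nmi_scores = select_channels(X, y, top_k=10)
print(selected)
```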

The remainder of this paper is organized as follows. Related studies are briefly reviewed in Section 2. The multimodal emotion recognition method based on HC-MFB is presented in Section 3. Section 4 provides the experimental results and discussion. Section 5 summarizes the main conclusions.

Section snippets

Related studies

Recent research has mainly focused on pattern fusion of multimodal emotion recognition. Multimodal datasets typically include video, audio, and physiological signals. Among them, emotion recognition models that use physiological EEG signals in combination with other modal signals have been widely studied.

Nemati et al. [16] proposed a mixed multimodal fusion strategy, which fused audio and visual information by a linear latent-space mapping. An evidence fusion method based on Dempster-Shafer

Methodology

The proposed HC-MFB multimodal emotion recognition model is shown in Fig. 1. Our model includes four sequential tasks: EEG signal channel selection, heterogeneous feature extraction, multimodal fusion, and classification. In this model, NMI is used to select the optimal channels from all channels of the EEG signals. The HCNNs and MFB are the main components of this model. The heterogeneous features of each modality are extracted by the HCNNs and fused by the MFB, and an ensemble strategy is used to

Experiment and results

We use a ten-fold cross-validation method to assess the performance of the proposed method on the DEAP [26] and MAHNOB-HCI [27] datasets. First, for the emotion recognition task, we designed combinations of different EEG bands to validate the performance of the proposed method and added the eye movement signals from the MAHNOB-HCI dataset for further evaluation. Then we compared the proposed method with other models to further verify its performance.
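A minimal sketch of this evaluation protocol is shown below, assuming per-trial fused feature vectors and binary labels. The stratified fold split, the logistic-regression classifier, and the data shapes are placeholders standing in for the HC-MFB model and the real datasets, chosen only to illustrate the ten-fold procedure.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 256))    # placeholder fused feature vectors
y = rng.integers(0, 2, size=400)   # binary valence (or arousal) labels

accs = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # A simple classifier stands in for the full HC-MFB pipeline here.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"10-fold mean accuracy: {np.mean(accs):.4f}")
```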

Conclusions

This paper proposes a novel emotion recognition model using HC-MFB to extract and fuse multimodal convolutional features. First, we pre-process the multimodal signals and use NMI for channel selection of the EEG signals. Then, we use HCNNs to extract the features of the different modalities and use MFB to fuse the convolutional features of the different EEG bands with the convolutional features of each modality. Finally, an ensemble classifier is used to simulate the

CRediT authorship contribution statement

Yong Zhang: Methodology, Writing – original draft. Cheng Cheng: Methodology, Software. Shuai Wang: Investigation, Visualization. Tianqi Xia: Software, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank anonymous reviewers for their valuable comments. This work was supported in part by the National Natural Science Foundation of China (No. 61772252), the Natural Science Foundation of Liaoning Province of China (No. 2019-MS-216), and Scientific Research Foundation of the Education Department of Liaoning Province (No. LJKZ0965).

References (33)

  • X.F. Xing et al.

    SAE+LSTM: A new framework for emotion recognition from multi-channel EEG

    Front. Neurorob.

    (June 2019)
  • Z.M. Wang et al.

    Channel selection method for EEG emotion recognition using normalized mutual information

    IEEE Access

    (2019)
  • Y. Li et al.

    From regional to global brain: a novel hierarchical spatial-temporal neural network model for EEG emotion recognition

    IEEE Trans. Affective Comput.

    (2022)
  • M. Wu et al.

    Two-stage fuzzy fusion based-convolution neural network for dynamic emotion recognition

    IEEE Trans. Affective Comput., in press

    (2022)
  • J. Shukla et al.

    Feature extraction and selection for emotion recognition from electrodermal activity

    IEEE Trans. Affective Comput.

    (2021)
  • S.U. Amin et al.

    Multilevel weighted feature fusion using convolutional neural networks for EEG motor imagery classification

    IEEE Access

    (January 2019)