Emotion recognition using heterogeneous convolutional neural networks combined with multimodal factorized bilinear pooling
Introduction
Emotion recognition technology senses a person's emotional state by analyzing and processing signals collected by sensors. To realize a harmonious and efficient human–computer interaction environment and make computers more intelligent, many researchers have conducted extensive research on various kinds of emotional information [1], [2], [3]. These studies have shown that emotional swings can lead to changes in behavior, mentality, and physiology. Non-physiological signals such as facial expressions, speech, intonation, and body postures can express emotions independently. However, in some circumstances such signals do not accurately reflect a person's emotions, because they are subject to conscious control [4]. In contrast, physiological signals such as the electroencephalogram (EEG) and peripheral physiological signals (PPS), including electromyography (EMG) and electrooculogram (EOG) signals, are not influenced by human subjective factors but are dominated by the autonomic nervous system, so they can objectively and truthfully reflect a person's emotional state. Consequently, many emotion recognition methods based on EEG and PPS signals have been proposed, and researchers have begun to combine EEG signals with other modal signals [5], [6], [7], which is more practical and reliable than traditional unimodal methods.
Emotion, as a subjective feeling, is difficult to represent with quantitative models [8]. In many studies, a two-dimensional space composed of valence and arousal is used to model emotion: the valence dimension ranges from unpleasant to pleasant, and the arousal dimension from inactive to active. Although EEG signals can reveal patterns of brain activity and indicate various emotional states induced by the external environment, emotion recognition using all EEG channels suffers from data redundancy and complex hardware, which makes it unsuitable for daily wearable monitoring devices [9]. It is therefore important to select appropriate channels and alleviate the overfitting caused by irrelevant channels in order to improve recognition performance. The emotion recognition literature has also investigated brain asymmetry, and the results show that the EEG regions that are informative differ according to the aspect of emotion under study [10].
The main steps of multimodal emotion recognition and affective computing are feature extraction and multimodal fusion [11]. The goal of feature extraction is to identify the important elements of the input signal, build feature vectors from these elements, and use the feature vectors to classify the corresponding emotions, which simplifies the subsequent classification task [12]. Recently, deep learning has gradually been applied to feature extraction [13]. In this approach, a neural network can process raw data without manual preprocessing: the raw data are decomposed into successive levels of abstraction, and relevant features are extracted automatically, avoiding hand-crafted feature extraction techniques.
Multimodal fusion integrates the statistical properties of different modalities to benefit from their complementarity, which can significantly improve recognition performance and provide robustness when feature extraction fails for one modality [14]. Multimodal fusion is typically performed as early fusion or late fusion. Although these techniques achieve satisfactory performance in multimodal emotion recognition, they rely on shallow fusion strategies and have limited ability to jointly model and handle multiple input features. Recently, researchers have investigated methods that directly integrate and optimize all available modality information with deep learning models, fusing multiple modalities into a single representation to increase the reliability of emotion recognition [1].
Based on the studies discussed above, we propose a novel emotion recognition model and evaluate it as follows: (i) We present a neural network fusion method that uses heterogeneous convolutional neural networks combined with multimodal factorized bilinear pooling (HC-MFB) [15]. It first constructs different network structures according to the characteristics of each modal signal to extract discriminative deep features, and then applies multimodal factorized bilinear pooling (MFB) to jointly model the correlations and associations between the feature vectors of the individual modalities. (ii) We experimentally investigate the recognition performance achievable by combining EEG frequency bands and adding eye movement signals, further verifying the performance of the proposed model. (iii) The experimental results indicate that the proposed HC-MFB multimodal emotion recognition model significantly improves recognition accuracy compared with other deep learning methods and shows clear application prospects for handling different types of multimodal signals.
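In general, MFB projects each modality's feature vector with a learned matrix, multiplies the projections element-wise, and sum-pools over groups of k factors, followed by power and L2 normalization. The NumPy sketch below illustrates the operation; the matrix shapes, toy dimensions, and function name are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def mfb_fuse(x, y, U, V, k):
    """Multimodal factorized bilinear pooling (sketch).
    x: (m,) feature from one modality; y: (n,) feature from another.
    U: (m, k*o) and V: (n, k*o) are learned projections; o = U.shape[1] // k.
    """
    joint = (U.T @ x) * (V.T @ y)            # element-wise product, shape (k*o,)
    z = joint.reshape(-1, k).sum(axis=1)     # sum-pool over each group of k factors
    z = np.sign(z) * np.sqrt(np.abs(z))      # signed square-root (power) normalization
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z       # L2 normalization

# toy example with random stand-ins for convolutional features
rng = np.random.default_rng(0)
m, n, k, o = 8, 6, 3, 4
x, y = rng.standard_normal(m), rng.standard_normal(n)
U, V = rng.standard_normal((m, k * o)), rng.standard_normal((n, k * o))
z = mfb_fuse(x, y, U, V, k)
print(z.shape)  # (4,)
```

The low-rank factorization keeps the expressiveness of a bilinear interaction between modalities while avoiding the full m×n×o parameter tensor.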
The main contributions of this paper are as follows:
- 1)
In processing EEG signals, we use normalized mutual information (NMI) to measure the relationships between EEG signal features and corresponding emotions, which solves the problem of information interference caused by irrelevant channels and increases the robustness of the model.
- 2)
The trained heterogeneous convolutional neural networks (HCNNs) are employed to automatically extract the convolutional features of different modalities, and the MFB method is used to fuse the multimodal convolutional features output by HCNNs.
- 3)
This paper adopts an ensemble strategy to verify the performance of the HCNNs model on the fused features, which include the EEG and peripheral signals of the DEAP and MAHNOB-HCI datasets, and adds the eye movement signals of the MAHNOB-HCI dataset to further explore the effect of the model.
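As a sketch of the NMI-based channel selection in contribution 1), the snippet below scores each discretized channel feature against the emotion labels and keeps the highest-scoring channels. The equal-width binning, the sqrt(H·H) normalization, and the helper names are assumptions for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

def normalized_mutual_information(a, b):
    """NMI between two discrete sequences, normalized by sqrt(H(a) * H(b))."""
    a = np.unique(np.asarray(a), return_inverse=True)[1]
    b = np.unique(np.asarray(b), return_inverse=True)[1]
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)              # joint histogram of (a, b) pairs
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    return mi / np.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0

def select_channels(features, labels, top_k, n_bins=8):
    """Rank channels (columns of `features`) by NMI with the emotion labels."""
    scores = []
    for ch in range(features.shape[1]):
        col = features[:, ch]
        edges = np.linspace(col.min(), col.max(), n_bins + 1)[1:-1]  # equal-width bins
        scores.append(normalized_mutual_information(np.digitize(col, edges), labels))
    return np.argsort(scores)[::-1][:top_k]
```

Channels whose features carry little information about the labels receive scores near zero and are dropped, which is how irrelevant-channel interference is reduced.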
The remainder of this paper is organized as follows. Related studies are briefly reviewed in Section 2. The multimodal emotion recognition method based on HC-MFB is presented in Section 3. Section 4 provides experimental results and discussion. Section 5 summarizes the main conclusions.
Related studies
Recent research has mainly focused on pattern fusion of multimodal emotion recognition. Multimodal datasets typically include video, audio, and physiological signals. Among them, emotion recognition models that use physiological EEG signals in combination with other modal signals have been widely studied.
Nemati et al. [16] proposed a mixed multimodal fusion strategy, which fused audio and visual information through a linear latent-space mapping. An evidence fusion method based on Dempster-Shafer
Methodology
The proposed HC-MFB multimodal emotion recognition model is shown in Fig. 1. Our model comprises four sequential tasks: EEG channel selection, heterogeneous feature extraction, multimodal fusion, and classification. In this model, NMI is used to select the optimal channels from all EEG channels. The HCNNs and MFB are the main components of this model. The heterogeneous features of each modality are extracted by the HCNNs and fused by the MFB, and an ensemble strategy is used to
Experiment and results
We use ten-fold cross-validation to assess the performance of the proposed method on the DEAP [26] and MAHNOB-HCI [27] datasets. First, for the emotion recognition task, we designed combinations of different EEG frequency bands to validate the performance of the proposed method, and added the eye movement signals from the MAHNOB-HCI dataset for further evaluation. Then we compared the proposed method with other models to further verify its performance.
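Ten-fold cross-validation partitions the samples into ten disjoint folds and rotates the held-out fold across ten train/test runs. A generic sketch follows; the shuffling seed and the sample-level (rather than subject-level) split are assumptions, and the paper's exact protocol may differ.

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index arrays for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)           # k nearly equal, disjoint folds
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

# every sample is held out exactly once across the ten splits
splits = list(kfold_indices(100))
print(len(splits), len(splits[0][1]))  # 10 10
```

Averaging accuracy over the ten held-out folds gives a less optimistic estimate than a single train/test split.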
Conclusions
This paper proposes a novel emotion recognition model using HC-MFB to extract and fuse multimodal convolutional features. First, we preprocess the multimodal signals and use NMI for channel selection of the EEG signals. Then, we use HCNNs to extract features from the different modalities, and use MFB to fuse the convolutional features of the different EEG frequency bands with the convolutional features of each other modality. Finally, an ensemble classifier is used to simulate the
CRediT authorship contribution statement
Yong Zhang: Methodology, Writing – original draft. Cheng Cheng: Methodology, Software. Shuai Wang: Investigation, Visualization. Tianqi Xia: Software, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank anonymous reviewers for their valuable comments. This work was supported in part by the National Natural Science Foundation of China (No. 61772252), the Natural Science Foundation of Liaoning Province of China (No. 2019-MS-216), and Scientific Research Foundation of the Education Department of Liaoning Province (No. LJKZ0965).
References (33)
- et al., Automated accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques, Knowl.-Based Syst. (2021)
- et al., Development of a real-time emotion recognition system using facial expressions and EEG based on machine learning and deep neural network methods, Inf. Med. Unlocked (2020)
- et al., A machine learning model for emotion recognition from physiological signals, Biomed. Signal Process. Control (2020)
- et al., A multimodal emotion recognition method based on facial expressions and electroencephalography, Biomed. Signal Process. Control (2021)
- et al., Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition, Comput. Vis. Image Underst. (2018)
- et al., EEG-based emotion recognition using simple recurrent units network and ensemble learning, Biomed. Signal Process. Control (2020)
- et al., Emotion recognition using multimodal deep learning in multiple psychophysiological signals and video, Int. J. Mach. Learn. Cybern. (2020)
- P. Santhiya, S. Chitrakala, A survey on emotion recognition from EEG signals: approaches, techniques &...
- et al., Multimodal emotion recognition based on ensemble convolutional neural network, IEEE Access (2020)
- et al., MPED: a multi-modal physiological emotion database for discrete emotion recognition, IEEE Access (2019)