1 Introduction

Emotion recognition is an increasingly important research subject in communication between humans and machines, enabling the development of technologies that allow a more natural interaction. Emotions are fundamental in the daily life of human beings, as they play an essential role in human cognition, rational decision-making, perception, human interaction, and human intelligence [3].

Emotion is a mental state and an affective reaction caused by an event and grounded in subjective experience. There is, however, an explicit separation between the physiological arousal, the behavioral expression (affect), and the conscious experience of emotion (feeling) [10, 11]. Emotions play an important role in human communication and can be expressed either verbally, through emotional vocabulary, or through nonverbal cues. Behavioral expressions such as facial expressions and gestures, however, can be controlled voluntarily; they are easy to fake or suppress and therefore provide unreliable information. This is not the case for physiological signals such as electrocardiography (ECG), electromyography (EMG), galvanic skin response (GSR), respiration rate (RR) and, particularly, electroencephalography (EEG).

Over the last years, EEG signal analysis has become the preferred technique for analyzing physiological expressions of emotion: the information it contains allows differentiating emotional states and gives researchers a better understanding of human brain physiology and psychology [8]. Moreover, these signals enable a more reliable recognition of emotions because the subject under test cannot alter them at will. Applications have been developed in different areas such as entertainment, e-learning, virtual worlds, and e-healthcare [1].

However, EEG characterization remains an open issue that depends on the application. Recognizing emotion requires generating a feature set that contains the most relevant information, specifically for the binary classification problems posed by the arousal and Valence dimensions. The authors in [1] propose an emotion recognition scheme based on audio-visual stimuli that uses an SVM as classifier. In [7], feature selection and extraction were performed over features computed from EEG in the time, frequency, and time-frequency domains, followed by quadratic discriminant analysis (QDA) for classification. In this paper, we propose a methodology for emotion recognition from EEG signals based on the Valence-arousal emotion model. Spectral and temporal features are derived using the fast Fourier transform (FFT) over four frequency bands: theta (4–8 Hz), alpha (8–16 Hz), beta (16–32 Hz), and gamma (32–64 Hz). Mutual information with forward selection and backward elimination is used for the feature selection stage, and a support vector machine is used for classification. Two binary classification problems are addressed: low/high arousal and low/high Valence. The ratings for each scale are thresholded into two classes (low and high); on the 9-point rating scales, the threshold is simply placed in the middle. The classifier was trained with user-independent data.

2 Model of Emotion

An emotion is a complex psychological state that involves three distinct components: a subjective experience, a physiological response, and a behavioral or expressive response [6]. Various categorizations of emotion have been proposed. One of them is the discrete emotion model: according to Plutchik [13], there are eight basic emotions, namely acceptance, anger, anticipation, disgust, fear, joy, sadness, and surprise. Ekman [4] exposed the relationship between facial expressions and emotions; his theory proposes six emotions: anger, disgust, fear, happiness, sadness, and surprise. He later expanded this set of basic emotions by adding amusement, contempt, contentment, embarrassment, excitement, guilt, pride, relief, satisfaction, sensory pleasure, and shame.

Another is the bi-dimensional emotion model by Russell [14], the most widely used from the dimensional perspective. This model represents emotional states on a plane spanned by Valence and arousal, which can be subdivided into four quadrants, namely, low arousal/low Valence (LALV), low arousal/high Valence (LAHV), high arousal/low Valence (HALV), and high arousal/high Valence (HAHV). Valence represents the quality of emotion and ranges from unpleasant (e.g., sad, stressed) to pleasant (e.g., happy, elated), whereas arousal denotes the quantitative activation level and ranges from inactive (e.g., uninterested, bored) to active (e.g., alert, excited). Some extensions of this model add a third dimension called dominance, which ranges from a feeling of being in control during an emotional experience to a feeling of being controlled by the emotion.
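To make the quadrant structure concrete, the following minimal Python sketch maps a (Valence, arousal) rating pair from a 9-point scale onto the four quadrants; the midpoint threshold of 5 matches the binarization described in Sect. 1.

```python
# Minimal sketch: map a (Valence, arousal) rating pair from the 9-point
# scales onto Russell's four quadrants. The midpoint threshold of 5
# follows the binarization used in this paper.
def quadrant(valence: float, arousal: float, thr: float = 5.0) -> str:
    v = "HV" if valence > thr else "LV"
    a = "HA" if arousal > thr else "LA"
    return a + v  # e.g. "HAHV" for high arousal / high Valence

print(quadrant(7.2, 3.1))  # -> "LAHV"
```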

3 Experimental Setup

3.1 Database

The dataset for emotion analysis using EEG, physiological, and video signals (DEAP) was used in this research [9]. Thirty-two participants took part in the experiment, and their EEG and peripheral physiological signals, such as electromyography (EMG), electrooculography (EOG), skin temperature, respiration pattern, blood volume pressure, and GSR, were recorded while they watched 40 selected music videos. The 40 one-minute video clips, used as visual stimuli to elicit different emotions, were carefully pre-selected so that their intended arousal and Valence values span as large an area of the arousal/Valence space as possible. After each trial/video, each participant performed a self-assessment, giving continuous ratings from 1 to 9 of their level of arousal, Valence, like/dislike, and dominance; self-assessment manikins were used to visualize the scales. EEG and peripheral signals were recorded at a sampling rate of 512 Hz, then downsampled to 128 Hz; eye artifacts were removed and a band-pass filter from 4 to 45 Hz was applied. For further information, interested readers can refer to [9].
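As an illustration, the preprocessed Python release of DEAP can be loaded as sketched below; the file name, array layout, and label order are taken from the dataset documentation (and should be checked against it), while the midpoint thresholding matches the binarization described in Sect. 1.

```python
import pickle

# Sketch of loading one participant from DEAP's preprocessed Python
# release (layout assumed from the dataset docs): 'data' is
# (40 trials, 40 channels, 8064 samples at 128 Hz) and 'labels' is
# (40 trials, 4) in the order Valence, arousal, dominance, liking,
# each rated on a 1-9 scale.
with open("s01.dat", "rb") as f:
    subject = pickle.load(f, encoding="latin1")

eeg = subject["data"][:, :32, :]    # first 32 channels are EEG
ratings = subject["labels"]

# Binarize at the scale midpoint, as described in Sect. 1.
y_valence = (ratings[:, 0] > 5).astype(int)   # 1 = high Valence
y_arousal = (ratings[:, 1] > 5).astype(int)   # 1 = high arousal
```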

3.2 Feature Generation

In the design of an emotion recognition system, the selection of effective features is an important step. Coan et al. [2] showed that positive emotions are associated with left frontal brain activity, whereas negative emotions are associated with right frontal brain activity. They also revealed that the decrease in activity in other brain regions, such as the central, temporal, and mid-frontal areas, was smaller than in the frontal region. Therefore, only ten channels of the EEG record have been selected: F3, F4, F7, F8, FC1, FC2, FC5, FC6, FP1, and FP2.
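A sketch of this channel selection step follows; the name-to-index mapping must come from the montage of the recording, so `channel_names` below is an assumed input, not the actual DEAP channel order.

```python
import numpy as np

# Sketch: restrict a (trials, channels, samples) EEG array to the ten
# frontal channels listed above. `channel_names` must hold the montage
# order of the recording; it is an assumed input, not the DEAP order.
FRONTAL = ["F3", "F4", "F7", "F8", "FC1", "FC2", "FC5", "FC6", "FP1", "FP2"]

def select_frontal(eeg: np.ndarray, channel_names: list) -> np.ndarray:
    idx = [channel_names.index(ch) for ch in FRONTAL]
    return eeg[:, idx, :]
```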

Time Domain Features. Time-domain features are computed on the natural (time) representation of the EEG. The time descriptors we consider are listed in Table 1.

Table 1. Time domain features
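The contents of Table 1 are not reproduced here. As an assumed illustration only, the sketch below computes a few descriptors commonly used as EEG time-domain features (simple moments and the Hjorth parameters); it is not a transcription of the table.

```python
import numpy as np

# Illustrative time-domain descriptors for one EEG channel x (1-D array).
# These statistics (moments plus the Hjorth parameters) are a common
# choice in EEG work and are shown only as an assumed example.
def time_features(x: np.ndarray) -> dict:
    dx = np.diff(x)    # first derivative
    ddx = np.diff(dx)  # second derivative
    activity = np.var(x)                         # Hjorth activity
    mobility = np.sqrt(np.var(dx) / activity)    # Hjorth mobility
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return {
        "mean": np.mean(x),
        "std": np.std(x),
        "activity": activity,
        "mobility": mobility,
        "complexity": complexity,
    }
```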

Frequency Domain Features. The frequency-domain features are computed from the well-known fast Fourier transform (FFT) to discriminate harmonic patterns. Here, \(\varXi (f)\in \mathbb {R}^H\) denotes the spectrum vector, \(\mathbf {\Lambda }\in \mathbb {R}^H\) the frequency index vector with elements \(\lambda _h=hF/2H\), and \(F\in \mathbb {R}\) the sampling frequency. The features computed from the FFT are listed in Table 2.

Table 2. Frequency domain features based on FFT
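For illustration, a minimal sketch of the band-power computation over the four bands defined in Sect. 1, using NumPy's FFT; the choice of mean band power as the descriptor is illustrative rather than a transcription of Table 2.

```python
import numpy as np

# Sketch: average power in the four bands used in this paper, computed
# from the FFT of one channel. fs is the sampling frequency (128 Hz
# after the DEAP preprocessing described in Sect. 3.1).
BANDS = {"theta": (4, 8), "alpha": (8, 16), "beta": (16, 32), "gamma": (32, 64)}

def band_powers(x: np.ndarray, fs: float = 128.0) -> dict:
    spectrum = np.abs(np.fft.rfft(x)) ** 2        # power spectrum, Xi(f)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)   # frequency index vector
    return {
        name: spectrum[(freqs >= lo) & (freqs < hi)].mean()
        for name, (lo, hi) in BANDS.items()
    }
```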

Time-Frequency Domain Features. To find an informative representation of the EEG signal that relates time-domain events with frequency-domain ones, we compute the Hilbert-Huang spectrum (HHS) of each signal, obtained via empirical mode decomposition, which expresses the original signal through its intrinsic mode functions (IMFs). We also apply the discrete wavelet transform (DWT), which decomposes the signal into approximation and detail levels corresponding to different frequency ranges while preserving the time information of the signal.
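Only the wavelet half is sketched below (the HHS computation via empirical mode decomposition is omitted). With a 128 Hz signal, a 4-level decomposition yields detail bands that roughly match the four bands above; the `db4` mother wavelet and the energy descriptor are assumptions, as the exact choices are not stated here.

```python
import numpy as np
import pywt

# Sketch: 4-level DWT of a 128 Hz signal. The detail coefficients cover
# roughly 32-64 Hz (cD1), 16-32 Hz (cD2), 8-16 Hz (cD3) and 4-8 Hz (cD4).
# The 'db4' wavelet and the per-band energy descriptor are assumptions.
def dwt_features(x: np.ndarray) -> list:
    coeffs = pywt.wavedec(x, "db4", level=4)   # [cA4, cD4, cD3, cD2, cD1]
    return [float(np.sum(c ** 2)) for c in coeffs]   # energy per sub-band
```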

Electrode Combination-Based Features. Considering the relations between the channels of the EEG, we calculated the magnitude-squared coherence estimate \(C_{ij}=\frac{\left| P_{ij}(f)\right| ^2}{P_i(f)P_j(f)}\), where \(P_{ij}(f)\) is the cross-power spectral density of the pair of electrodes \(i,j\) and \(P_i(f)\), \(P_j(f)\) are their individual power spectral densities, and the differential asymmetry \(\varDelta \xi =\xi _l-\xi _r\), where \(l\) and \(r\) index electrodes on the left and right hemispheres of the scalp. Both measures capture the relations between channels.
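A sketch of both electrode-combination features using SciPy's coherence estimator follows; averaging the coherence over the alpha band is an illustrative choice.

```python
import numpy as np
from scipy.signal import coherence

# Sketch of the electrode-combination features. scipy.signal.coherence
# returns the magnitude-squared coherence C_ij(f) = |P_ij(f)|^2 /
# (P_i(f) P_j(f)); averaging over the alpha band is illustrative.
def alpha_coherence(x_i: np.ndarray, x_j: np.ndarray, fs: float = 128.0) -> float:
    f, c_ij = coherence(x_i, x_j, fs=fs)
    return float(c_ij[(f >= 8) & (f < 16)].mean())

# Differential asymmetry Delta xi = xi_l - xi_r for a scalar feature xi
# computed on a left-hemisphere and a right-hemisphere electrode.
def diff_asymmetry(xi_left: float, xi_right: float) -> float:
    return xi_left - xi_right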

4 Feature Selection Based on Mutual Information

Let \(\left\{ {\mathbf {x}}_i, y_i\right\} _{i=1}^{N}\) be the training data set of a multi-class classification problem, where \({\mathbf {x}}_i\) is a \(P\)-dimensional feature vector corresponding to instance \(i\), and \(y_i \in \left\{ 1, \dots , C\right\} \) is the label of \({\mathbf {x}}_i\). For compactness, we define the input matrix \({\mathbf {X}} = \left\{ {\mathbf {x}}_i\right\} _{i=1}^{N} \in \mathbb {R}^{N\times P}\) and the label vector \({\mathbf {y}} =\left\{ y_1, \dots , y_N\right\} \in \mathbb {R}^N\). Similarly, let \(\varvec{\zeta }_j \in \mathbb {R}^{N}\) be the column \(j \in \left\{ 1, \ldots , P\right\} \) of the matrix \({\mathbf {X}}\). We use the criterion proposed by Peng et al. [12], called minimal-redundancy-maximal-relevance (mRMR), which combines the Max-Relevance (D) and min-Redundancy (R) criteria. mRMR finds a set \({\mathbf {s}} \in \mathbb {R}^L\), with \(L\le P\), containing the indices \(j \in \left\{ 1, \ldots , P\right\} \) of the most relevant features, i.e., those that jointly achieve the highest explanation of the target class \({\mathbf {y}}\). The mRMR set is obtained by maximizing \(\varPhi \left( \text {D},\text {R}\right) \), where \(\varPhi \) is defined as \(\varPhi \left( \text {D},\text {R}\right) =\text {D}-\text {R}\). The Max-Relevance criterion D and the min-Redundancy criterion R are defined as follows

$$\begin{aligned} \text {D}\left( {\mathbf {s}}, {\mathbf {y}}\right)&=\frac{1}{\left| {\mathbf {s}}\right| _{\#}}\sum _{j\in {{\mathbf {s}}}}{\text {I}}\left( \varvec{\zeta }_j; {\mathbf {y}}\right) , \qquad \text {R}({\mathbf {s}}) =\frac{1}{\left| {\mathbf {s}}\right| _{\#}^{2}}\sum _{j,k\in {{\mathbf {s}}}}{\text {I}}\left( \varvec{\zeta }_j;\varvec{\zeta }_k\right) , \end{aligned}$$

where \(|{\mathbf {s}}|_{\#}\) denotes the cardinality of \({\mathbf {s}}\), and \({\text {I}}({\mathbf {a}}; {\mathbf {b}})\) is the mutual information between \({\mathbf {a}}\) and \({\mathbf {b}}\). A remaining issue is how to determine the optimal number of features \(L\). The algorithms used for this task are based on two sequential searches, forward selection (FS) and backward elimination (BE), over the matrix \({\mathbf {Q}} \in \mathbb {R}^{P\times P}\), whose elements are defined as \( {Q}_{j,k} = {\text {I}}\left( \varvec{\zeta }_j,\varvec{\zeta }_k; {\mathbf {y}}\right) \).
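A greedy forward-selection sketch of the mRMR criterion is shown below. The mutual-information terms are estimated with scikit-learn's nearest-neighbour estimators, and the stopping point \(L\) is fixed as an input here, whereas in this paper it is determined through the FS/BE searches over \({\mathbf {Q}}\).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

# Greedy forward-selection sketch of mRMR: at each step, pick the feature
# maximizing relevance I(zeta_j; y) minus the mean redundancy
# I(zeta_j; zeta_k) over the already selected set s (Phi = D - R).
def mrmr_forward(X: np.ndarray, y: np.ndarray, L: int) -> list:
    P = X.shape[1]
    relevance = mutual_info_classif(X, y)   # I(zeta_j; y) for every j
    selected, remaining = [], list(range(P))
    while len(selected) < L:
        best_j, best_score = None, -np.inf
        for j in remaining:
            if selected:
                red = np.mean([
                    mutual_info_regression(X[:, [k]], X[:, j])[0]
                    for k in selected
                ])
            else:
                red = 0.0
            score = relevance[j] - red       # Phi = D - R for candidate j
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```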

5 RUSBoost Ensemble

Given an unbalanced training data set \(\left\{ {\mathbf {X}}, {\mathbf {y}}\right\} \), the RUSBoost algorithm proposed by Seiffert et al. [15] combines two components, random under-sampling and adaptive boosting (AdaBoost), both used for imbalanced classification. Here, we briefly describe both techniques in order to then describe the RUSBoost algorithm.

Random Under-Sampling: Data sampling techniques attempt to alleviate the problem of class imbalance by adjusting the class distribution of the training data set, either by removing examples from the majority class (under-sampling) or by adding examples to the minority class (over-sampling). Random under-sampling does the former, randomly discarding majority-class examples until the desired class distribution is reached.

AdaBoost: Boosting is a meta-learning technique designed to improve the classification performance of weak learners. The main idea of boosting is to iteratively create an ensemble of weak hypotheses, which are combined to predict the class of unlabeled examples. Initially, all examples in the training data set are assigned equal weights. During each iteration of AdaBoost, a weak hypothesis is formed by the base learner; the error associated with the hypothesis is calculated, and the weight of each example is adjusted such that misclassified examples have their weights increased while correctly classified examples have their weights decreased. Therefore, subsequent iterations of boosting will generate hypotheses that are more likely to classify the previously mislabeled examples correctly. Once all the iterations are completed, a weighted vote of all hypotheses is used to assign a class to the unlabeled examples. Since boosting assigns higher weights to misclassified examples, and minority-class examples are those most likely to be misclassified, minority-class examples tend to receive higher weights during the boosting process, making it similar in many ways to cost-sensitive classification [5].
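As a usage sketch, the imbalanced-learn package provides an independent implementation of RUSBoost; this is not the code used in our experiments, and the data and hyper-parameters below are illustrative.

```python
from imblearn.ensemble import RUSBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (80/20 class split) as a stand-in for
# the EEG feature matrix; all parameters here are illustrative.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# RUSBoost: random under-sampling of the majority class inside each
# AdaBoost iteration, as described above.
clf = RUSBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```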

6 Results

For both the Arousal and Valence targets, the two sequential searches (FS and BE) select more than 1200 features, which belong almost entirely to the frequency domain, with some from the time-frequency domain (Fig. 1).

Fig. 1. Scores for every feature for both class targets: sub-figure (a) shows the score found by the sequential searches for the Valence class, and sub-figure (b) shows the same score for the Arousal class.

Tables 3 and 4 show the performance obtained with the base learners using the whole feature set (NSF), the features selected with FS, and the features selected with BE. These results were obtained with a hold-out validation scheme, using 70% of the data for training and 30% for testing; the reported statistics are derived as in the sketch below.
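The three statistics reported in the tables follow from the binary confusion matrix of the predictions on the 30% test split, e.g.:

```python
from sklearn.metrics import confusion_matrix

# Accuracy, sensitivity, and specificity from binary predictions, as
# reported in Tables 3 and 4 (labels assumed to be 0 = low, 1 = high).
def holdout_metrics(y_true, y_pred) -> dict:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }
```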

Table 3. Comparative results of the base learners used for the ensemble in terms of accuracy, sensitivity, and specificity for Arousal.

The decision tree was the base learner that achieved the highest accuracy in all three settings: NSF, FS, and BE. The SVM also reached results similar to the decision tree for both the Arousal and Valence targets; however, the SVM is more unstable than the decision tree. It is important to note that both selection strategies considerably improved the three statistics (accuracy, sensitivity, and specificity). This indicates that several features may indeed confuse a classifier, possibly because many of them add redundancy, or because some of them offer no information about the state of interest.

Table 4. Comparative results of the base learners used for the ensemble in terms of accuracy, sensitivity, and specificity for Valence.

From the tables of results, we can conclude that the Valence target is more difficult to detect than Arousal; this could be due to the features computed, or the selected channels may not be sufficient for Valence recognition. Nevertheless, with the ensembles it is possible to obtain classification results above 70% in accuracy, sensitivity, and specificity. Even though we did not perform parameter tuning for some of the base learners, these results show that better performance is achievable, demonstrating that ensemble classifiers can be stronger than single classifiers.

It is important to remark that our data includes only 22 participants, those with better balance between classes for both scales, Arousal and Valence. Besides, we tested with all the participants mixed, instead of leaving one subject out for testing and training with the remaining ones.

7 Conclusions

We have presented an effective strategy for classifying emotions that achieves an accuracy above 70%, with a feature selection stage that efficiently finds the feature set that best explains the target label for both scales. Although our data handling does not follow the standard of state-of-the-art methods, which train on some patients and test on the remaining ones, our methodology proves to be a viable alternative. As future work, we propose to adopt the same training/test structure as the state-of-the-art methods.