
1 Introduction

Automatic pain and emotion recognition systems have typically been developed based on a specific modality, such as video signals, particularly facial expressions [7, 15, 21, 22, 30, 32], or biophysiological signals [1, 3, 5, 10, 12, 16, 17]. More recently, multi-modal systems in which several modalities are combined to improve pain intensity recognition performance have been investigated [2, 8, 9, 13, 14, 19, 33]. The audio modality has also been successfully applied to pain intensity estimation [29, 31], but the most common modalities involved in the assessment of pain intensity remain the video and biophysiological channels.

Nowadays, Artificial Neural Networks (ANNs) are considered to be powerful methods for pattern recognition and data analysis [23, 24]; in particular, the so-called deep neural networks have attracted enormous attention in recent times. However, the design of an ANN architecture for a classification or estimation task is still an open issue, and the success of an ANN architecture typically depends heavily on the experience of the machine learning or ANN engineer. Automatic generation, design and evaluation of ANN architectures would be a useful concept, as in many problems the optimal architecture is not known beforehand. Typically, developers use a trial-and-error method to determine the ANN structure.

Here is a brief list of hyper-parameters that are often varied by hand during ANN hyper-parameter optimisation: learning rate, batch size, number of training epochs, types of layers, number of layers, activation function of the neurons in different layers, and the dropout rate. In our experimental study we focus mainly on the following hyper-parameters: the number of layers and their types, the number of neurons per layer and their activation functions.

In this paper, we propose to use Evolutionary Algorithms (EAs) for ANN structure adaptation. We use two types of ANNs: the Feedforward Neural Network (FNN) and the Recurrent Neural Network (RNN). The FNN and RNN structures are optimised using a Self-Configuring Genetic Algorithm (SelfCGA) and Self-Configuring Genetic Programming (SelfCGP), respectively, borrowed from [25, 26]. The Self-Configuring modifications allow us to overcome the problem of selecting settings for the Genetic Algorithm (GA) and Genetic Programming (GP).

We use the Keras framework [4] in Python to train and build ANNs.

The remainder of this work is organised as follows. Section 2 describes the dataset. Section 3 gives a short description of the Self-Configuring technique for the EAs used for ANN structure adaptation. Sections 4 and 5 describe the RNN and FNN structure optimisation using SelfCGP and SelfCGA, respectively. In Sect. 6 a description of Particle Swarm Optimisation with parasitic behaviour (PSOPB) for feature selection is provided. Experiments as well as the corresponding results are presented in Sect. 7, followed by the discussion and conclusion in Sect. 8.

2 Dataset Description

The data utilized in the present work was recently collected with the goal of generating a multimodal corpus designed specifically for research in the domain of emotion and pain recognition. It consists of 40 participants (20 male, 20 female), each subjected to two sessions of experiments of about 40 min each, during which several pain and emotion stimuli were triggered and the demeanour of each participant was recorded using audio, video and biophysiological sensors.

The pain stimuli were elicited through heat generated by a Medoc Pathway thermal stimulator. The experiment was repeated twice for each participant, each time with the ATS thermode attached to a different forearm (left and right). Before the data was recorded, each participant's pain threshold temperature and pain tolerance temperature were determined. Based on both temperatures, an intermediate heat stimulation temperature was computed such that the range between the threshold and tolerance temperatures was divided into two equally spaced intervals.

Fig. 1. Pain stimulation. \(T_{0}\): baseline temperature (\(32\,^{\circ }\)C); \(T_{1}\): pain threshold temperature; \(T_{2}\): intermediate temperature; \(T_{3}\): pain tolerance temperature.

A specific emotional elicitation was triggered simultaneously with each pain elicitation in the form of pictures and video clips. These were carefully selected with the purpose of triggering specific emotional responses. This allowed a categorisation of the emotion stimuli in a two-dimensional valence-arousal space into the following groups: positive (positive valence, high arousal); negative (negative valence, low arousal); neutral (neutral valence, neutral arousal).

Each heat temperature (pain stimulation) was triggered randomly 30 times with a randomised pause lasting between 8 and 12 s between consecutive stimuli. The randomised and simultaneous emotion stimuli were distributed for each heat temperature (pain stimulation) as well as the baseline temperature (no pain stimulation) as follows: 10 positive, 10 negative and 10 neutral emotion elicitations. Each stimulation consisted of a 2-s onset during which the temperature was gradually elevated starting from the baseline temperature until the specific heat temperature was reached. Following this, the attained temperature was maintained for 4 s before being gradually decreased until the baseline temperature was reached. A recovery phase of 8–12 s followed before the next pain stimulation was elicited (see Fig. 1 for more details).

Therefore, each participant is represented by two sets of data, each one representing the experiments conducted on each forearm (left and right). Each dataset consists of 120 pain stimuli with 30 stimuli per temperature (\(T_{0}\): baseline, \(T_{1}\): threshold, \(T_{2}\): intermediate, \(T_{3}\): tolerance), and 120 emotion stimuli with 40 stimuli per emotion category (positive, negative, neutral).

The synchronous data recorded from the experiments consists of 3 high-resolution video streams from 3 different perspectives, 2 audio lines recorded respectively from a headset and a directional microphone, and 4 physiological channels, namely the electromyographic activity of the trapezius muscle (EMG), the galvanic skin response (GSR), the electrocardiogram (ECG) and the respiration (RSP). Furthermore, an additional video and audio stream were recorded using the Microsoft Kinect sensor.

The focus of the present work is the investigation of the relevance of the video and biophysiological channels regarding the task of pain intensity recognition. Thus, the recognition of the different categories of emotion and the impact of the emotion stimuli on pain recognition are not investigated.

In this paper, we concentrate on a binary classification of pain level, distinguishing between the \(T_{0}\) and \(T_{3}\) temperatures. All approaches have been tested on the two parts of the dataset corresponding to the left and right forearms.

3 Evolutionary Algorithms

Evolutionary algorithms (EAs) are common population-based methods used for global optimisation problems [6]. In this investigation, we consider two EAs: the Genetic Algorithm, which represents solutions as binary strings, and the Genetic Programming algorithm, where solutions are encoded as binary trees. The main advantage of EAs, in contrast to gradient-based methods, lies in their "creativity": by recombining pieces of solutions from the population, unexpectedly effective results can arise that would otherwise be difficult to predict. At the same time, the main disadvantage is the large amount of computation required when the settings are poorly selected. Indeed, in the course of evolution, many bad hypotheses have to be tested. To address this problem, modified algorithms should be used. For instance, effective combinations of evolutionary operators allow an optimal solution to be found with fewer objective function evaluations.

3.1 Self-configuration of EAs

Both GA and GP have many adjustable parameters. Since the number of parameter combinations is large, a brute-force search is not always possible. To overcome this problem, we use the operator-based Self-Configuration technique in this study. The main idea is that this technique should choose the most useful combinations of operators from all those available in GA and GP. At the beginning, all operators have the same probability of being chosen for generating new offspring. During the run, Self-Configuration changes these probabilities based on the fitness improvement of the offspring generated by a certain operator. The evolutionary operators concerned are selection, crossover and the level of mutation.

We have used the following operator options for Self-Configuration in this study. There are three types of selection: proportional, rank-based, and tournament with three tournament sizes (2, 5 and 9). Three types of crossover (one-point, two-point and uniform) for GA and two types (standard and one-point) for GP are included. Three levels of mutation are also included: weak \(\frac{1}{5n}\), medium \(\frac{1}{n}\) and strong \(\frac{5}{n}\), where n is the actual depth of the tree in GP and the length of the binary string in GA.

The general procedure of SelfCGA and SelfCGP is as follows (a code sketch of the probability handling in steps 1 and 9 is given after the list):

1. Set equal probabilities for all possible options of each operator type (each operator type has its own probability distribution).
2. Initialise the first population.
3. Select the types of selection, recombination and mutation.
4. Identify parents using the selected selection operator.
5. Cross the parents using the selected crossover type.
6. Mutate the offspring with the selected mutation probability level.
7. Estimate the fitness of the new offspring.
8. Repeat steps 3–7 until the new generation is formed.
9. Recalculate the operator option probabilities using the average fitness of the offspring obtained with each option.
10. Check the stop condition; if it is not reached, go to step 3, otherwise stop the search and take the offspring with the best fitness as the final solution.
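As an illustration of the probability handling in steps 1 and 9, the following minimal Python sketch shows one possible update rule; the function names and the exact redistribution scheme are our assumptions, the precise scheme is defined in [25, 26].

def init_probabilities(options):
    # Step 1: every option of an operator type starts with equal probability.
    return {name: 1.0 / len(options) for name in options}

def update_probabilities(probs, avg_offspring_fitness, min_prob=0.05, step=0.02):
    # Step 9 (illustrative assumption): shift probability mass towards the
    # option whose offspring had the best average fitness, while never
    # letting any option drop below min_prob.
    best = max(avg_offspring_fitness, key=avg_offspring_fitness.get)
    for name in probs:
        if name == best or probs[name] <= min_prob:
            continue
        transfer = min(step, probs[name] - min_prob)
        probs[name] -= transfer
        probs[best] += transfer
    return probs

selection_probs = init_probabilities(
    ["proportional", "rank", "tournament_2", "tournament_5", "tournament_9"])
# avg_offspring_fitness would be the mean fitness of offspring produced with each option
selection_probs = update_probabilities(
    selection_probs,
    {"proportional": 0.41, "rank": 0.44, "tournament_2": 0.52,
     "tournament_5": 0.47, "tournament_9": 0.39})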

3.2 Fitness Function

The fitness function is an indicator of the success of a solution in EAs. It ranges from 0 to 1 (a perfect solution would have a fitness of 1) and has to be defined for each particular problem. In this research, we have to evaluate the effectiveness of different ANN structures. Usually, in the course of ANN training, cross-entropy is used as the loss function for backpropagation. The cross-entropy is non-negative (a perfect model would have a cross-entropy loss of 0) and allows the ANN to be trained most effectively. Therefore, we use the cross-entropy (in the case of binary classification problems, its binary version) to calculate the fitness function as follows:

$$\begin{aligned} fitness=\frac{1}{1+mean\_loss} \end{aligned}$$
(1)

where \(mean\_loss\) is the average loss of a participant-independent leave-one-participant-out cross-validation with m participants from the dataset:

$$\begin{aligned} mean\_loss=\frac{\sum _{i=1}^{m}binary\_crossentropy_i}{m} \end{aligned}$$
(2)

Within the SelfCGP and SelfCGA runs, m is equal to 5. After the final structures are found, they are tested on the whole dataset with m equal to 40.
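A compact sketch of the fitness computation from Eqs. (1) and (2) is given below; it assumes a build_model function that returns a freshly compiled Keras model with binary cross-entropy loss, and arrays X, y and participant_ids describing the samples. It is not the authors' original code.

import numpy as np

def mean_loss(build_model, X, y, participant_ids, epochs=3):
    # Eq. (2): participant-independent leave-one-participant-out cross-validation,
    # averaging the binary cross-entropy over the held-out participants.
    losses = []
    for p in np.unique(participant_ids):
        train, test = participant_ids != p, participant_ids == p
        model = build_model()                        # fresh, untrained model
        model.fit(X[train], y[train], epochs=epochs, verbose=0)
        result = model.evaluate(X[test], y[test], verbose=0)
        losses.append(result[0] if isinstance(result, list) else result)
    return float(np.mean(losses))

def fitness(build_model, X, y, participant_ids):
    # Eq. (1): map the mean loss into (0, 1]; higher is better.
    return 1.0 / (1.0 + mean_loss(build_model, X, y, participant_ids))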

4 RNN Structure Optimisation Using SelfCGP

As mentioned above, we apply SelfCGP for the RNN structure adaptation. SelfCGP has already been used successfully to design FNNs for solving various data analysis problems [27].

Keras provides several kinds of recurrent layers. The following layer types have been used: SimpleRNN (a fully-connected RNN where the output is fed back to the input); LSTM (a long short-term memory RNN); GRU (a gated recurrent unit RNN); Dense (a regular densely-connected NN layer). In addition, we use Dropout layers (which randomly set a fraction of the input units to 0 at each update during training, helping to prevent overfitting) [28]. It is very important to give SelfCGP the opportunity to design RNNs with Dropout layers. It is worth noting that ANNs cannot consist only of Dropout layers, so we include them in the functional set, but not in the terminal one. Next, we need to define several different Dropout coefficients, for instance, \(0.1, 0.2,\ldots ,0.9\).

4.1 Terminal and Functional Sets

The terminal set can contain many variations of layers with the different parameters described above. In this study, we have included the following layer types in the terminal set: SimpleRNN, LSTM, GRU and Dense. In addition, all the activation functions available in Keras have been included in the terminal set: softmax, elu, selu, softplus, softsign, relu, tanh, sigmoid, hard_sigmoid and linear. The range of the available number of neurons per layer has been set from 1 to 40.

Therefore, the terminal set contains 400 elements (all possible combinations of layers, activation functions and numbers of neurons).

The functional set should include possible operations on elements from the terminal set. We have defined two operations to be included in the functional set:

1. Sequential union ("+")
2. Sequential union with an additional Dropout layer ("+" Dropout with coefficients \(\{0.1, 0.2,\ldots ,0.9\}\)).

4.2 The Structure Encoding Description

GP uses binary trees for encoding all structures (unlike GA, which uses binary strings). In this study, we propose the following method of encoding ANN structures into trees. For instance, Fig. 2 shows a structure encoded as a tree.

Fig. 2. Example of encoding a neural network structure into a tree for GP

The code below represents the decoded structure from Fig. 2 in a form suitable for Keras. All leaves belong to the terminal set, while the internal nodes, each with two children, belong to the functional set.

figure a
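The original listing is not reproduced here. Purely as an illustration of the decoding (not the exact structure from Fig. 2), a tree with one plain sequential-union node and one "+ Dropout" node could decode to the following Keras model; the concrete layer sizes and activations are our assumptions.

from keras.models import Sequential
from keras.layers import LSTM, SimpleRNN, Dense, Dropout

timesteps, n_features = 3, 77   # repeated GSR feature vector (see Sect. 4.3)

# Hypothetical tree: ("+" Dropout 0.2)( "+"(LSTM(20, tanh), SimpleRNN(10, relu)), Dense(2, softmax) )
model = Sequential()
model.add(LSTM(20, activation="tanh", return_sequences=True,
               input_shape=(timesteps, n_features)))   # left leaf of the inner "+" node
model.add(SimpleRNN(10, activation="relu"))            # right leaf, joined sequentially
model.add(Dropout(0.2))                                # Dropout inserted by the "+ Dropout" node
model.add(Dense(2, activation="softmax"))              # output leaf
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])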

This kind of encoding allows various types of ANN structures with an unlimited number of layers to be encoded. We prevent the design of trees containing only Dense layers by requiring the presence of at least one recurrent layer.

4.3 Experiment Description

As a baseline, we take an FNN with one hidden layer containing 40 neurons, calculated by the following function:

$$\begin{aligned} N_{neurons}=\frac{n_{inputs}+n_{outputs}}{2}+1 \end{aligned}$$
(3)

When using GSR features, \(n_{inputs}=77\) and \(n_{outputs}=2\), so \(N_{neurons}=40\).
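As a reference point, a minimal Keras sketch of this baseline could look as follows; the hidden-layer activation and the training settings are our assumptions.

from keras.models import Sequential
from keras.layers import Dense

baseline = Sequential()
baseline.add(Dense(40, activation="sigmoid", input_dim=77))  # hidden layer, Eq. (3)
baseline.add(Dense(2, activation="softmax"))                 # two output classes (T0 vs T3)
baseline.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])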

The final best structure found by SelfCGP is tested on all 40 participants using the cross-validation described above. After that, we apply the Student's t-test to determine whether there are statistically significant differences among the results.

The problem we solve has no time dependence and, at first glance, it would appear that the use of RNNs will be ineffective. However, according to [11], RNNs can surpass the effectiveness of FNNs even for problems without time dependency. Since an RNN requires the presence of a time factor for learning, but the problem is static, in this paper we duplicate the input feature vector in time; thus, a constant signal is fed to the network. We repeat the input vector 3 and 5 times (\(SelfCGP_{3st}\) and \(SelfCGP_{5st}\)) in the tests, and we also compare training for 1 and 3 epochs. The main parameters of SelfCGP are: a population size of 100 individuals, 100 generations, a maximum tree depth of 3, and the full growth method at the initialisation step.
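The duplication of the static feature vector along the time axis can be sketched as follows; the array names are placeholders.

import numpy as np

X = np.random.rand(100, 77)   # placeholder: 100 samples, 77 static GSR features

def repeat_in_time(X, timesteps):
    # Insert a time axis and repeat the static vector, giving a constant
    # signal of shape (n_samples, timesteps, n_features) for the RNN.
    return np.repeat(X[:, np.newaxis, :], timesteps, axis=1)

X_3st = repeat_in_time(X, 3)   # input for SelfCGP_3st
X_5st = repeat_in_time(X, 5)   # input for SelfCGP_5st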

5 FNN Structure Optimisation Using SelfCGA

We have tested two different methods of FNN encoding. The first uses only Dense layers, while the second uses Dense and Dropout layers. In this case, SelfCGA is used to find the optimal number of neurons, the layer activation functions and the total number of layers. The binary substring that describes one part of the FNN (a Dense layer plus a Dropout layer) is divided into four fields. The first field represents the type of activation function of the Dense layer, the second the number of neurons in the Dense layer, the third the presence or absence of a Dropout layer after the Dense layer, and the fourth the fraction of input units to drop. The genotype representation of an FNN is shown in Figs. 3 and 4: after the second and fourth Dense layers no Dropout layers follow, and the next part of the network begins immediately. The architecture of each network is coded into the chromosome of SelfCGA, where each chromosome is composed of \((n - m)*4\) genes; n is the maximal number of layers (parts of the network: Dense + Dropout), which must be selected before running the program, and m is the number of inactive layers (parts of the network which contain 0 neurons in the Dense layer and are not expressed in the phenotype). If we use only one type of layer, we can remove the part which describes the Dense layer, and if we use more types of layers, we can add more parts to the string.

Fig. 3. Example of the NN architecture

The FNN weights are optimised by the Adam algorithm.
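A minimal sketch of how the four-field gene groups described above could be decoded into a Keras FNN is given below; the mapping from binary genes to the integer and real values in each field is omitted, and everything beyond the field layout itself is our assumption.

from keras.models import Sequential
from keras.layers import Dense, Dropout

ACTIVATIONS = ["softmax", "elu", "selu", "softplus", "softsign",
               "relu", "tanh", "sigmoid", "hard_sigmoid", "linear"]

def decode_fnn(gene_groups, n_inputs=77, n_outputs=2):
    # Each gene group is (activation_index, n_neurons, use_dropout, dropout_rate);
    # a group with n_neurons == 0 is inactive and is not expressed in the phenotype.
    model = Sequential()
    first = True
    for act_idx, n_neurons, use_dropout, rate in gene_groups:
        if n_neurons == 0:
            continue
        kwargs = {"input_dim": n_inputs} if first else {}
        model.add(Dense(n_neurons, activation=ACTIVATIONS[act_idx], **kwargs))
        first = False
        if use_dropout:
            model.add(Dropout(rate))
    model.add(Dense(n_outputs, activation="softmax"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Example: two active Dense layers, the first one followed by Dropout(0.3).
model = decode_fnn([(5, 30, 1, 0.3), (7, 12, 0, 0.0), (0, 0, 0, 0.0)])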

6 Feature Selection Using PSOPB

Within this research, the dimension of the feature space is reduced by Particle Swarm Optimisation with parasitic behaviour (PSOPB) [20]. The dataset is high-dimensional, which makes the classification problem a complicated task for some algorithms. For example, in an artificial neural network the number of neurons in the first layer is equal to the dimensionality of the feature space, and the number of weights that a training algorithm needs to find can be even greater. Thus, a large amount of resources, both in time and memory, is consumed. A possible solution is to reduce the feature space. Dimensionality reduction of the feature space has already been performed with classical methods such as PCA, so another approach was chosen to address this problem here. Each attribute is assigned a binary flag indicating whether it is selected. Thus, the string of input classification parameters becomes a vector of binary values for the optimisation algorithm (see Fig. 5) [18].

Fig. 4. Example of a genotype. The genotype defines the NN architecture from Fig. 3. In this case, the second and fourth Dropout layers are inactive.

Fig. 5. Example of converting an attribute vector into a "particle" vector

The fitness of a "particle" is the result of a classification algorithm (accuracy, F1-measure or another quality measure). Before the classifier is evaluated, the attributes whose flag is zero are removed (see Fig. 6).

Fig. 6. Example of fitness value calculation

An RNN is chosen as the fitness function of PSOPB. Its structure contains one hidden layer with 40 neurons of the LSTM type, n input neurons (depending on the number of selected features) and two output neurons.

figure b

The accuracy is chosen as the fitness value of the particle.
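The evaluation of a single particle can be sketched as follows; the array names, the number of time steps and the training settings are our assumptions.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

def particle_fitness(mask, X_train, y_train, X_test, y_test, timesteps=3):
    # mask is the binary particle vector over the GSR features; features with a
    # zero flag are removed, the remaining static vectors are repeated in time,
    # and the accuracy of the LSTM classifier is returned as the fitness.
    cols = np.flatnonzero(mask)

    def to_rnn(X):
        return np.repeat(X[:, cols][:, np.newaxis, :], timesteps, axis=1)

    model = Sequential()
    model.add(LSTM(40, input_shape=(timesteps, len(cols))))  # one hidden LSTM layer, 40 neurons
    model.add(Dense(2, activation="softmax"))                 # two output neurons
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(to_rnn(X_train), y_train, epochs=3, verbose=0)
    return model.evaluate(to_rnn(X_test), y_test, verbose=0)[1]   # accuracy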

7 Experiments

7.1 Feature Selection Results

Experiments are conducted using the GSR features. The dataset has two parts, right and left forearm, so the result of the PSOPB run is two binary vectors of feature selection coefficients.

PSOPB with 10 generations and 30 individuals in the population found one binary vector for each problem ("left" and "right" forearm). The strings below are the final results of PSOPB. They show which GSR features take part in the design of the classifier and which do not.

For the left forearm:

$$\begin{aligned}&X=[0,0,1,1,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,1,\\&\qquad \qquad 1,1,1,0,1,1,1,0,1,0,0,0,0,1,1,1,1,1,0,0,1,1,1,1,0,1,\\&\qquad \qquad \qquad \qquad 1,1,0,0,1,1,0,1,0,1,1,1,1,1,1,0,1,1,1,0,1,1,1,0,1] \end{aligned}$$

For the right forearm:

$$\begin{aligned}&X=[0,1,0,0,0,1,1,0,0,0,0,1,1,1,1,0,1,0,1,1,0,0,1,1,0,\\&\qquad \qquad 1,0,1,0,1,0,0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,0,0,\\&\qquad \qquad \qquad \qquad 0,0,1,0,1,1,1,0,0,1,1,1,0,0,0,1,1,0,0,1,1,1,1,0,1,1,0] \end{aligned}$$

Therefore, we have 44 active features for the “left” part and 39 active features for the “right” part.

Table 1. RNN accuracy on original data (77 features) and reduced by PSOPB (44 for Left Forearm and 39 for Right Forearm)

The results of the two configurations (original and reduced feature sets) are compared with each other using the Student's t-test. Experiments are conducted under different conditions to find the relation between the RNN structure or settings and the result of PSOPB (Table 1). The mean accuracy was obtained by conducting 40 runs, each with a different participant used for testing: 39 participants are used for training the RNN and one participant for testing.
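The statistical comparison can be sketched with SciPy; the accuracy arrays below are placeholders, and the use of the paired variant of the test is our assumption.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-participant accuracies from the 40 leave-one-participant-out
# runs; in practice these come from the evaluations summarised in Table 1.
acc_original = rng.uniform(0.6, 0.9, size=40)   # all 77 GSR features
acc_reduced = rng.uniform(0.6, 0.9, size=40)    # features selected by PSOPB

# Paired Student's t-test: the same held-out participant underlies each pair of runs.
t_stat, p_value = stats.ttest_rel(acc_original, acc_reduced)
print("t = %.3f, p = %.3f" % (t_stat, p_value))  # p >= 0.05: no significant difference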

7.2 RNN Optimisation Results

The following structures are found by SelfCGP. For the left forearm problem:

figure c

As can be seen, the structure contains 4 layers, with only GRU and SimpleRNN layer types.

For the right forearm problem:

figure d

In this case, there are only 3 SimpleRNN layers, with "tanh" as the activation function of each layer.

Table 2 shows the average participant-independent leave-one-participant-out cross-validation performance.

Table 2. SelfCGP without reduction

As can be seen from Table 2, the best average classification accuracy is achieved by training for 3 epochs on the structure obtained with the help of SelfCGP, with 3 time steps for the left forearm problem and 5 time steps for the right forearm problem (bold values). There are no statistically significant differences among the results according to the Student's t-test.

PSOPB reduced the dimensionality from 77 to 44 for the left forearm problem. The best structure found by SelfCGP is:

figure e
Table 3. SelfCGP with reduction

For the right forearm problem PSOPB allowed the dimensionality to be reduced from 77 to 39. SelfCGP found the following structure:

figure f

Table 3 shows the average participant-independent leave-one-participant-out cross-validation performance of the RNNs on the dataset with reduced dimensionality.

Here, too, there are no statistically significant differences among the results according to the Student's t-test.

7.3 FNN Optimisation Results

Below are the two example neural network topologies found by SelfCGA that showed the best results.

The structure found by SelfCGA for the GSR feature group on the right forearm data is:

figure g

The structure found by SelfCGA for the video feature group on the left forearm data is:

figure h

These architectures outperform the other models found. Still, their performance is not significantly better.

Tables 4 and 5 show the mean accuracy for the different neural network architectures obtained by cross-validation. Table 4 includes the results obtained using GSR features, and Table 5 includes the results for video features. The first encoding type means that SelfCGA used only Dense layers in the neural network architecture construction process, and the second encoding type means that SelfCGA could build neural networks using both Dense and Dropout layers.

Table 4. SelfCGA for GSR features
Table 5. SelfCGA for video features

8 Conclusion

In this paper, we have presented two EAs for the design of ANN classifiers. With the obtained results, we can state that SelfCGP allows the structure of RNNs to be optimised, but the results of the Student's t-test do not allow us to assert that the obtained improvements are statistically significant for these problems. Reducing the dimensionality of the feature space by PSOPB did not change the accuracy of the RNN in statistical terms, while the feature space was reduced by about half. This makes it less expensive to train big and complicated models, such as RNNs with structures optimised by SelfCGA or SelfCGP, on this dataset. Therefore, we can conclude that this method of reducing the feature space can be used when working with the SenseEmotion dataset.