
1 Introduction

Pain is an unpleasant sensation associated with tissue damage and unhealthy conditions. Accurate pain intensity estimation is a central problem in mental health care and clinical treatment. Traditionally, pain intensity is evaluated through expert observation and self-reported data, such as Observer Pain Intensity (OPI) and the Visual Analog Scale (VAS). However, for elderly people with dementia, who lack the ability to express pain intensity, evaluating pain becomes a fundamental issue in some medical diagnosis applications. In addition, manual pain estimation is time-consuming, inaccurate without professional training, and unavailable for real-time pain assessment. In such situations, accurate pain intensity evaluation plays an important role in medical treatment and health care. Hence, there is a large demand for automatic assessment systems for pain intensity estimation.

To meet this demand, a large body of research has focused on automatic pain intensity assessment. In an early stage, Lucey et al. [7] proposed detecting pain in video through facial action units. Later, several methods were proposed to evaluate pain intensity using multimodal data, such as thermal and depth data from cameras [3] and biomedical signals from the electrocardiogram (ECG) and the electromyogram (EMG) [17]. Recently, deep convolutional neural networks have achieved great success in face recognition, face detection, and related tasks. Consequently, deep neural networks are attracting widespread interest in facial expression recognition, and in pain intensity estimation in particular. Zhou et al. [20] adapted a recurrent convolutional neural network, originally used for object detection, to pain intensity estimation. Another method was developed by fine-tuning a deep face verification network with a regularized regression loss [15]. Pau et al. [11] proposed combining deep convolutional neural networks with long short-term memory networks for pain intensity estimation; their study suggested that extracting features from both the spatial and temporal domains yields good performance for frame-level pain intensity estimation. Tavakolian et al. [14] developed a method using binary coding of a discriminative statistical feature representation from a convolutional neural network; the Hamming distance is applied in their loss function and benefits the training of the whole framework.

The attention mechanism is one of the key properties of the human visual system, which can selectively concentrate on the important areas of an image or a scene for better understanding. Inspired by this, several attempts have been made to utilize attention mechanisms to improve performance in image captioning [19] and other applications. More recently, a concise attention module was proposed by Hu et al. [2] to model the relationship between different channels inside a neural network, using global average pooling to estimate channel-wise attention. Later, Woo et al. [18] proposed a new attention model called the Convolutional Block Attention Module (CBAM), which applies attention not only in the channel dimension but also in the spatial dimension. Extensive experimental results have shown that the CBAM module achieves strong performance in both image classification and object detection.

Until now, little research in the field of pain intensity estimation has attempted to utilize the attention mechanism. The purpose of this study is to propose and examine an end-to-end locally spatial attention learning architecture for pain intensity assessment. An overview of the pipeline of our approach is illustrated in Fig. 1. The approach applied in this work aims to exploit “where” is important in the spatial domain for pain intensity estimation. In addition, our architecture exploits the relationship between different frames in the video sequence. The proposed attention-based architecture is validated on a widely used benchmark database [8]. The paper is organized as follows: In the method section, we describe the details of our approach. In the experiments section, we investigate and analyze our proposed method on the database. Finally, we draw conclusions from our research.

Fig. 1. The illustration of the whole pipeline of our architecture. The input of the architecture is a five-dimensional tensor, comprising batch size, sequence length, channel, height, and width.

2 Proposed Method

Feeling pain typically produces a painful facial expression, which is a structural and geometric change in the face. The purpose of this study is to estimate pain intensity directly from the patient’s face in a recorded video or in a real-time surveillance system. The motivation of our method is that the regions of the face do not contribute equally to the painful expression. In order to capture the local detailed variation of the face, we propose a locally spatial attention learning architecture for pain assessment. Our structure is based on the VGG network [12] and previous face recognition work [13]. The input tensor is rescaled by the locally spatial attention model, which enhances the ability of our network to extract features from images. Previous behavioral and emotion studies suggest that the dynamic information of facial expressions is useful and efficient for emotional assessment [5]. In our architecture, an LSTM network is adopted to capture the dynamic information in the temporal domain. By combining the spatial variation and the temporal variation in the video sequence of the patient’s face, we are able to estimate pain intensity robustly. Our architecture consists of a CNN with a spatial attention model and an LSTM. Each frame in the video sequence is fed into the whole architecture: the CNN block extracts features from a single frame and passes the feature vector to the LSTM block, which estimates the pain intensity. The details of our architecture are shown in Fig. 2. In the next section, we elaborate on the details of our approach.

Fig. 2. In our study, we utilize a CNN and an LSTM network as the backbone of our architecture. We incorporate the locally spatial attention learning model into the convolutional network, as illustrated in this figure.

2.1 Locally Spatial Attention Learning

Figure 3 shows an overview of our locally spatial attention learning model. To extract static features from the patient’s face, we utilize a convolutional neural network based on the VGG11 network (configuration A) [12] for pain intensity estimation. The motivation of our study is that the regions of the face do not contribute equally to the painful expression. In designing the spatial attention model, our intention is to provide a way of detecting the important regions for pain intensity estimation. The proposed model, which consists of two layers, is inserted inside the third block of the convolutional neural network to capture detailed information from the previous layers.

The aim of locally spatial attention learning is to capture more detailed information from the face region. A major problem with previous attention models based on conventional convolutional kernels is that the generated attention map is translation invariant, so the local details of the image are hard to capture, whereas the geometry of the face image is symmetric and structured. Inspired by a previous face recognition study [13] and other research based on facial expressions [9], we propose the locally spatial attention learning model, which is incorporated into the convolutional neural network. Given the output tensor T from the previous building block of the convolutional neural network, the shape of the tensor is \(C \times H \times W\), where C is the number of channels and H and W denote the height and width, respectively. The spatial attention model takes the input tensor T and generates a 2D spatial attention map \(A_s\) of size \(H \times W\). To generate the spatial attention map, a locally convolutional layer is adopted as the first layer of the spatial attention model. For each location in the spatial dimensions of the input tensor, the first layer of our locally spatial attention learning model uses a different convolutional kernel to extract discriminative appearance features; \(P_{ij}\) denotes the distinct weights applied to the input tensor T at location \(T_{ij}\). Each \(T_{ij}\) has its own receptive field in the face image; hence, the deeper the spatial attention model is placed in the network, the larger its receptive field becomes. The kernel size of the locally convolutional layer is \(1 \times 1\), so the shape of the output tensor of the first layer is \(R \times H \times W\), with \(R=16\). The \(\tanh \) function is used as the activation function in the first layer. In the second layer of the spatial attention model, we apply a conventional \(1 \times 1\) convolutional kernel to generate the spatial attention map, which describes the informative parts of the face region. A sigmoid function is applied on top of the spatial attention model to constrain the attention weights to lie between zero and one. The shape of the attention map \(A_{s}\) is \(1 \times H \times W\). In short, the spatial attention model is computed as:

$$\begin{aligned} T_{res}=T \otimes A_{s} \end{aligned}$$
(1)

where the operation \(\otimes \) denotes element-wise multiplication. The attention map rescales the input tensor T; in our implementation, \(A_{s}\) is broadcast along the channel dimension of T. The rescaled tensor \(T_{res}\) is then fed to the subsequent convolutional layers of the convolutional neural network. We utilized the dropout strategy to avoid overfitting, setting the dropout ratio of the fully connected layer to 0.3. The ordering of the two layers inside the attention model is a key design choice for pain intensity estimation. We compare different spatial attention models in the experiments section, and the results demonstrate that placing the locally convolutional layer first is better than other arrangements.
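To make the structure concrete, the following PyTorch code is a minimal sketch of the locally spatial attention model under the assumptions stated above: a \(1 \times 1\) locally connected first layer with \(R=16\) and a \(\tanh \) activation, a conventional \(1 \times 1\) convolution with a sigmoid, and channel-wise broadcasting of \(A_s\) as in Eq. (1). The class and parameter names are ours, and the weight initialization is an illustrative choice, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class LocallyConnected1x1(nn.Module):
    """A 1x1 locally connected layer: a separate C -> R linear map P_ij
    for every spatial location (i, j), with no weight sharing."""
    def __init__(self, in_channels, out_channels, height, width):
        super().__init__()
        # One weight matrix per (h, w) location: shape (H, W, R, C).
        self.weight = nn.Parameter(
            0.01 * torch.randn(height, width, out_channels, in_channels))
        self.bias = nn.Parameter(torch.zeros(out_channels, height, width))

    def forward(self, x):                          # x: (B, C, H, W)
        # Apply the per-location matrices; output: (B, R, H, W).
        return torch.einsum('bchw,hwrc->brhw', x, self.weight) + self.bias

class LocallySpatialAttention(nn.Module):
    """Two-layer attention model: tanh(local 1x1) -> sigmoid(conv 1x1),
    giving an attention map A_s broadcast over channels (Eq. (1))."""
    def __init__(self, channels, height, width, hidden=16):   # hidden = R
        super().__init__()
        self.local = LocallyConnected1x1(channels, hidden, height, width)
        self.conv = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, x):                          # x = T: (B, C, H, W)
        a = torch.tanh(self.local(x))              # first layer: (B, R, H, W)
        a = torch.sigmoid(self.conv(a))            # A_s: (B, 1, H, W) in [0, 1]
        return x * a                               # T_res = T (x) A_s
```

The multiplication `x * a` relies on PyTorch broadcasting to replicate the single-channel map \(A_s\) across all C channels of T, which matches the broadcasting described in the text.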

2.2 Temporal Learning

The structure used in this study aims not only to extract features from single frames but also to model the dynamic temporal relationship among frames. For this purpose, an LSTM network, which learns long-term relationships in sequential data, is adopted as our temporal model. The output feature vector \(Z^{(t)}\) of the convolutional neural network is fed into the LSTM, so the input of the LSTM is 256-dimensional; \(Z^{(t)}\) denotes the feature vector of the tth frame in the video sequence. Our temporal model uses a single LSTM layer with a 128-dimensional hidden state and a dropout ratio of 0.3. The LSTM network has three gates that control the flow of information from the previous and current sequence data. The video sequence of the patient is divided into small overlapping groups to train our architecture. For instance, the first sequence \(s_1= \{f_1, f_2, \dots , f_{16}\}\) consists of 16 consecutive frames from one video, the next sequence is \(s_2= \{f_2, f_3, \dots , f_{17}\}\), and so on until the last sequence of the video; here, \(f_{i}\) denotes the ith frame. The frame labels of our training database follow the Prkachin and Solomon Pain Intensity metric (PSPI) [10]. The label of each sequence is the PSPI label of the last frame in that sequence, so the PSPI label is predicted by considering all 16 frames. Finally, two fully connected layers predict the PSPI value from the output of the LSTM network: the first layer is 64-dimensional and the second outputs a single value for the estimated pain intensity. Since pain intensity estimation is a regression task, we chose the mean squared error loss for training our architecture.
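The following PyTorch code is a hedged sketch of this temporal model, following the dimensions given above (256-dimensional input, one LSTM layer with a 128-dimensional hidden state, dropout 0.3, and 64-to-1 fully connected layers). The module names, the ReLU between the two fully connected layers, and the exact placement of the dropout are our assumptions.

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    """One LSTM layer over per-frame CNN features, followed by two fully
    connected layers that regress a single PSPI value for the sequence."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)
        self.drop = nn.Dropout(0.3)        # dropout ratio of the temporal model
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, z):                  # z: (batch, 16, 256) CNN features
        out, _ = self.lstm(z)              # (batch, 16, 128)
        h = self.drop(out[:, -1])          # sequence label = label of last frame
        return self.fc2(torch.relu(self.fc1(h)))   # (batch, 1) PSPI estimate

def make_windows(frames, length=16):
    """Overlapping windows as in the text: s_1 = f_1..f_16, s_2 = f_2..f_17, ..."""
    return [frames[i:i + length] for i in range(len(frames) - length + 1)]

criterion = nn.MSELoss()                   # regression loss used for training
```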

Fig. 3. The overview of the locally spatial attention learning model. As illustrated in Fig. 2, the spatial attention model is inserted inside the third block of the convolutional neural network. The first layer is a \(1 \times 1\) locally convolutional layer; the second layer is a conventional \(1 \times 1\) convolutional layer for dimension reduction. The input of the attention model is the orange block, which is the output of the first layer in the third block of the convolutional network. The spatial attention map is used for rescaling the input tensor. (Color figure online)

3 Experiment

In this section, we discuss the details of our experiments and the results of our proposed architecture.

3.1 Database and Preprocessing Details

We trained and validated our spatial attention architecture on the UNBC-McMaster shoulder pain database [8], which consists of 25 subjects and 200 videos. All participants in this database suffer from shoulder pain. During recording, they performed a series of active and passive range-of-motion tests with their limbs under professional guidance. The database provides three types of pain intensity labels: VAS, OPI, and PSPI. As reported in previous research [16], the PSPI labels in the database are not always reliable: for some subjects, both the VAS and OPI labels are nonzero, indicating that the subject feels pain, while the PSPI label is zero. It is known from the literature [6] that some action units related to the pain expression are not included in the original PSPI equation. To investigate the ability of our spatial attention model, we performed data cleaning to keep only reliable PSPI labels. Accordingly, we excluded one subject with no obvious pain (101-mg101, comprising 9 video sequences) and some video sequences without reliable PSPI labels (bg096t2afaff, ib109t2aeaff). In total, 24 subjects with 189 video sequences were used in our experiments. Following previous research [15], we preprocessed the PSPI labels by transforming the value range from 0–15 to 0–5. Data preprocessing is a major part of training deep neural networks; as demonstrated in Fig. 4, the OpenFace 2.0 toolkit [1] was utilized in our experiments for face alignment and cropping.

Fig. 4. Original image and the image processed by the OpenFace 2.0 toolkit. All images in the database are resized to \(224 \times 224\).

3.2 Implementation and Analysis

The convolutional neural network in our architecture was trained from scratch, and the whole structure was trained in an end-to-end manner. As mentioned before, we used only 24 subjects with 189 video sequences to train our network; we therefore evaluated it with 24-subject leave-one-subject-out cross validation. The learning rate was set to 0.0001, and the network was trained for 20 epochs using the Adam optimizer [4] with a weight decay of 0.001. The whole architecture was implemented in the PyTorch framework with a batch size of 32 on 4 GPUs.

The order of the locally convolutional layer and the conventional convolutional layer in the spatial attention model is an essential issue in our study. Here, we compare two different spatial attention models. As illustrated in Fig. 5, the left attention map is derived from the locally spatial attention learning model used in our architecture, while the right attention map is from a different spatial attention model in which the first layer is the conventional convolutional layer and the second layer is the locally convolutional layer. As can be seen from Fig. 5, our proposed locally spatial attention learning model indicates that the cheeks and the region between the eyebrows are important for pain intensity estimation, whereas the right attention map assigns nearly the same importance to the whole face region, indicating that this variant cannot detect significant regions for pain intensity estimation. The comparison of the two attention maps shows that our proposed model captures the important regions of the face more effectively than the variant with the reversed layer order. We also compare the performance of our method with a general architecture and with previous research; here, the general architecture contains only the CNN, without the locally spatial attention learning model and the LSTM network. The mean absolute error (MAE), mean squared error (MSE), and Pearson correlation coefficient (PCC) are reported in Table 1. As listed in Table 1, the locally spatial attention model achieves an improvement in both MAE and MSE compared with the general architecture. The comparison between our method and previous research shows that the performance of our architecture is not yet the best; it should, however, be noted that we use a smaller training database for our neural network. Accurate and reliable PSPI labels are important for training deep neural networks, and pain intensity estimates should accurately reflect the painful expression and feeling, which is crucial for some medical diagnosis applications.
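The training setup described above can be summarized in the following hedged PyTorch sketch. Here `PainAttentionNet` and `train_loader` are hypothetical placeholders for the full CNN-attention-LSTM model and a data loader for one leave-one-subject-out fold, which are assumed to be defined elsewhere.

```python
import torch
import torch.nn as nn

# Hypothetical placeholders for the full model and one cross-validation fold.
model = PainAttentionNet()            # CNN + locally spatial attention + LSTM
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
criterion = nn.MSELoss()

for epoch in range(20):               # trained for 20 epochs, as in the text
    for clips, labels in train_loader:       # clips: (32, 16, 3, 224, 224)
        optimizer.zero_grad()
        preds = model(clips).squeeze(-1)     # predicted PSPI per sequence
        loss = criterion(preds, labels)
        loss.backward()
        optimizer.step()
```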

Table 1. Comparison of different methods
Fig. 5. Two attention maps for the example image displayed in Fig. 4. The left attention map is derived from our proposed attention model; the right is from the comparison model, which exchanges the order of the \(1 \times 1\) locally convolutional layer and the \(1 \times 1\) conventional convolutional layer.

4 Conclusion

Automatic pain intensity estimation is a key technique in some medical applications. In this paper, we propose a locally spatial attention learning method to find the important regions of the face and to enhance the performance of the whole architecture. The results indicate that our proposed method can capture the important areas of the face for pain intensity estimation. Our study expands prior work in this research area and provides a new method for future study of painful expression analysis. We conducted our experiments on the shoulder pain database. At present, the results show that the performance of our architecture is better than that of the general structure without locally spatial attention learning, but is not yet outstanding compared with state-of-the-art methods. In future work, we will improve our spatial attention architecture for better results through effective network engineering.