
1 Introduction

Pain is an unpleasant sensation associated with tissue damage and unhealthy conditions. Accurate pain intensity estimation is a central problem in mental health care and clinical treatment. Traditionally, pain intensity is evaluated through expert observation and self-reported data, such as Observer Pain Intensity (OPI) and the Visual Analog Scale (VAS). However, for elderly people with dementia, who lack the ability to express pain intensity, evaluating pain becomes a fundamental issue in some medical diagnosis applications. In addition, manual pain estimation is time-consuming, inaccurate without professional training, and unavailable for real-time pain assessment. In such situations, accurate pain intensity evaluation plays an important role in medical treatment and health care. Hence, there is a large demand for automatic assessment systems for pain intensity estimation.

To meet this demand, a large body of research has focused on automatic pain intensity assessment. In an early stage, Lucey et al. [7] proposed detecting pain in video through facial action units. Later, several methods were proposed to evaluate pain intensity using multimodal data, such as thermal and depth data from cameras [3] and biomedical signals from the electrocardiogram (ECG) and the electromyogram (EMG) [17]. Recently, deep convolutional neural networks have achieved great success in face recognition, face detection, and related tasks. Consequently, deep neural networks are attracting widespread interest in facial expression recognition, and in pain intensity estimation in particular. Zhou et al. [20] adapted a recurrent convolutional neural network, originally used for object detection, to pain intensity estimation. Another method was developed by fine-tuning a deep face verification network with a regularized regression loss [15]. Pau et al. [11] proposed combining deep convolutional neural networks with long short-term memory networks for pain intensity estimation; their study suggested that extracting features from both the spatial and temporal domains yields good performance for frame-level pain intensity estimation. Tavakolian et al. [14] developed a method using binary coding of a discriminative statistical feature representation from a convolutional neural network; the Hamming distance is applied in their loss function and benefits the training of the whole framework.

The attention mechanism is one of the key properties of the human visual system, which can selectively concentrate on the important areas of an image or a scene for better understanding. Inspired by this, several attempts have been made to utilize attention mechanisms to improve performance in image captioning [19] and other applications. More recently, a concise attention module was proposed by Hu et al. [2] to model the relationship between different channels inside a neural network, using global average pooling to estimate channel-wise attention. Later, Woo et al. [18] proposed a new attention model called the Convolutional Block Attention Module (CBAM), which applies attention not only in the channel dimension but also in the spatial dimension. Extensive experimental results have shown that the CBAM module achieves strong performance in both image classification and object detection.

Until now, little research in the field of pain intensity estimation has attempted to utilize the attention mechanism. The purpose of this study is to propose and examine an end-to-end locally spatial attention learning architecture for pain intensity assessment. An overview of the pipeline of our approach is illustrated in Fig. 1. The approach applied in this work aims to exploit “where” is important in the spatial domain for pain intensity estimation. In addition, our architecture exploits the relationship between different frames in the video sequence. The proposed attention-based architecture is validated on a widely used benchmark database [8]. The paper is organized as follows: In the method section, we describe the details of our approach. In the experiments section, we investigate and analyze our proposed method on the database. Finally, we draw conclusions from our research.

Fig. 1. The illustration of the whole pipeline of our architecture. The input of the architecture is a five-dimensional tensor, comprising batch size, sequence length, channel, height, and width.

2 Proposed Method

Feeling pain typically produces a painful facial expression, which is a structural and geometric change in the face. The purpose of this study is to estimate pain intensity directly from the patient’s face in a recorded video or in a real-time surveillance system. The motivation of our method is that the regions of the face do not contribute equally to the painful expression. In order to capture the local detailed variation of the face, we propose a locally spatial attention learning architecture for pain assessment. Our structure is based on the VGG network [12] and previous face recognition work [13]. The input tensor is rescaled by the locally spatial attention model, which enhances the ability of our network to extract features from images. Previous behavioral and emotion studies suggest that the dynamic information of facial expressions is useful and efficient for emotional assessment [5]. In our architecture, an LSTM network is adopted to capture the dynamic information in the temporal domain. By combining the spatial variation and the temporal variation in the video sequence of the patient’s face, we are able to estimate pain intensity robustly. Our architecture consists of a CNN with a spatial attention model and an LSTM. Each frame in the video sequence is fed into the whole architecture: the CNN block extracts features from a single frame and passes the feature vector to the LSTM block, which estimates the pain intensity. The details of our architecture are shown in Fig. 2. In the next section, we elaborate on the details of our approach.

Fig. 2. In our study, we utilize a CNN and an LSTM network as the backbone of our architecture. We incorporate the locally spatial attention learning model into the convolutional network, as illustrated in this figure.

2.1 Locally Spatial Attention Learning

Figure 3 shows an overview of our locally spatial attention learning model. To extract static features from the patient’s face, we utilize a convolutional neural network based on the VGG11 network (configuration A) [12] for pain intensity estimation. The motivation of our study is that the regions of the face do not contribute equally to the painful expression. In designing the spatial attention model, our intention is to provide a way of detecting the important regions for pain intensity estimation. The proposed model, which consists of two layers, is inserted inside the third block of the convolutional neural network to capture detailed information from the previous layers.

The aim of locally spatial attention learning is to capture more detailed information from the face region. A major problem with previous attention models based on conventional convolutional kernels is that the generated attention map is translation invariant, so the local details of the image are hard to capture, whereas the geometry of the face image is symmetric and structured. Inspired by a previous face recognition study [13] and other research based on facial expressions [9], we propose the locally spatial attention learning model, which is incorporated into the convolutional neural network. Given the output tensor T from the previous building block of the convolutional neural network, the shape of the tensor is \(C \times H \times W\), where C is the number of channels and H and W denote the height and width, respectively. The spatial attention model takes the input tensor T and generates a 2D spatial attention map \(A_s\) of size \(H \times W\). To generate the spatial attention map, a locally convolutional layer is adopted as the first layer of the spatial attention model. For each location in the spatial dimensions of the input tensor, the first layer of our locally spatial attention learning model uses a different convolutional kernel to extract discriminative appearance features; \(P_{ij}\) denotes the distinct weights applied to the input tensor T at location \(T_{ij}\). Each \(T_{ij}\) has its own receptive field in the face image; hence, the deeper the spatial attention model is placed in the network, the larger its receptive field becomes. The kernel size of the locally convolutional layer is \(1 \times 1\), so the shape of the output tensor of the first layer is \(R \times H \times W\), with \(R=16\). The \(\tanh \) function is used as the activation function in the first layer. In the second layer of the spatial attention model, we apply a conventional \(1 \times 1\) convolutional kernel to generate the spatial attention map, which describes the informative parts of the face region. A sigmoid function is applied on top of the spatial attention model to constrain the attention weights to lie between zero and one. The shape of the attention map \(A_{s}\) is \(1 \times H \times W\). In short, the spatial attention model is computed as:

$$\begin{aligned} T_{res}=T \otimes A_{s} \end{aligned}$$
(1)

where the operation \(\otimes \) denotes element-wise multiplication. The attention map rescales the input tensor T; in our implementation, \(A_{s}\) is broadcast along the channel dimension of T. The rescaled tensor \(T_{res}\) is then fed to the subsequent convolutional layers of the convolutional neural network. We utilized the dropout strategy to avoid overfitting, setting the dropout ratio of the fully connected layer to 0.3. The ordering of the two layers inside the attention model is a key design choice for pain intensity estimation. We compare different spatial attention models in the experiments section, and the results demonstrate that placing the locally convolutional layer first is better than other arrangements.
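To make the structure concrete, the following PyTorch code is a minimal sketch of the locally spatial attention model under the assumptions stated above: a \(1 \times 1\) locally connected first layer with \(R=16\) and a \(\tanh \) activation, a conventional \(1 \times 1\) convolution with a sigmoid, and channel-wise broadcasting of \(A_s\) as in Eq. (1). The class and parameter names are ours, and the weight initialization is an illustrative choice, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class LocallyConnected1x1(nn.Module):
    """A 1x1 locally connected layer: a separate C -> R linear map P_ij
    for every spatial location (i, j), with no weight sharing."""
    def __init__(self, in_channels, out_channels, height, width):
        super().__init__()
        # One weight matrix per (h, w) location: shape (H, W, R, C).
        self.weight = nn.Parameter(
            0.01 * torch.randn(height, width, out_channels, in_channels))
        self.bias = nn.Parameter(torch.zeros(out_channels, height, width))

    def forward(self, x):                          # x: (B, C, H, W)
        # Apply the per-location matrices; output: (B, R, H, W).
        return torch.einsum('bchw,hwrc->brhw', x, self.weight) + self.bias

class LocallySpatialAttention(nn.Module):
    """Two-layer attention model: tanh(local 1x1) -> sigmoid(conv 1x1),
    giving an attention map A_s broadcast over channels (Eq. (1))."""
    def __init__(self, channels, height, width, hidden=16):   # hidden = R
        super().__init__()
        self.local = LocallyConnected1x1(channels, hidden, height, width)
        self.conv = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, x):                          # x = T: (B, C, H, W)
        a = torch.tanh(self.local(x))              # first layer: (B, R, H, W)
        a = torch.sigmoid(self.conv(a))            # A_s: (B, 1, H, W) in [0, 1]
        return x * a                               # T_res = T (x) A_s
```

The multiplication `x * a` relies on PyTorch broadcasting to replicate the single-channel map \(A_s\) across all C channels of T, which matches the broadcasting described in the text.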

2.2 Temporal Learning

The structure used in this study aims not only to extract features from single frames but also to model the dynamic temporal relationship among frames. For this purpose, an LSTM network, which learns long-term relationships in sequential data, is adopted as our temporal model. The output feature vector \(Z^{(t)}\) of the convolutional neural network is fed into the LSTM, so the input of the LSTM is 256-dimensional; \(Z^{(t)}\) denotes the feature vector of the tth frame in the video sequence. Our temporal model uses a single LSTM layer with a 128-dimensional hidden state and a dropout ratio of 0.3. The LSTM network has three gates that control the flow of information from the previous and current sequence data. The video sequence of the patient is divided into small overlapping groups to train our architecture. For instance, the first sequence \(s_1= \{f_1, f_2, \dots , f_{16}\}\) consists of 16 consecutive frames from one video, the next sequence is \(s_2= \{f_2, f_3, \dots , f_{17}\}\), and so on until the last sequence of the video; here, \(f_{i}\) denotes the ith frame. The frame labels of our training database follow the Prkachin and Solomon Pain Intensity metric (PSPI) [10]. The label of each sequence is the PSPI label of the last frame in that sequence, so the PSPI label is predicted by considering all 16 frames. Finally, two fully connected layers predict the PSPI value from the output of the LSTM network: the first layer is 64-dimensional and the second outputs a single value for the estimated pain intensity. Since pain intensity estimation is a regression task, we chose the mean squared error loss for training our architecture.
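The following PyTorch code is a hedged sketch of this temporal model, following the dimensions given above (256-dimensional input, one LSTM layer with a 128-dimensional hidden state, dropout 0.3, and 64-to-1 fully connected layers). The module names, the ReLU between the two fully connected layers, and the exact placement of the dropout are our assumptions.

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    """One LSTM layer over per-frame CNN features, followed by two fully
    connected layers that regress a single PSPI value for the sequence."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)
        self.drop = nn.Dropout(0.3)        # dropout ratio of the temporal model
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, z):                  # z: (batch, 16, 256) CNN features
        out, _ = self.lstm(z)              # (batch, 16, 128)
        h = self.drop(out[:, -1])          # sequence label = label of last frame
        return self.fc2(torch.relu(self.fc1(h)))   # (batch, 1) PSPI estimate

def make_windows(frames, length=16):
    """Overlapping windows as in the text: s_1 = f_1..f_16, s_2 = f_2..f_17, ..."""
    return [frames[i:i + length] for i in range(len(frames) - length + 1)]

criterion = nn.MSELoss()                   # regression loss used for training
```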

Fig. 3. The overview of the locally spatial attention learning model. As illustrated in Fig. 2, the spatial attention model is inserted inside the third block of the convolutional neural network. The first layer is a \(1 \times 1\) locally convolutional layer; the second layer is a conventional \(1 \times 1\) convolutional layer for dimension reduction. The input of the attention model is the orange block, which is the output of the first layer in the third block of the convolutional network. The spatial attention map is used for rescaling the input tensor. (Color figure online)

3 Experiment

In this section, we discuss the details of our experiments and the results of our proposed architecture.

3.1 Database and Preprocessing Details

We trained and validated our spatial attention architecture on the UNBC-McMaster shoulder pain database [8], which consists of 25 subjects and 200 videos. All participants in this database suffer from shoulder pain. During recording, they performed a series of active and passive range-of-motion tests with their limbs under professional guidance. The database provides three types of pain intensity labels: VAS, OPI, and PSPI. As reported in previous research [16], the PSPI labels in the database are not always reliable: for some subjects, both the VAS and OPI labels are nonzero, indicating that the subject feels pain, while the PSPI label is zero. It is known from the literature [6] that some action units related to the pain expression are not included in the original PSPI equation. To investigate the ability of our spatial attention model, we performed data cleaning to keep only reliable PSPI labels. Accordingly, we excluded one subject with no obvious pain (101-mg101, comprising 9 video sequences) and some video sequences without reliable PSPI labels (bg096t2afaff, ib109t2aeaff). In total, 24 subjects with 189 video sequences were used in our experiments. Following previous research [15], we preprocessed the PSPI labels by transforming the value range from 0–15 to 0–5. Data preprocessing is a major part of training deep neural networks; as demonstrated in Fig. 4, the OpenFace 2.0 toolkit [1] was utilized in our experiments for face alignment and cropping.

Fig. 4. Original image and the image processed by the OpenFace 2.0 toolkit. All images in the database are resized to \(224 \times 224\).

3.2 Implementation and Analysis

The convolutional neural network in our architecture was trained from scratch, and the whole structure was trained in an end-to-end manner. As mentioned before, we used only 24 subjects with 189 video sequences to train our network; we therefore evaluated it with 24-subject leave-one-subject-out cross validation. The learning rate was set to 0.0001, and the network was trained for 20 epochs using the Adam optimizer [4] with a weight decay of 0.001. The whole architecture was implemented in the PyTorch framework with a batch size of 32 on 4 GPUs.

The order of the locally convolutional layer and the conventional convolutional layer in the spatial attention model is an essential issue in our study. Here, we compare two different spatial attention models. As illustrated in Fig. 5, the left attention map is derived from the locally spatial attention learning model used in our architecture, while the right attention map is from a different spatial attention model in which the first layer is the conventional convolutional layer and the second layer is the locally convolutional layer. As can be seen from Fig. 5, our proposed locally spatial attention learning model indicates that the cheeks and the region between the eyebrows are important for pain intensity estimation, whereas the right attention map assigns nearly the same importance to the whole face region, indicating that this variant cannot detect significant regions for pain intensity estimation. The comparison of the two attention maps shows that our proposed model captures the important regions of the face more effectively than the variant with the reversed layer order. We also compare the performance of our method with a general architecture and with previous research; here, the general architecture contains only the CNN, without the locally spatial attention learning model and the LSTM network. The mean absolute error (MAE), mean squared error (MSE), and Pearson correlation coefficient (PCC) are reported in Table 1. As listed in Table 1, the locally spatial attention model achieves an improvement in both MAE and MSE compared with the general architecture. The comparison between our method and previous research shows that the performance of our architecture is not yet the best; it should, however, be noted that we use a smaller training database for our neural network. Accurate and reliable PSPI labels are important for training deep neural networks, and pain intensity estimates should accurately reflect the painful expression and feeling, which is crucial for some medical diagnosis applications.
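The training setup described above can be summarized in the following hedged PyTorch sketch. Here `PainAttentionNet` and `train_loader` are hypothetical placeholders for the full CNN-attention-LSTM model and a data loader for one leave-one-subject-out fold, which are assumed to be defined elsewhere.

```python
import torch
import torch.nn as nn

# Hypothetical placeholders for the full model and one cross-validation fold.
model = PainAttentionNet()            # CNN + locally spatial attention + LSTM
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
criterion = nn.MSELoss()

for epoch in range(20):               # trained for 20 epochs, as in the text
    for clips, labels in train_loader:       # clips: (32, 16, 3, 224, 224)
        optimizer.zero_grad()
        preds = model(clips).squeeze(-1)     # predicted PSPI per sequence
        loss = criterion(preds, labels)
        loss.backward()
        optimizer.step()
```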

Table 1. Comparison of different methods
Fig. 5. Two attention maps for the example image displayed in Fig. 4. The left attention map is derived from our proposed attention model; the right is from the comparison model, which exchanges the order of the \(1 \times 1\) locally convolutional layer and the \(1 \times 1\) conventional convolutional layer.

4 Conclusion

Automatic pain intensity estimation is a key technique in some medical applications. In this paper, we propose a locally spatial attention learning method to find the important regions of the face and to enhance the performance of the whole architecture. The results indicate that our proposed method can capture the important areas of the face for pain intensity estimation. Our study expands prior work in this research area and provides a new method for future study of painful expression analysis. We conducted our experiments on the shoulder pain database. At present, the results show that the performance of our architecture is better than that of the general structure without locally spatial attention learning, but is not yet outstanding compared with state-of-the-art methods. In future work, we will improve our spatial attention architecture for better results through effective network engineering.