Abstract
Dynamic facial expression recognition (DFER) is a promising research area because it models how facial expressions change over time, but it is difficult to capture both the facial appearance in each image of a sequence and the temporal dynamics across the sequence. In this paper, a cascaded spatiotemporal attention network (CSTAN) is proposed to learn and integrate the spatial and temporal emotional information that appears as a facial expression changes. Three types of attention modules are embedded into the cascaded network so that it extracts more informative spatiotemporal features for the DFER task along different dimensions. A channel attention module helps the network focus on the spatial feature maps that are meaningful for the DFER task, a spatial attention module focuses on the regions of interest within those feature maps, and a temporal attention module explores the dynamic temporal information as an expression changes. Experimental results on three public facial expression recognition datasets demonstrate that the CSTAN performs well and extracts representative spatiotemporal features. In addition, the visualization results show that the CSTAN can locate regions of interest and contributing timesteps, which illustrates the effectiveness of the multidimensional attention modules.
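
The three kinds of attention named in the abstract can be illustrated with a minimal NumPy sketch. This is not the CSTAN implementation: the abstract does not give layer definitions, so the pooling choices, the shared two-layer MLP, the scalar mixing weights, the scoring vector `v`, and all function names here are assumptions chosen only to show what "channel", "spatial", and "temporal" weighting mean on a clip of per-frame feature maps with shape (T, C, H, W). Weights are random rather than learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x, w1, w2):
    """Reweight each channel of x (T, C, H, W) with a gate in (0, 1).

    Assumption: gates come from average- and max-pooled channel
    descriptors passed through a shared two-layer MLP (w1, w2).
    """
    avg = x.mean(axis=(2, 3))                     # (T, C)
    mx = x.max(axis=(2, 3))                       # (T, C)
    mlp = lambda d: np.maximum(d @ w1, 0.0) @ w2  # ReLU between the layers
    gate = sigmoid(mlp(avg) + mlp(mx))            # (T, C)
    return x * gate[:, :, None, None]

def spatial_attention(x, w_avg=0.5, w_max=0.5):
    """Reweight each spatial location of x (T, C, H, W).

    Assumption: a scalar mix of channel-wise mean and max maps stands in
    for the convolution a real module would typically learn.
    """
    avg = x.mean(axis=1, keepdims=True)           # (T, 1, H, W)
    mx = x.max(axis=1, keepdims=True)             # (T, 1, H, W)
    gate = sigmoid(w_avg * avg + w_max * mx)
    return x * gate

def temporal_attention(frame_feats, v):
    """Pool per-frame vectors (T, C) into one clip vector (C,).

    Assumption: each timestep is scored by a dot product with v, and
    the scores are normalised with a softmax over time.
    """
    alpha = softmax(frame_feats @ v)              # (T,), sums to 1
    return alpha @ frame_feats, alpha

def cascade(x, w1, w2, v):
    """Apply channel, then spatial, then temporal attention in sequence."""
    x = channel_attention(x, w1, w2)
    x = spatial_attention(x)
    frame_feats = x.mean(axis=(2, 3))             # (T, C)
    return temporal_attention(frame_feats, v)

# Hypothetical clip: 8 frames, 16 channels, 7x7 feature maps.
rng = np.random.default_rng(0)
T, C, H, W = 8, 16, 7, 7
clip = rng.standard_normal((T, C, H, W))
w1 = 0.1 * rng.standard_normal((C, C // 4))
w2 = 0.1 * rng.standard_normal((C // 4, C))
v = 0.1 * rng.standard_normal(C)

clip_feat, alpha = cascade(clip, w1, w2, v)
print(clip_feat.shape, alpha.shape)
```

In this sketch, the vector `alpha` plays the role of the per-timestep weights that a visualization of contributing timesteps would inspect, and the spatial gate plays the role of a region-of-interest map.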

Acknowledgments
This study was supported by the National Natural Science Foundation of China under grant 62076103, the Key Realm R&D Program of Guangzhou under grant 202007030005, and the Guangdong Natural Science Foundation under grant 2019A1515011375.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Ye, Y., Pan, Y., Liang, Y. et al. A cascaded spatiotemporal attention network for dynamic facial expression recognition. Appl Intell 53, 5402–5415 (2023). https://doi.org/10.1007/s10489-022-03781-0
DOI: https://doi.org/10.1007/s10489-022-03781-0