A cascaded spatiotemporal attention network for dynamic facial expression recognition

Applied Intelligence

Abstract

Dynamic facial expression recognition (DFER) is a promising research area because it addresses how facial expressions change over time, but it remains difficult to capture both the facial appearance in each frame and the temporal dynamics across an image sequence. In this paper, a cascaded spatiotemporal attention network (CSTAN) is proposed to learn and integrate the spatial and temporal emotional information that arises as a facial expression changes. Three types of attention modules are embedded into the cascaded network so that it extracts more informative spatiotemporal features for the DFER task along different dimensions: a channel attention module helps the network focus on the spatial feature maps that are meaningful for the task, a spatial attention module focuses on the regions of interest within those feature maps, and a temporal attention module explores the dynamic temporal information as an expression changes. Experimental results on three public facial expression recognition datasets demonstrate that the CSTAN performs well and extracts representative spatiotemporal features. Visualization results further show that the CSTAN can locate regions of interest and contributing timesteps, which illustrates the effectiveness of the multidimensional attention modules.
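
To make the roles of the three attention types concrete, the following is a minimal NumPy sketch, not the CSTAN architecture itself. The abstract does not give layer definitions, so everything here is an assumption made for illustration: the function names, the random weights, the squeeze-and-excitation-style channel gate, the scalar-weighted spatial gate (standing in for a learned convolution), the dot-product temporal scoring, and the channel, then spatial, then temporal ordering of the cascade.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(feat, w1, w2):
    """feat: (C, H, W). Scale each feature map by a gate in (0, 1)."""
    squeezed = feat.mean(axis=(1, 2))          # (C,) global average pool
    hidden = np.maximum(w1 @ squeezed, 0.0)    # (C_r,) ReLU bottleneck
    gate = sigmoid(w2 @ hidden)                # (C,) one weight per channel
    return feat * gate[:, None, None], gate


def spatial_attention(feat, w_avg, w_max):
    """feat: (C, H, W). Scale each location by a gate in (0, 1)."""
    avg_map = feat.mean(axis=0)                # (H, W) pooled over channels
    max_map = feat.max(axis=0)                 # (H, W)
    # Assumption: a scalar-weighted mix replaces a learned convolution.
    gate = sigmoid(w_avg * avg_map + w_max * max_map)
    return feat * gate[None, :, :], gate


def temporal_attention(frame_vecs, query):
    """frame_vecs: (T, D). Weight timesteps with a softmax over T."""
    scores = frame_vecs @ query                # (T,) one score per frame
    weights = softmax(scores)                  # non-negative, sums to 1
    return weights @ frame_vecs, weights       # (D,) clip descriptor


def cascade(clip, w1, w2, w_avg, w_max, query):
    """clip: (T, C, H, W). Assumed order: channel, spatial, temporal."""
    frame_vecs = []
    for frame in clip:
        x, _ = channel_attention(frame, w1, w2)
        x, _ = spatial_attention(x, w_avg, w_max)
        frame_vecs.append(x.mean(axis=(1, 2)))  # pool each frame to (C,)
    return temporal_attention(np.stack(frame_vecs), query)


# Hypothetical toy input: 6 frames of 8 feature maps, each 5 x 5.
rng = np.random.default_rng(0)
T, C, H, W = 6, 8, 5, 5
clip = rng.normal(size=(T, C, H, W))
w1 = rng.normal(size=(C // 2, C))
w2 = rng.normal(size=(C, C // 2))
query = rng.normal(size=C)

clip_vec, frame_weights = cascade(clip, w1, w2, 1.0, 1.0, query)
print(clip_vec.shape, frame_weights.round(3))
```

In this sketch, `frame_weights` plays the part of the "contributing timesteps" mentioned in the abstract, and the spatial gate plays the part of the regions of interest; in a trained network both would be learned rather than drawn at random.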

Acknowledgments

This study was supported by the National Natural Science Foundation of China under grant 62076103, the Key Realm R&D Program of Guangzhou under grant 202007030005, and the Guangdong Natural Science Foundation under grant 2019A1515011375.

Author information

Corresponding author

Correspondence to Jiahui Pan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ye, Y., Pan, Y., Liang, Y. et al. A cascaded spatiotemporal attention network for dynamic facial expression recognition. Appl Intell 53, 5402–5415 (2023). https://doi.org/10.1007/s10489-022-03781-0

  • DOI: https://doi.org/10.1007/s10489-022-03781-0
