A cascaded spatiotemporal attention network for dynamic facial expression recognition

Ye, Yaoguang; Pan, Yongqi; Liang, Yan; Pan, Jiahui

doi:10.1007/s10489-022-03781-0

Published: 23 June 2022

Applied Intelligence, Volume 53, pages 5402–5415 (2023)

Yaoguang Ye¹, Yongqi Pan², Yan Liang¹ & Jiahui Pan¹,³ (ORCID: orcid.org/0000-0002-7576-6743)

Abstract

Dynamic facial expression recognition (DFER) is a promising research area because it concerns how facial expressions change over time, but it is difficult to capture both the facial appearance in each image of a sequence and the temporal dynamics across the sequence. In this paper, a cascaded spatiotemporal attention network (CSTAN) is proposed to learn and integrate spatial and temporal emotional information as a facial expression changes. Three types of attention modules are embedded into the cascaded network so that it can extract more informative spatiotemporal features for the DFER task along different dimensions. A channel attention module helps the network focus on the spatial feature maps that are meaningful for the DFER task, a spatial attention module focuses on the regions of interest within those feature maps, and a temporal attention module explores the dynamic temporal information as an expression changes. Experimental results on three public facial expression recognition datasets demonstrate the strong performance of the CSTAN and its ability to extract representative spatiotemporal features. In addition, the visualization results show that the CSTAN can locate regions of interest and the timesteps that contribute most, which illustrates the effectiveness of the multidimensional attention modules.
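
To make the three attention dimensions named in the abstract concrete, the following is a minimal NumPy sketch of one way channel, spatial, and temporal attention could be cascaded over a clip of per-frame feature maps. It is not the CSTAN architecture, whose details are not given here. Every function name, shape, and weight below (`channel_attention`, `spatial_attention`, `temporal_attention`, `cascade`, `w1`, `w2`, `a`, `b`, `v`) is a hypothetical choice for illustration, the weights are random rather than learned, and the spatial map uses a simple weighted sum of pooled maps where a real model would typically use a learned convolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (C, H, W). Returns one weight per channel, shape (C,)."""
    avg = feat.mean(axis=(1, 2))                    # global average pool -> (C,)
    mx = feat.max(axis=(1, 2))                      # global max pool -> (C,)
    mlp = lambda x: w2 @ np.maximum(w1 @ x, 0.0)    # shared two-layer MLP with ReLU
    return sigmoid(mlp(avg) + mlp(mx))

def spatial_attention(feat, a, b):
    """feat: (C, H, W). Returns an (H, W) map of location weights."""
    avg = feat.mean(axis=0)                         # pool across channels -> (H, W)
    mx = feat.max(axis=0)
    return sigmoid(a * avg + b * mx)                # assumption: scalar mix, no convolution

def temporal_attention(frame_feats, v):
    """frame_feats: (T, D). Returns softmax weights over the T timesteps."""
    scores = frame_feats @ v
    scores = scores - scores.max()                  # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def cascade(clip, w1, w2, a, b, v):
    """clip: (T, C, H, W). Returns (clip feature (C,), temporal weights (T,))."""
    per_frame = []
    for feat in clip:
        feat = feat * channel_attention(feat, w1, w2)[:, None, None]  # reweight channels
        feat = feat * spatial_attention(feat, a, b)[None, :, :]       # reweight locations
        per_frame.append(feat.mean(axis=(1, 2)))                      # pool to (C,)
    per_frame = np.stack(per_frame)                                   # (T, C)
    alpha = temporal_attention(per_frame, v)                          # weight timesteps
    return alpha @ per_frame, alpha

# Toy input: 6 frames, 8 channels, 5x5 feature maps, hidden width 4 (all arbitrary).
rng = np.random.default_rng(0)
T, C, H, W, R = 6, 8, 5, 5, 4
clip = rng.standard_normal((T, C, H, W))
w1 = rng.standard_normal((R, C))
w2 = rng.standard_normal((C, R))
v = rng.standard_normal(C)
clip_feat, alpha = cascade(clip, w1, w2, 1.0, 1.0, v)
```

In this sketch `alpha` plays the role of the per-timestep contribution that the abstract describes visualizing, and the spatial map plays the role of the region-of-interest weighting. A trained model would learn all of the weights end to end.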

Acknowledgments

This study was supported by the National Natural Science Foundation of China under grant 62076103, the Key Realm R&D Program of Guangzhou under grant 202007030005, and the Guangdong Natural Science Foundation under grant 2019A1515011375.

Author information

Authors and Affiliations

School of Software, South China Normal University, Foshan, 528200, China
Yaoguang Ye, Yan Liang & Jiahui Pan
College of Mathematics and Informatics, College of Software Engineering, South China Agricultural University, Guangzhou, 510642, China
Yongqi Pan
Pazhou Lab, Guangzhou, 510330, China
Jiahui Pan

Corresponding author

Correspondence to Jiahui Pan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ye, Y., Pan, Y., Liang, Y. et al. A cascaded spatiotemporal attention network for dynamic facial expression recognition. Appl Intell 53, 5402–5415 (2023). https://doi.org/10.1007/s10489-022-03781-0

Accepted: 17 May 2022
Published: 23 June 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s10489-022-03781-0
