
Hard semantic mask strategy for automatic facial action unit recognition with teacher–student model

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Facial Action Coding System (FACS) is a widely used technique in affective computing that defines a series of facial action units (AUs) corresponding to localized regions of the face. Fine-grained feature information from these critical regions is crucial for accurate AU recognition. However, the conventional random masking used in Masked Image Modeling (MIM) often overlooks the inherent symmetry of faces and the complex interrelationships among facial muscles, leading to a loss of critical local details and poor AU recognition performance. To address these limitations, we propose a novel teacher–student MIM framework called Hard Semantic Masking Strategy Teacher–Student (HSMS-TS). Specifically, we first introduce a hard semantic mask strategy in the teacher model, which aims to guide the student network to focus on learning fine-grained AU-related representations. The student network then utilizes the attention maps from the pretrained teacher model to generate a more challenging mask from a predefined template, increasing the learning difficulty and helping the student acquire better AU-related representations. Experimental results on two publicly available datasets, BP4D and DISFA, demonstrate the effectiveness of the proposed method. Code will be publicly available at http://github.com/lzichen/HSMS-TS.
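To make the masking idea concrete, below is a minimal sketch of attention-guided hard masking in the spirit of the abstract. It is not the authors' implementation: the tensor shapes, the template set, and the scoring rule (preferring the predefined template that occludes the patches a pretrained teacher attends to most) are illustrative assumptions.

```python
import torch

def hard_semantic_mask(attn_maps, templates):
    """Pick, per image, the predefined mask template that occludes the
    patches the teacher attends to most, yielding a "hard" mask.

    attn_maps: (B, N) teacher attention score per patch (e.g., a ViT's
               [CLS]-to-patch attention averaged over heads)
    templates: (T, N) binary candidate masks (1 = patch is masked)
    returns:   (B, N) one chosen binary mask per image
    """
    # Score each template by the total teacher attention it hides;
    # a higher score means more AU-relevant patches are occluded.
    scores = attn_maps @ templates.float().t()  # (B, T)
    best = scores.argmax(dim=1)                 # (B,) index of hardest template
    return templates[best]                      # (B, N)

# Toy usage: random tensors stand in for real teacher outputs.
B, N, T = 4, 196, 8                             # batch, 14x14 patches, templates
attn = torch.rand(B, N).softmax(dim=-1)         # stand-in teacher attention
templates = (torch.rand(T, N) < 0.6).long()     # ~60%-density binary templates
mask = hard_semantic_mask(attn, templates)      # (4, 196) hard masks
```

Selecting the template that hides the most-attended patches is what makes the mask "hard": the student cannot rely on the most salient facial regions and must reconstruct them from context, which is the intuition behind increasing the learning difficulty.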




Data Availability Statement

The DISFA dataset analyzed during the current study is available at http://mohammadmahoor.com/disfa/. The BP4D dataset analyzed during the current study is available at http://www.cs.binghamton.edu/~lijun/Research/3DFE/3DFE_Analysis.html.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62366006, 62106054, 62167001, and 62366005.

Author information


Contributions

Zichen Liang: Conceptualization, Methodology, Formal analysis, Validation, Software, Writing - Original Draft. Yumei Tan: Conceptualization, Funding acquisition, Investigation. Haiying Xia: Conceptualization, Funding acquisition, Investigation. Shuxiang Song: Conceptualization, Funding acquisition, Resources, Supervision, Writing - Review and Editing.

Corresponding author

Correspondence to Shuxiang Song.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by Haojie Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liang, Z., Xia, H., Tan, Y. et al. Hard semantic mask strategy for automatic facial action unit recognition with teacher–student model. Multimedia Systems 30, 183 (2024). https://doi.org/10.1007/s00530-024-01385-x

