
Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning

  • Conference paper
  • In: Computer Vision – ACCV 2020 (ACCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12626)


Abstract

This paper tackles the challenging problem of estimating the intensity of Facial Action Units (AUs) with few labeled images. Contrary to previous works, our method does not require manually selected key frames, and it produces state-of-the-art results with as little as \(2\%\) of annotated frames, chosen at random. To this end, we propose a semi-supervised learning approach in which a spatio-temporal model, combining a feature extractor and a temporal module, is learned in two stages. The first stage uses datasets of unlabeled videos to learn a strong spatio-temporal representation of facial behavior dynamics based on contrastive learning. To our knowledge, we are the first to build upon this framework for modeling facial behavior in an unsupervised manner. The second stage uses another dataset of randomly chosen labeled frames to train a regressor on top of our spatio-temporal model for estimating AU intensity. We show that although backpropagation through time is applied only with respect to the network's outputs at extremely sparse and randomly chosen labeled frames, our model can be trained to estimate AU intensity accurately, thanks to the unsupervised pre-training of the first stage. We experimentally validate that our method outperforms existing methods when working with as little as \(2\%\) of randomly chosen data on both the DISFA and BP4D datasets, without a careful choice of labeled frames, a time-consuming task still required by previous approaches.
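
The two-stage recipe described in the abstract can be made concrete with a short sketch. The following PyTorch-style code is not the authors' implementation: the backbone, the exact contrastive objective (here a CPC-style InfoNCE loss that predicts future frame embeddings), the prediction offset k, and the number of AUs are all illustrative assumptions. It shows only the structure of the method: contrastive pre-training on unlabeled video, then fine-tuning in which the regression loss, and hence backpropagation through time, touches only the sparse labeled frames.

# Hypothetical sketch, not the authors' code: architecture and losses are
# illustrative assumptions consistent with the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalModel(nn.Module):
    """A frame-level feature extractor followed by a temporal module."""
    def __init__(self, feat_dim=128, hidden_dim=128):
        super().__init__()
        # Stand-in CNN backbone; the paper's actual choice may differ.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim))
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1))    # per-frame embeddings
        z = z.view(B, T, -1)
        h, _ = self.gru(z)                        # temporal context features
        return z, h

def info_nce(pred, target, temperature=0.1):
    """Contrastive loss: each prediction must match its own target
    against all other targets in the batch (InfoNCE)."""
    pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature      # (N, N) similarities
    labels = torch.arange(len(pred), device=pred.device)
    return F.cross_entropy(logits, labels)

model = SpatioTemporalModel()

# Stage 1: unsupervised pre-training on unlabeled videos. A small head
# predicts the embedding of the frame k steps ahead from the context.
predict = nn.Linear(128, 128)
opt = torch.optim.Adam(list(model.parameters()) + list(predict.parameters()))

def pretrain_step(frames, k=3):
    z, h = model(frames)
    pred = predict(h[:, :-k]).flatten(0, 1)       # predicted t+k embeddings
    target = z[:, k:].flatten(0, 1).detach()      # actual t+k embeddings
    loss = info_nce(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 2: supervised fine-tuning. The regression loss is evaluated only
# at the sparse, randomly chosen labeled frames indicated by `mask`.
regressor = nn.Linear(128, 12)                    # 12 AU intensities (example)
ft_opt = torch.optim.Adam(list(model.parameters()) + list(regressor.parameters()))

def finetune_step(frames, labels, mask):          # mask: (B, T) bool, ~2% True
    _, h = model(frames)
    pred = regressor(h)                           # (B, T, num_AUs)
    loss = F.mse_loss(pred[mask], labels[mask])   # labeled frames only
    ft_opt.zero_grad(); loss.backward(); ft_opt.step()
    return loss.item()

In the fine-tuning step, only pred[mask] enters the loss, so gradients flow back through the recurrent module exclusively from the roughly \(2\%\) of time steps that carry labels; the contrastive pre-training of stage 1 is what makes this sparse supervision sufficient.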



Notes

  1. We drop the dependency on the parameters \(\theta\) for the sake of clarity.



Author information


Corresponding author

Correspondence to Enrique Sanchez.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Sanchez, E., Bulat, A., Zaganidis, A., Tzimiropoulos, G. (2021). Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning. In: Ishikawa, H., Liu, CL., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science, vol. 12626. Springer, Cham. https://doi.org/10.1007/978-3-030-69541-5_7


  • DOI: https://doi.org/10.1007/978-3-030-69541-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-69540-8

  • Online ISBN: 978-3-030-69541-5

  • eBook Packages: Computer Science, Computer Science (R0)
