Abstract
This paper tackles the challenging problem of estimating the intensity of Facial Action Units (AUs) from few labeled images. Contrary to previous works, our method does not require manually selecting key frames, and it produces state-of-the-art results with as little as \(2\%\) of annotated frames, chosen at random. To this end, we propose a semi-supervised learning approach in which a spatio-temporal model, combining a feature extractor and a temporal module, is learned in two stages. The first stage uses datasets of unlabeled videos to learn, via contrastive learning, a strong spatio-temporal representation of facial behavior dynamics. To our knowledge, we are the first to build upon this framework for modeling facial behavior in an unsupervised manner. The second stage uses another dataset of randomly chosen labeled frames to train a regressor on top of our spatio-temporal model for estimating AU intensity. We show that, although backpropagation through time is applied only with respect to the network output at the extremely sparse, randomly chosen labeled frames, our model can still be trained to estimate AU intensity accurately, thanks to the unsupervised pre-training of the first stage. We experimentally validate that our method outperforms existing approaches when using as little as \(2\%\) of randomly chosen data on both the DISFA and BP4D datasets, without a careful choice of labeled frames, a time-consuming task still required in previous work.
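Since the abstract compresses the whole method into a few sentences, a minimal PyTorch sketch of the two-stage recipe may help. This is not the authors' implementation: the toy backbone, the GRU temporal module, the next-frame InfoNCE objective, and every name below (`SpatioTemporalModel`, `head`, `regressor`, the 12-AU output, the \(2\%\) mask) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalModel(nn.Module):
    """Illustrative stand-in for the paper's model: a per-frame feature
    extractor followed by a temporal module (here a GRU)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        # Toy convolutional encoder; the actual backbone is an assumption.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(B, T, -1)  # frame features
        h, _ = self.temporal(z)                   # temporal context, (B, T, D)
        return z, h

def info_nce(pred, target, temperature=0.1):
    """Contrastive (InfoNCE) loss: each prediction must match its own target
    against all other targets in the batch, which act as negatives."""
    pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature      # (N, N) similarity matrix
    labels = torch.arange(len(pred), device=pred.device)
    return F.cross_entropy(logits, labels)

model = SpatioTemporalModel()
head = nn.Linear(128, 128)                        # prediction head (assumed)

# Stage 1 (unsupervised): from the temporal context at time t, predict the
# frame feature at t+1, contrasted against all other frames in the batch.
clips = torch.randn(4, 8, 3, 64, 64)              # dummy unlabeled clips
z, h = model(clips)
loss_ssl = info_nce(head(h[:, :-1]).flatten(0, 1), z[:, 1:].flatten(0, 1))

# Stage 2 (semi-supervised): regress AU intensities on top of the learned
# features. The loss is masked to the ~2% randomly labeled frames, so
# backpropagation through time starts only at those frames.
regressor = nn.Linear(128, 12)                    # e.g. 12 AUs (assumed)
intensities = torch.rand(4, 8, 12) * 5            # AU intensities in [0, 5]
mask = (torch.rand(4, 8) < 0.02).float()          # sparse random label mask
per_frame = F.mse_loss(regressor(h), intensities, reduction='none').mean(-1)
loss_sup = (per_frame * mask).sum() / mask.sum().clamp(min=1)
```

The detail the sketch preserves is the one the abstract stresses: the supervised loss touches only the masked frames, so the temporal module receives gradient exclusively through backpropagation through time from those sparse outputs, and it is the contrastive pre-training of the first stage that makes this sparse signal sufficient.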
Notes
1. We drop the dependency on the parameters \(\theta\) for the sake of clarity.
References
Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: NeurIPS (2019)
Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 139–156. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_9
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP (2014)
Chu, W., De la Torre, F., Cohn, J.F.: Learning spatial and temporal cues for multi-label facial action unit detection. In: FG (2017)
Chu, W.S., De la Torre, F., Cohn, J.F.: Learning facial action units with spatiotemporal cues and multi-label sampling. Image Vis. Comput. 81, 1–14 (2019)
Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV (2015)
Ekman, P., Friesen, W., Hager, J.: Facial Action Coding System. A Human Face, Salt Lake City (2002)
Eleftheriadis, S., Rudovic, O., Deisenroth, M.P., Pantic, M.: Variational Gaussian process auto-encoder for ordinal prediction of facial action units. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10112, pp. 154–170. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54184-6_10
Ertugrul, I.O., Cohn, J.F., Jeni, L.A., Zhang, Z., Yin, L., Ji, Q.: Crossing domains for AU coding: perspectives, approaches, and measures. IEEE Trans. Biomet. Behav. Identity Sci. 2(2), 158–171 (2020)
Ertugrul, I.O., Jeni, L.A., Cohn, J.F.: PAttNet: patch-attentive deep network for action unit detection. In: BMVC (2019)
Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)
Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In: AISTATS (2010)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: CVPR (2006)
Han, T., Xie, W., Zisserman, A.: Video representation learning by dense predictive coding. In: ICCV - Workshops (2019)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hénaff, O.J., Razavi, A., Doersch, C., Eslami, S., van den Oord, A.: Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272 (2019)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
Jaiswal, S., Valstar, M.: Deep learning the dynamic appearance and shape of facial action units. In: Winter Conference on Applications of Computer Vision (2016)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Kollias, D., Nicolaou, M.A., Kotsia, I., Zhao, G., Zafeiriou, S.: Recognition of affect in the wild using deep neural networks. In: CVPR - Workshops, pp. 1972–1979. IEEE (2017)
Kollias, D., Schulc, A., Hajiyev, E., Zafeiriou, S.: Analysing affective behavior in the first ABAW 2020 competition. arXiv preprint arXiv:2001.11409 (2020)
Kollias, D., et al.: Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond. IJCV 1–23 (2019)
Kollias, D., Zafeiriou, S.: Aff-Wild2: extending the Aff-Wild database for affect recognition. arXiv preprint arXiv:1811.07770 (2018)
Kollias, D., Zafeiriou, S.: A multi-task learning & generation framework: valence-arousal, action units & primary expressions. arXiv preprint arXiv:1811.07771 (2018)
Kollias, D., Zafeiriou, S.: Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace. arXiv preprint arXiv:1910.04855 (2019)
Li, W., Abtahi, F., Zhu, Z.: Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing. In: CVPR (2017)
Li, W., Abtahi, F., Zhu, Z., Yin, L.: EAC-Net: a region-based deep enhancing and cropping approach for facial action unit detection. In: FG (2017)
Li, Y., Zeng, J., Shan, S., Chen, X.: Self-supervised representation learning from videos for facial action unit detection. In: CVPR (2019)
Martinez, B., Valstar, M.F., Jiang, B., Pantic, M.: Automatic analysis of facial actions: a survey. IEEE Trans. Affect. Comput. 10, 325–347 (2017)
Mavadati, S.M., Mahoor, M.H., Bartlett, K., Trinh, P., Cohn, J.F.: DISFA: a spontaneous facial action intensity database. IEEE Trans. Affect. Comput. 4(2), 151–160 (2013)
Ming, Z., Bugeau, A., Rouas, J., Shochi, T.: Facial action units intensity estimation by the fusion of features with multi-kernel support vector machine. In: FG (2015)
Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_32
Ntinou, I., Sanchez, E., Bulat, A., Valstar, M., Tzimiropoulos, G.: A transfer learning approach to heatmap regression for action unit intensity estimation. arXiv preprint arXiv:2004.06657 (2020)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Paszke, A., et al.: Automatic differentiation in PyTorch. In: Autodiff Workshop - NeurIPS (2017)
Rudovic, O., Pavlovic, V., Pantic, M.: Automatic pain intensity estimation with heteroscedastic conditional ordinal random fields. In: Bebis, G., et al. (eds.) ISVC 2013. LNCS, vol. 8034, pp. 234–243. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41939-3_23
Rudovic, O., Pavlovic, V., Pantic, M.: Context-sensitive conditional ordinal random fields for facial action intensity estimation. In: ICCV - Workshops (2013)
Rudovic, O., Pavlovic, V., Pantic, M.: Context-sensitive dynamic ordinal regression for intensity estimation of facial action units. IEEE Trans. Pattern Anal. Mach. Intell. 37, 944–958 (2015)
Sanchez, E., Tzimiropoulos, G., Valstar, M.: Joint action unit localisation and intensity estimation through heatmap regression. In: BMVC (2018)
Shrout, P., Fleiss, J.: Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86, 420 (1979)
Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)
Tran, D.L., Walecki, R., Rudovic, O., Eleftheriadis, S., Schuller, B., Pantic, M.: DeepCoder: semi-parametric variational autoencoders for automatic facial action coding. In: ICCV (2017)
Valstar, M.F., et al.: FERA 2015 - second facial expression recognition and analysis challenge. In: FG (2015)
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: ICML, pp. 1096–1103 (2008)
Walecki, R., Rudovic, O., Pavlovic, V., Schuller, B., Pantic, M.: Deep structured learning for facial action unit intensity estimation. In: CVPR (2017)
Wang, S., Peng, G.: Weakly supervised dual learning for facial action unit recognition. IEEE Trans. Multimed. 21(12) (2019)
Wang, S., Pan, B., Wu, S., Ji, Q.: Deep facial action unit recognition and intensity estimation from partially labelled data. IEEE Trans. Affect. Comput. (2019)
Werbos, P.J.: Backpropagation through time: what it does and how to do it. Proc. IEEE 78(10), 1550–1560 (1990)
Wu, Y., Ji, Q.: Constrained joint cascade regression framework for simultaneous facial action unit recognition and facial landmark detection. In: CVPR (2016)
Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR (2018)
Yang, L., Ertugrul, I.O., Cohn, J.F., Hammal, Z., Jiang, D., Sahli, H.: FACS3D-Net: 3D convolution based spatiotemporal representation for action unit detection. In: ACII (2019)
Ye, M., Zhang, X., Yuen, P.C., Chang, S.F.: Unsupervised embedding learning via invariant and spreading instance feature. In: CVPR (2019)
Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-Wild: valence and arousal ‘in-the-wild’ challenge. In: CVPR - Workshops. IEEE (2017)
Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_40
Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., Li, S.Z.: S3FD: single shot scale-invariant face detector. In: ICCV (2017)
Zhang, X., et al.: BP4D-spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image Vis. Comput. 32, 692–706 (2014)
Zhang, Y., Dong, W., Hu, B.G., Ji, Q.: Classifier learning with prior probabilities for facial action unit recognition. In: CVPR (2018)
Zhang, Y., Dong, W., Hu, B.G., Ji, Q.: Weakly-supervised deep convolutional neural network learning for facial action unit intensity estimation. In: CVPR (2018)
Zhang, Y., Jiang, H., Wu, B., Fan, Y., Ji, Q.: Context-aware feature and label fusion for facial action unit intensity estimation with partially labeled data. In: ICCV (2019)
Zhang, Y., et al.: Joint representation and estimator learning for facial action unit intensity estimation. In: CVPR (2019)
Zhang, Y., Zhao, R., Dong, W., Hu, B.G., Ji, Q.: Bilateral ordinal relevance multi-instance regression for facial action unit intensity estimation. In: CVPR (2018)
Zhao, K., Chu, W.S., Zhang, H.: Deep region and multi-label learning for facial action unit detection. In: CVPR (2016)
Zhao, R., Gan, Q., Wang, S., Ji, Q.: Facial expression intensity estimation using ordinal information. In: CVPR (2016)
Cite this paper
Sanchez, E., Bulat, A., Zaganidis, A., Tzimiropoulos, G. (2021). Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning. In: Ishikawa, H., Liu, CL., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science, vol. 12626. Springer, Cham. https://doi.org/10.1007/978-3-030-69541-5_7