Abstract
This paper tackles the challenging problem of estimating the intensity of Facial Action Units (AUs) from few labeled images. Contrary to previous works, our method does not require manually selecting key frames, and it produces state-of-the-art results with as little as \(2\%\) of annotated frames, chosen at random. To this end, we propose a semi-supervised learning approach in which a spatio-temporal model, combining a feature extractor and a temporal module, is learned in two stages. The first stage uses datasets of unlabeled videos to learn, via contrastive learning, a strong spatio-temporal representation of facial behavior dynamics. To our knowledge, we are the first to build upon this framework for modeling facial behavior in an unsupervised manner. The second stage uses another dataset of randomly chosen labeled frames to train a regressor on top of our spatio-temporal model for estimating AU intensity. We show that, although backpropagation through time is applied only with respect to the network output at the extremely sparse, randomly chosen labeled frames, our model can still be trained to estimate AU intensity accurately, thanks to the unsupervised pre-training of the first stage. We experimentally validate that our method outperforms existing approaches when using as little as \(2\%\) of randomly chosen data on both the DISFA and BP4D datasets, without a careful choice of labeled frames, a time-consuming task still required in previous work.
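Since the abstract compresses the whole method into a few sentences, a minimal PyTorch sketch of the two-stage recipe may help. This is not the authors' implementation: the toy backbone, the GRU temporal module, the next-frame InfoNCE objective, and every name below (`SpatioTemporalModel`, `head`, `regressor`, the 12-AU output, the \(2\%\) mask) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalModel(nn.Module):
    """Illustrative stand-in for the paper's model: a per-frame feature
    extractor followed by a temporal module (here a GRU)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        # Toy convolutional encoder; the actual backbone is an assumption.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(B, T, -1)  # frame features
        h, _ = self.temporal(z)                   # temporal context, (B, T, D)
        return z, h

def info_nce(pred, target, temperature=0.1):
    """Contrastive (InfoNCE) loss: each prediction must match its own target
    against all other targets in the batch, which act as negatives."""
    pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature      # (N, N) similarity matrix
    labels = torch.arange(len(pred), device=pred.device)
    return F.cross_entropy(logits, labels)

model = SpatioTemporalModel()
head = nn.Linear(128, 128)                        # prediction head (assumed)

# Stage 1 (unsupervised): from the temporal context at time t, predict the
# frame feature at t+1, contrasted against all other frames in the batch.
clips = torch.randn(4, 8, 3, 64, 64)              # dummy unlabeled clips
z, h = model(clips)
loss_ssl = info_nce(head(h[:, :-1]).flatten(0, 1), z[:, 1:].flatten(0, 1))

# Stage 2 (semi-supervised): regress AU intensities on top of the learned
# features. The loss is masked to the ~2% randomly labeled frames, so
# backpropagation through time starts only at those frames.
regressor = nn.Linear(128, 12)                    # e.g. 12 AUs (assumed)
intensities = torch.rand(4, 8, 12) * 5            # AU intensities in [0, 5]
mask = (torch.rand(4, 8) < 0.02).float()          # sparse random label mask
per_frame = F.mse_loss(regressor(h), intensities, reduction='none').mean(-1)
loss_sup = (per_frame * mask).sum() / mask.sum().clamp(min=1)
```

The detail the sketch preserves is the one the abstract stresses: the supervised loss touches only the masked frames, so the temporal module receives gradient exclusively through backpropagation through time from those sparse outputs, and it is the contrastive pre-training of the first stage that makes this sparse signal sufficient.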
Notes
1. We drop the dependency on the parameters \(\theta\) for the sake of clarity.
References
Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: NeurIPS (2019)
Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 139–156. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_9
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP (2014)
Chu, W., De la Torre, F., Cohn, J.F.: Learning spatial and temporal cues for multi-label facial action unit detection. In: FG (2017)
Chu, W.S., De la Torre, F., Cohn, J.F.: Learning facial action units with spatiotemporal cues and multi-label sampling. Image Vis. Comput. 81, 1–14 (2019)
Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV (2015)
Ekman, P., Friesen, W., Hager, J.: Facial Action Coding System. A Human Face, Salt Lake City (2002)
Eleftheriadis, S., Rudovic, O., Deisenroth, M.P., Pantic, M.: Variational Gaussian process auto-encoder for ordinal prediction of facial action units. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10112, pp. 154–170. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54184-6_10
Ertugrul, I.O., Cohn, J.F., Jeni, L.A., Zhang, Z., Yin, L., Ji, Q.: Crossing domains for AU coding: perspectives, approaches, and measures. IEEE Trans. Biomet. Behav. Identity Sci. 2(2), 158–171 (2020)
Ertugrul, I.O., Jeni, L.A., Cohn, J.F.: PAttNet: patch-attentive deep network for action unit detection. In: BMVC (2019)
Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)
Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In: AISTATS (2010)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: CVPR (2006)
Han, T., Xie, W., Zisserman, A.: Video representation learning by dense predictive coding. In: ICCV - Workshops (2019)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hénaff, O.J., Razavi, A., Doersch, C., Eslami, S., van den Oord, A.: Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272 (2019)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
Jaiswal, S., Valstar, M.: Deep learning the dynamic appearance and shape of facial action units. In: Winter Conference on Applications of Computer Vision (2016)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Kollias, D., Nicolaou, M.A., Kotsia, I., Zhao, G., Zafeiriou, S.: Recognition of affect in the wild using deep neural networks. In: CVPR - Workshops, pp. 1972–1979. IEEE (2017)
Kollias, D., Schulc, A., Hajiyev, E., Zafeiriou, S.: Analysing affective behavior in the first ABAW 2020 competition. arXiv preprint arXiv:2001.11409 (2020)
Kollias, D., et al.: Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond. IJCV 1–23 (2019)
Kollias, D., Zafeiriou, S.: Aff-Wild2: extending the Aff-Wild database for affect recognition. arXiv preprint arXiv:1811.07770 (2018)
Kollias, D., Zafeiriou, S.: A multi-task learning & generation framework: valence-arousal, action units & primary expressions. arXiv preprint arXiv:1811.07771 (2018)
Kollias, D., Zafeiriou, S.: Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace. arXiv preprint arXiv:1910.04855 (2019)
Li, W., Abtahi, F., Zhu, Z.: Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing. In: CVPR (2017)
Li, W., Abtahi, F., Zhu, Z., Yin, L.: EAC-Net: a region-based deep enhancing and cropping approach for facial action unit detection. In: FG (2017)
Li, Y., Zeng, J., Shan, S., Chen, X.: Self-supervised representation learning from videos for facial action unit detection. In: CVPR (2019)
Martinez, B., Valstar, M.F., Jiang, B., Pantic, M.: Automatic analysis of facial actions: a survey. IEEE Trans. Affect. Comput. 10, 325–347 (2017)
Mavadati, S.M., Mahoor, M.H., Bartlett, K., Trinh, P., Cohn, J.F.: DISFA: a spontaneous facial action intensity database. IEEE Trans. Affect. Comput. 4(2), 151–160 (2013)
Ming, Z., Bugeau, A., Rouas, J., Shochi, T.: Facial action units intensity estimation by the fusion of features with multi-kernel support vector machine. In: FG (2015)
Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_32
Ntinou, I., Sanchez, E., Bulat, A., Valstar, M., Tzimiropoulos, G.: A transfer learning approach to heatmap regression for action unit intensity estimation. arXiv preprint arXiv:2004.06657 (2020)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Paszke, A., et al.: Automatic differentiation in PyTorch. In: Autodiff Workshop - NeurIPS (2017)
Rudovic, O., Pavlovic, V., Pantic, M.: Automatic pain intensity estimation with heteroscedastic conditional ordinal random fields. In: Bebis, G., et al. (eds.) ISVC 2013. LNCS, vol. 8034, pp. 234–243. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41939-3_23
Rudovic, O., Pavlovic, V., Pantic, M.: Context-sensitive conditional ordinal random fields for facial action intensity estimation. In: ICCV - Workshops (2013)
Rudovic, O., Pavlovic, V., Pantic, M.: Context-sensitive dynamic ordinal regression for intensity estimation of facial action units. IEEE Trans. Pattern Anal. Mach. Intell. 37, 944–958 (2015)
Sanchez, E., Tzimiropoulos, G., Valstar, M.: Joint action unit localisation and intensity estimation through heatmap regression. In: BMVC (2018)
Shrout, P., Fleiss, J.: Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86, 420 (1979)
Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)
Tran, D.L., Walecki, R., Rudovic, O., Eleftheriadis, S., Schuller, B., Pantic, M.: DeepCoder: semi-parametric variational autoencoders for automatic facial action coding. In: ICCV (2017)
Valstar, M.F., et al.: FERA 2015 - second facial expression recognition and analysis challenge. In: FG (2015)
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: ICML, pp. 1096–1103 (2008)
Walecki, R., Rudovic, O., Pavlovic, V., Schuller, B., Pantic, M.: Deep structured learning for facial action unit intensity estimation. In: CVPR (2017)
Wang, S., Peng, G.: Weakly supervised dual learning for facial action unit recognition. IEEE Trans. Multimed. 21(12) (2019)
Wang, S., Pan, B., Wu, S., Ji, Q.: Deep facial action unit recognition and intensity estimation from partially labelled data. IEEE Trans. Affect. Comput. (2019)
Werbos, P.J.: Backpropagation through time: what it does and how to do it. Proc. IEEE 78(10), 1550–1560 (1990)
Wu, Y., Ji, Q.: Constrained joint cascade regression framework for simultaneous facial action unit recognition and facial landmark detection. In: CVPR (2016)
Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR (2018)
Yang, L., Ertugrul, I.O., Cohn, J.F., Hammal, Z., Jiang, D., Sahli, H.: FACS3D-Net: 3D convolution based spatiotemporal representation for action unit detection. In: ACII (2019)
Ye, M., Zhang, X., Yuen, P.C., Chang, S.F.: Unsupervised embedding learning via invariant and spreading instance feature. In: CVPR (2019)
Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-Wild: valence and arousal ‘in-the-wild’ challenge. In: CVPR - Workshops. IEEE (2017)
Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_40
Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., Li, S.Z.: S3FD: single shot scale-invariant face detector. In: ICCV (2017)
Zhang, X., et al.: BP4D-spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image Vis. Comput. 32, 692–706 (2014)
Zhang, Y., Dong, W., Hu, B.G., Ji, Q.: Classifier learning with prior probabilities for facial action unit recognition. In: CVPR (2018)
Zhang, Y., Dong, W., Hu, B.G., Ji, Q.: Weakly-supervised deep convolutional neural network learning for facial action unit intensity estimation. In: CVPR (2018)
Zhang, Y., Jiang, H., Wu, B., Fan, Y., Ji, Q.: Context-aware feature and label fusion for facial action unit intensity estimation with partially labeled data. In: ICCV (2019)
Zhang, Y., et al.: Joint representation and estimator learning for facial action unit intensity estimation. In: CVPR (2019)
Zhang, Y., Zhao, R., Dong, W., Hu, B.G., Ji, Q.: Bilateral ordinal relevance multi-instance regression for facial action unit intensity estimation. In: CVPR (2018)
Zhao, K., Chu, W.S., Zhang, H.: Deep region and multi-label learning for facial action unit detection. In: CVPR (2016)
Zhao, R., Gan, Q., Wang, S., Ji, Q.: Facial expression intensity estimation using ordinal information. In: CVPR (2016)
Cite this paper
Sanchez, E., Bulat, A., Zaganidis, A., Tzimiropoulos, G. (2021). Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning. In: Ishikawa, H., Liu, CL., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science, vol. 12626. Springer, Cham. https://doi.org/10.1007/978-3-030-69541-5_7