Abstract
Generating dynamic 2D image-based facial expressions is a challenging task in facial animation. Much research has focused on performance-driven facial animation from given videos or images of a target face, while animating a single face image driven by emotion labels is a less explored problem. In this work, we treat the task of animating a single face image from emotion labels as a conditional video prediction problem, and propose a novel framework that combines factored conditional restricted Boltzmann machines (FCRBM) with a reconstruction contractive auto-encoder (RCAE). A modified RCAE with an associated efficient training strategy is used to extract low-dimensional features and reconstruct face images. The FCRBM is used as an animator to predict a facial expression sequence in the feature space, given discrete emotion labels and a frontal neutral face image as input. Both quantitative and qualitative evaluations on two facial expression databases, together with a comparison to the state of the art, show the effectiveness of the proposed framework for animating a frontal neutral face image from given emotion labels.
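To make the data flow described in the abstract concrete, the following is a minimal sketch of conditional video prediction in a feature space: a neutral face is encoded, a predictor conditioned on an emotion label rolls the features forward frame by frame, and each predicted feature vector is decoded back to an image. The encoder, decoder and predictor here are random linear maps used purely as stand-ins; they are not the RCAE or FCRBM, and all names, dimensions and the history length `order` are assumptions made for illustration.

```python
import numpy as np

def animate(neutral_face, label, encode, predict_next, decode, n_frames, order=2):
    """Roll out an expression sequence in feature space, then decode each frame."""
    z0 = encode(neutral_face)
    history = [z0] * order  # pad the history with the neutral-face features
    frames = []
    for _ in range(n_frames):
        # Predict the next feature vector from the last `order` frames and the label.
        z_next = predict_next(np.stack(history[-order:]), label)
        history.append(z_next)
        frames.append(decode(z_next))
    return np.stack(frames)

# Toy stand-ins (NOT the RCAE or FCRBM): small random linear maps.
rng = np.random.default_rng(0)
D, K, L = 64, 8, 6  # hypothetical sizes: pixels, feature dimension, emotion labels
W_enc = rng.normal(scale=0.1, size=(K, D))
W_dec = rng.normal(scale=0.1, size=(D, K))
W_hist = rng.normal(scale=0.1, size=(K, K))
W_lab = rng.normal(scale=0.1, size=(K, L))

def encode(x):
    return W_enc @ x

def decode(z):
    return W_dec @ z

def predict_next(history, label):
    # Last frame plus a term driven by the recent history and the one-hot label.
    return history[-1] + W_hist @ history.mean(axis=0) + W_lab @ label

face = rng.random(D)   # flattened stand-in for a frontal neutral face image
label = np.eye(L)[3]   # one-hot emotion label
video = animate(face, label, encode, predict_next, decode, n_frames=10)
```

In this sketch the emotion label enters only through the predictor, so changing the label changes the generated sequence while the encoder and decoder stay fixed, which mirrors the separation of feature extraction and animation described above.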
References
Alain G, Bengio Y (2014) What regularized auto-encoders learn from the data-generating distribution. J Mach Learn Res 15(1):3563–3593
Anderson R, Stenger B, Wan V, Cipolla R (2013) Expressive visual text-to-speech using active appearance models. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3382–3389
Averbuch-Elor H, Cohen-Or D, Kopf J, Cohen MF (2017) Bringing portraits to life. ACM Trans Graph (TOG) 36(6):196
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
Blanz V, Vetter T (1999) A morphable model for the synthesis of 3d faces. In: Proceedings of the 26th annual conference on computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., pp 187–194
Blanz V, Basso C, Poggio T, Vetter T (2003) Reanimating faces in images and video. In: Computer graphics forum vol 22. Wiley Online Library, pp 641–650
Cao Y, Tien WC, Faloutsos P, Pighin F (2005) Expressive speech-driven facial animation. ACM Trans Graph (TOG) 24(4):1283–1302
Cao C, Wu H, Weng Y, Shao T, Zhou K (2016) Real-time facial animation with image-based dynamic avatars. ACM Trans Graph (TOG) 35(4):126
Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans Pattern Anal Mach Intell 23(6):681–685
Deng Z, Noh J (2008) Computer facial animation: a survey. In: Data-driven 3D facial animation. Springer, pp 1–28
Ding H, Zhou SK, Chellappa R (2017) Facenet2expnet: regularizing a deep face recognition net for expression recognition. In: 2017 12th IEEE International conference on automatic face & gesture recognition (FG 2017). IEEE, pp 118–126
Ersotelos N, Dong F (2008) Building highly realistic facial modeling and animation: a survey. Vis Comput 24(1):13–30
Fan B, Wang L, Soong FK, Xie L (2015) Photo-real talking head with deep bidirectional lstm. In: 2015 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4884–4888
Garrido P, Zollhöfer M, Casas D, Valgaerts L, Varanasi K, Pérez P, Theobalt C (2016) Reconstruction of personalized 3d face rigs from monocular video. ACM Trans Graph (TOG) 35(3):28
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
Ichim AE, Bouaziz S, Pauly M (2015) Dynamic 3d avatar creation from hand-held video input. ACM Trans Graph (TOG) 34(4):45
Jiang D, Zhao Y, Sahli H, Zhang Y (2014) Speech driven photo realistic facial animation based on an articulatory dbn model and aam features. Multimed Tools Appl 73(1):397–415
Kingma DP, Welling M (2013) Auto-encoding variational bayes. arXiv:1312.6114
Liu Z, Shan Y, Zhang Z (2001) Expressive expression mapping with ratio images. In: Proceedings of the 28th annual conference on computer graphics and interactive techniques. ACM, pp 271–276
Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In: 2010 IEEE Computer society conference on computer vision and pattern recognition workshops (CVPRW). IEEE, pp 94–101
Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv:1411.1784
Olszewski K, Li Z, Yang C, Zhou Y, Yu R, Huang Z, Xiang S, Saito S, Kohli P, Li H (2017) Realistic dynamic facial textures from a single image using gans. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5429–5438
Oveneke MC, Aliosha-Perez M, Zhao Y, Jiang D, Sahli H (2016) Efficient convolutional auto-encoding via random convexification and frequency-domain minimization. arXiv:1611.09232
Oveneke MC, Zhao Y, Jiang D, Sahli H (2017) Expressive face frontalization and its application to facial expression analysis. Tech. rep., Vrije Universiteit Brussel
Rifai S, Vincent P, Muller X, Glorot X, Bengio Y (2011) Contractive auto-encoders: explicit invariance during feature extraction. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 833–840
Shu Z, Yumer E, Hadap S, Sunkavalli K, Shechtman E, Samaras D (2017) Neural face editing with intrinsic image disentangling. arXiv:1704.04131
Stoiber N, Seguier R, Breton G (2009) Automatic design of a control interface for a synthetic face. In: Proceedings of the 14th international conference on intelligent user interfaces. ACM, pp 207–216
Susskind JM, Anderson AK, Hinton GE, Movellan JR (2008) Generating facial expressions with deep belief nets. INTECH Open Access Publisher
Sutskever I, Hinton GE, Taylor GW (2009) The recurrent temporal restricted boltzmann machine. In: Advances in neural information processing systems, pp 1601–1608
Taylor GW, Hinton GE (2009) Factored conditional restricted Boltzmann machines for modeling motion style. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 1025–1032
Taylor GW, Hinton GE, Roweis ST (2007) Modeling human motion using binary latent variables. Adv Neural Inf Process Syst 19:1345
Thies J, Zollhöfer M, Stamminger M, Theobalt C, Nießner M (2016) Face2face: real-time face capture and reenactment of rgb videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2387–2395
Tulyakov S, Liu MY, Yang X, Kautz J (2018) Mocogan: decomposing motion and content for video generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1526–1535
Villegas R, Yang J, Zou Y, Sohn S, Lin X, Lee H (2017) Learning to generate long-term future via hierarchical prediction. arXiv:1704.05831
Wang Z, Bovik AC (2009) Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Process Mag 26(1):98–117
Wang L, Soong FK (2015) Hmm trajectory-guided sample selection for photo-realistic talking head. Multimed Tools Appl 74(22):9849–9869
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
Yan X, Yang J, Sohn K, Lee H (2016) Attribute2image: conditional image generation from visual attributes. In: European conference on computer vision. Springer, pp 776–791
Zhao G, Huang X, Taini M, Li SZ, Pietikäinen M (2011) Facial expression recognition from near-infrared videos. Image Vis Comput 29(9):607–619
Zhao Y, Jiang D, Sahli H (2015) 3d emotional facial animation synthesis with factored conditional restricted Boltzmann machines. In: 2015 International conference on affective computing and intelligent interaction (ACII). IEEE, pp 797–803
Acknowledgements
We thank Averbuch-Elor et al. for kindly providing the sequence for comparison. We thank Tao Yang for kindly processing the facial expression recognition experiments, and all the students for their participation in the subjective analysis. We would also like to thank the reviewer for their detailed comments and suggestions on the manuscript, which identified important areas that required improvement. This work is supported by the Chinese Scholarship Council (CSC) (grant 201506290085), the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), the Natural Science Foundation of China (grant 61273265), the VUB Interdisciplinary Research Program through the EMO-App project, and the Agency for Innovation by Science and Technology in Flanders (IWT), PhD grant nr. 131814.
Cite this article
Zhao, Y., Oveneke, M.C., Jiang, D. et al. A video prediction approach for animating single face image. Multimed Tools Appl 78, 16389–16410 (2019). https://doi.org/10.1007/s11042-018-6952-y
DOI: https://doi.org/10.1007/s11042-018-6952-y