ABSTRACT
Recent advances in talking face generation from speech have mostly focused on lip synchronization and realistic facial movements such as eye blinks and eyebrow motions, but they do not generate head motions that are meaningful with respect to the speech. This results in a lack of realism, especially for long speech. A few recent methods attempt to animate head motion, but they mostly rely on a short driving video of head movements. In general, head motion prediction depends largely on the prosodic information of the speech within the current time window. In this paper, we propose a method for generating speech-driven, realistic talking face animation with speech-coherent head motions, accurate lip sync, natural eye blinks, and high-fidelity texture. In particular, we propose an attention-based GAN that identifies the audio segments most correlated with the speaker's head motion and learns the relationship between the prosodic information of the speech and the corresponding head motions. Experimental results show that our animations are significantly better than state-of-the-art methods in terms of output video quality, realism of head movements, lip sync, and eye blinks, both qualitatively and quantitatively. Moreover, our user study shows that our speech-coherent head motions make the animations more appealing to users.
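The abstract does not specify the network's details, so the following is a minimal, hypothetical PyTorch sketch of the core idea it describes: self-attention over per-frame prosodic audio features to weight the audio context most correlated with head movement and predict a head-pose sequence. All module names, feature dimensions, and the choice of three Euler angles as the pose output are illustrative assumptions, not the authors' implementation; the GAN generator and discriminator that would wrap such a predictor are omitted.

```python
import torch
import torch.nn as nn

class AudioAttentionHeadPose(nn.Module):
    """Hypothetical sketch: self-attention over per-frame prosodic audio
    features (e.g., MFCCs plus pitch/energy) predicting a head-pose
    sequence. Illustrates the abstract's idea, not the paper's network."""

    def __init__(self, feat_dim=64, hidden=128, heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)            # embed audio features
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, 3)                   # pitch, yaw, roll per frame

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, feat_dim)
        x = self.proj(audio_feats)
        # Self-attention lets each frame attend to the audio segments
        # most correlated with head motion at that instant.
        x, attn_weights = self.attn(x, x, x)
        return self.head(x), attn_weights                  # (B, T, 3), attention map

if __name__ == "__main__":
    model = AudioAttentionHeadPose()
    feats = torch.randn(2, 100, 64)    # 2 clips, 100 frames of prosody features
    pose, weights = model(feats)
    print(pose.shape)                  # torch.Size([2, 100, 3])
```

In a full pipeline, a sequence of predicted poses like this would condition the face generator, with an adversarial loss encouraging natural motion; the attention weights also offer a way to inspect which audio frames drive each head movement.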
Index Terms
- Realistic talking face animation with speech-induced head motion