DOI: 10.1145/3490035.3490305
Research article, ICVGIP '21

Realistic talking face animation with speech-induced head motion

Published: 19 December 2021

ABSTRACT

Recent advances in talking face generation from speech have mostly focused on lip synchronization and realistic facial movements such as eye blinks and eyebrow motions, but they do not generate meaningful head motions driven by the speech. This results in a lack of realism, especially for long speech. A few recent methods attempt to animate head motions, but they mostly rely on a short driving head-motion video. In general, the prediction of head motion depends largely on the prosodic information of the speech within the current time window. In this paper, we propose a method for generating speech-driven, realistic talking face animation with speech-coherent head motions, accurate lip sync, natural eye blinks, and high-fidelity texture. In particular, we propose an attention-based GAN network that identifies the audio segments most correlated with the speaker's head motion and learns the relationship between the prosodic information of the speech and the corresponding head motions. Experimental results show that our animations are significantly better than those of state-of-the-art methods in terms of output video quality, realism of head movements, lip sync, and eye blinks, both qualitatively and quantitatively. Moreover, our user study shows that speech-coherent head motions make the animation more appealing to users.
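
As a rough illustration of how an attention mechanism over speech prosody can be wired to head-pose prediction, the following PyTorch sketch scores the audio frames of a window by relevance and regresses a per-frame head pose from the attended context. It is not the paper's architecture: the adversarial (GAN) components are omitted, and the feature dimension (64-D prosody vectors), the 6-DoF pose representation, and all class and variable names (e.g. AudioAttentionHeadPose) are assumptions made here for illustration.

# Illustrative sketch only (hypothetical names and dimensions, GAN parts omitted):
# attend over per-frame prosody features (e.g. pitch/energy/MFCC) and regress a
# 6-DoF head pose (3 rotation + 3 translation) for every frame in the window.
import torch
import torch.nn as nn

class AudioAttentionHeadPose(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128, pose_dim=6):
        super().__init__()
        # Encode the per-frame prosody features with a GRU.
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Scalar attention score per frame (structured self-attention style).
        self.attn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )
        # Decode each frame's state plus the attended context into a pose.
        self.decoder = nn.Linear(hidden_dim * 2, pose_dim)

    def forward(self, prosody):                 # prosody: (B, T, feat_dim)
        h, _ = self.encoder(prosody)            # (B, T, hidden_dim)
        scores = self.attn(h)                   # (B, T, 1)
        weights = torch.softmax(scores, dim=1)  # attention over time
        context = (weights * h).sum(dim=1, keepdim=True)       # (B, 1, hidden_dim)
        context = context.expand(-1, h.size(1), -1)            # broadcast over time
        poses = self.decoder(torch.cat([h, context], dim=-1))  # (B, T, pose_dim)
        return poses, weights.squeeze(-1)       # pose sequence + frame weights

# Usage: a window of 100 audio frames with 64-D prosody features.
model = AudioAttentionHeadPose()
poses, attn = model(torch.randn(2, 100, 64))
print(poses.shape, attn.shape)  # torch.Size([2, 100, 6]) torch.Size([2, 100])

In a full pipeline along the lines described above, such a module would condition a generator that renders the face frames, with a discriminator judging the realism of the predicted head-motion sequence.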


• Published in

  ICVGIP '21: Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing
  December 2021, 428 pages
  ISBN: 9781450375962
  DOI: 10.1145/3490035
  Copyright © 2021 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 19 December 2021


      Qualifiers

      • research-article

      Acceptance Rates

Overall acceptance rate: 95 of 286 submissions, 33%
