Make Audio Solely Drive Lip in Talking Face Video Synthesis

  • Conference paper
  • Part of the conference proceedings: Artificial Neural Networks and Machine Learning – ICANN 2024 (ICANN 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15018)


Abstract

In this work, we investigate the problem of synthesizing a talking face video that is synchronized with a target speech segment. Although there has been significant progress on this task, the most successful approaches remain identity-agnostic: the results they generate cannot capture the unique speaking characteristics of the target individual. In this paper, we propose an effective framework that enables personalized and high-fidelity lip synchronization. The emphasis of our method lies in data preprocessing and inference. Specifically, we find that convolutional neural networks (CNNs) tend to learn lip motions from facial and throat muscle movements rather than from speech features. Only by masking the regions related to lip shape changes as thoroughly as possible can a CNN be made to associate speech features with lip motions. Given the original head pose information, the features of the input audio signal are solely used to generate realistic images in our model. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results. The code will be made publicly available.
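The masking idea described in the abstract can be illustrated with a short sketch. The code below is not the authors' released implementation; it is a minimal example, assuming OpenCV and NumPy are available and that lower-face landmark points have already been obtained from some external detector, of how pixels correlated with lip shape (mouth, jaw, throat) might be blacked out before frames are fed to the network, so that the generator is forced to recover lip motion from the audio features alone.

import cv2
import numpy as np

def mask_lower_face(frame: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Black out the mouth/jaw/throat area delimited by 2D landmark points.

    frame:     H x W x 3 uint8 image.
    landmarks: N x 2 array of (x, y) points outlining the region to hide,
               e.g. the jawline plus a margin below the chin for the throat.
    """
    masked = frame.copy()
    region = landmarks.astype(np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(masked, [region], color=(0, 0, 0))  # remove lip-correlated pixels
    return masked

# Toy usage with a hand-picked polygon standing in for detected landmarks.
frame = np.full((256, 256, 3), 127, dtype=np.uint8)   # placeholder grey frame
polygon = np.array([[70, 140], [186, 140], [200, 230], [56, 230]])
masked_frame = mask_lower_face(frame, polygon)
cv2.imwrite("masked_frame.png", masked_frame)

How the mask is actually delimited in the paper is not specified in this excerpt; the point of the sketch is only that every pixel region whose motion correlates with lip shape is hidden from the model's visual input.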

Author information

Corresponding author

Correspondence to Xing Bai.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bai, X., Zhou, J., Zhang, P., Hao, R. (2024). Make Audio Solely Drive Lip in Talking Face Video Synthesis. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15018. Springer, Cham. https://doi.org/10.1007/978-3-031-72338-4_24

  • DOI: https://doi.org/10.1007/978-3-031-72338-4_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72337-7

  • Online ISBN: 978-3-031-72338-4

  • eBook Packages: Computer Science (R0)
