Make Audio Solely Drive Lip in Talking Face Video Synthesis

  • Conference paper
  • Part of the conference proceedings: Artificial Neural Networks and Machine Learning – ICANN 2024 (ICANN 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15018)


Abstract

In this work, we investigate the problem of synthesizing a talking face video that is synchronized with a target speech segment. Although there has been significant progress on this task, the most successful approaches remain identity-agnostic: the results they generate cannot capture the unique speaking characteristics of the target individual. In this paper, we propose an effective framework that enables personalized and high-fidelity lip synchronization. The emphasis of our method lies in data preprocessing and inference. Specifically, we find that convolutional neural networks (CNNs) tend to learn lip motions from facial and throat muscle movements rather than from speech features. Only by masking the regions related to lip shape changes as thoroughly as possible can a CNN be made to associate speech features with lip motions. Given the original head pose information, the features of the input audio signal are solely used to generate realistic images in our model. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results. The code will be made publicly available.
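The masking idea described in the abstract can be illustrated with a short sketch. The code below is not the authors' released implementation; it is a minimal example, assuming OpenCV and NumPy are available and that lower-face landmark points have already been obtained from some external detector, of how pixels correlated with lip shape (mouth, jaw, throat) might be blacked out before frames are fed to the network, so that the generator is forced to recover lip motion from the audio features alone.

import cv2
import numpy as np

def mask_lower_face(frame: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Black out the mouth/jaw/throat area delimited by 2D landmark points.

    frame:     H x W x 3 uint8 image.
    landmarks: N x 2 array of (x, y) points outlining the region to hide,
               e.g. the jawline plus a margin below the chin for the throat.
    """
    masked = frame.copy()
    region = landmarks.astype(np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(masked, [region], color=(0, 0, 0))  # remove lip-correlated pixels
    return masked

# Toy usage with a hand-picked polygon standing in for detected landmarks.
frame = np.full((256, 256, 3), 127, dtype=np.uint8)   # placeholder grey frame
polygon = np.array([[70, 140], [186, 140], [200, 230], [56, 230]])
masked_frame = mask_lower_face(frame, polygon)
cv2.imwrite("masked_frame.png", masked_frame)

How the mask is actually delimited in the paper is not specified in this excerpt; the point of the sketch is only that every pixel region whose motion correlates with lip shape is hidden from the model's visual input.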

Author information

Corresponding author

Correspondence to Xing Bai.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bai, X., Zhou, J., Zhang, P., Hao, R. (2024). Make Audio Solely Drive Lip in Talking Face Video Synthesis. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15018. Springer, Cham. https://doi.org/10.1007/978-3-031-72338-4_24

  • DOI: https://doi.org/10.1007/978-3-031-72338-4_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72337-7

  • Online ISBN: 978-3-031-72338-4

  • eBook Packages: Computer Science (R0)
