Abstract
Generating talking head videos from a face image and a speech audio clip has attracted widespread interest. Existing talking face synthesis methods typically lack the ability to generate manipulable facial details and pupils, which is desirable for producing stylized facial expressions. We present ManiTalk, the first manipulable audio-driven talking head generation system. Our system consists of three stages. In the first stage, the proposed Exp Generator and Pose Generator produce synchronized talking landmarks and presentation-style head poses. In the second stage, we parameterize the positions of the eyebrows, eyelids, and pupils, enabling personalized and straightforward manipulation of facial details. In the last stage, we introduce SFWNet, which warps the facial image according to the landmark motions; additional driving sketches can be supplied to generate more precise expressions. Extensive quantitative and qualitative evaluations, along with user studies, demonstrate that the system accurately manipulates facial details and achieves excellent lip synchronization, reaching state-of-the-art performance in identity preservation and video quality. Code is available at https://github.com/shanzhajuan/ManiTalk.
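For a concrete picture of the data flow, the three-stage pipeline summarized above can be sketched as follows. This is a minimal illustrative outline, not the released implementation: apart from the module names Exp Generator, Pose Generator, and SFWNet, which come from the abstract, every function signature, landmark index range, and tensor shape below is an assumption made here for clarity.

```python
# Illustrative sketch of the three-stage ManiTalk pipeline described in the abstract.
# All signatures, shapes, and landmark index ranges are assumptions, not the
# released code (see https://github.com/shanzhajuan/ManiTalk).
import numpy as np

N_LANDMARKS = 68        # assumed facial landmark count
AUDIO_DIM = 768         # assumed per-frame audio feature size (e.g. wav2vec 2.0)

def exp_generator(audio_feat):
    """Stage 1a: map per-frame audio features to talking landmark positions."""
    T = audio_feat.shape[0]
    return np.zeros((T, N_LANDMARKS, 2))          # placeholder network output

def pose_generator(audio_feat):
    """Stage 1b: map audio features to presentation-style head poses
    (yaw, pitch, roll, tx, ty, tz)."""
    T = audio_feat.shape[0]
    return np.zeros((T, 6))                       # placeholder network output

def manipulate_details(landmarks, brow_raise=0.0, eyelid_open=1.0, pupil_shift=(0.0, 0.0)):
    """Stage 2: edit the parameterized eyebrow / eyelid / pupil positions.
    The index ranges below are hypothetical eyebrow and eye regions."""
    out = landmarks.copy()
    out[:, 17:27, 1] -= brow_raise                # eyebrows: vertical offset
    out[:, 36:48, 1] *= eyelid_open               # eyelids: opening scale
    out[:, 36:48, :] += np.asarray(pupil_shift)   # pupils: in-plane shift
    return out

def sfw_net(source_image, landmark_motion, driving_sketch=None):
    """Stage 3: warp the source face image according to the landmark motion,
    optionally conditioned on an extra driving sketch for finer expressions."""
    T = landmark_motion.shape[0]
    return np.repeat(source_image[None], T, axis=0)   # placeholder warped frames

def manitalk(source_image, audio_feat, **detail_edits):
    lm = exp_generator(audio_feat)                # stage 1: landmarks from audio
    pose = pose_generator(audio_feat)             # stage 1: head pose from audio
    lm = manipulate_details(lm, **detail_edits)   # stage 2: user-controlled details
    return sfw_net(source_image, lm), pose        # stage 3: image warping

frames, poses = manitalk(np.zeros((256, 256, 3)), np.zeros((100, AUDIO_DIM)), brow_raise=2.0)
print(frames.shape, poses.shape)                  # (100, 256, 256, 3) (100, 6)
```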










Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2022YFF0902303), the Beijing Municipal Science & Technology Commission and the Administrative Commission of Zhongguancun Science Park (No. Z221100007722002), and the National Natural Science Foundation of China (No. 62072036).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (mp4 28231 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fang, H., Weng, D., Tian, Z. et al. Manitalk: manipulable talking head generation from single image in the wild. Vis Comput 40, 4913–4925 (2024). https://doi.org/10.1007/s00371-024-03490-4