Abstract
The synthesis of talking heads with high fidelity, accurate lip synchronization, emotion control, and high efficiency has attracted considerable research interest in recent years. Although some recent NeRF-based methods can produce high-fidelity videos in real time, they remain constrained by computational resources and struggle to achieve accurate emotion control. To tackle these challenges, we propose Emo-Gaussian, a talking-head generation method based on 3D Gaussian Splatting. In our method, a Gaussian field is used to model a specific character. We condition the opacity and color of the Gaussians on audio and emotion inputs and dynamically render and optimize them, thereby effectively modeling the dynamic variations of the talking head. For the emotion input, we introduce an emotion control module that uses a pre-trained CLIP model to extract emotional priors from images of individuals; these priors are then integrated through an attention mechanism to guide the generation of emotional talking heads. Quantitative and qualitative experiments demonstrate that our method outperforms previous approaches in image quality, lip synchronization, and emotion control, while remaining highly efficient compared with prior state-of-the-art methods.
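The abstract outlines the core conditioning idea: per-Gaussian opacity and color are driven by audio features and a CLIP-derived emotion prior fused through attention. Below is a minimal, illustrative PyTorch sketch of that idea. It is not the authors' implementation: all module names, feature dimensions, and the placeholder emotion encoder (standing in for a frozen CLIP image encoder) are assumptions introduced here for illustration.

```python
# Minimal sketch (not the authors' code): condition per-Gaussian opacity and
# color on audio features and an emotion prior, roughly following the pipeline
# described in the abstract. Dimensions and module names are assumptions.
import torch
import torch.nn as nn


class EmotionPrior(nn.Module):
    """Hypothetical stand-in for the CLIP-based emotion prior extractor."""
    def __init__(self, dim=128):
        super().__init__()
        # In the paper this would be a frozen pre-trained CLIP image encoder;
        # here a single linear layer maps a dummy 512-d image embedding.
        self.proj = nn.Linear(512, dim)

    def forward(self, clip_image_embed):           # (B, 512)
        return self.proj(clip_image_embed)          # (B, dim)


class AudioEmotionConditioner(nn.Module):
    """Fuses audio features with the emotion prior via attention, then predicts
    per-Gaussian opacity/color offsets (a sketch of the conditioning idea)."""
    def __init__(self, audio_dim=64, emo_dim=128, hidden=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.emo_proj = nn.Linear(emo_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.opacity_head = nn.Linear(hidden, 1)    # opacity offset per Gaussian
        self.color_head = nn.Linear(hidden, 3)      # RGB offset per Gaussian

    def forward(self, audio_feat, emo_prior, gaussian_feat):
        # audio_feat:    (B, T, audio_dim)  per-frame audio features
        # emo_prior:     (B, emo_dim)       emotion prior from the image encoder
        # gaussian_feat: (B, N, hidden)     learnable per-Gaussian features
        a = self.audio_proj(audio_feat)                      # (B, T, hidden)
        e = self.emo_proj(emo_prior).unsqueeze(1)            # (B, 1, hidden)
        cond = torch.cat([a, e], dim=1)                      # audio + emotion tokens
        # Each Gaussian queries the audio/emotion tokens via cross-attention.
        fused, _ = self.attn(gaussian_feat, cond, cond)      # (B, N, hidden)
        d_opacity = torch.sigmoid(self.opacity_head(fused))  # (B, N, 1)
        d_color = torch.tanh(self.color_head(fused))         # (B, N, 3)
        return d_opacity, d_color


if __name__ == "__main__":
    B, T, N = 1, 8, 1024                     # batch, audio frames, Gaussians
    prior = EmotionPrior()(torch.randn(B, 512))
    model = AudioEmotionConditioner()
    d_op, d_col = model(torch.randn(B, T, 64), prior, torch.randn(B, N, 128))
    print(d_op.shape, d_col.shape)           # (1, 1024, 1), (1, 1024, 3)
```

In an actual 3DGS pipeline, such predicted offsets would modulate the stored Gaussian attributes before the splatting-based rasterization step; the sketch only shows the conditioning pathway.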
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grants 62276016 and 62372029.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Liu, T., Li, J., Bai, X., Zheng, J. (2025). Efficient Emotional Talking Head Generation via Dynamic 3D Gaussian Rendering. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15036. Springer, Singapore. https://doi.org/10.1007/978-981-97-8508-7_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8507-0
Online ISBN: 978-981-97-8508-7
eBook Packages: Computer Science (R0)