FusionCraft: Fusing Emotion and Identity in Cross-Modal 3D Facial Animation

Lv, Zhenyu; Wang, Xuan; Song, Wenfeng; Hou, Xia

doi:10.1007/978-981-97-5609-4_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14871)

Included in the following conference series:

International Conference on Intelligent Computing

Abstract

Recent advances in speech-driven 3D facial animation have shown promising progress, yet authentically conveying intricate expressiveness, especially emotion and individual identity, remains challenging. Many studies focus on lip synchronization and overlook emotional subtleties and personal uniqueness. To address this gap, we introduce a novel method that generates 3D facial expressions reflecting both emotion and identity, guided by speech and user prompts. Our innovation lies in an emotion-identity fusion mechanism: a pre-trained self-reconstruction codebook derived from diverse emotional facial movements, serving as a benchmark for expressive motion. Prompt words are converted into facial representations that capture emotion and identity and are projected onto 3D templates. Combined with speech audio and a specified emotion, our algorithm animates a 3D avatar that reflects the intended emotion and a unique identity. Our model’s effectiveness is enhanced by advanced autoregression, which unites emotion and identity through a feature fusion module and a tailored loss function. Our approach is thus a robust tool for crafting 3D talking avatars with emotional depth and distinctive identity.
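
The abstract does not say how the codebook is queried or how the fusion module combines features. As a rough illustration only, the sketch below assumes a vector-quantization-style nearest-neighbour lookup for the codebook and plain concatenation for fusion. The names `quantize` and `fuse`, the toy codebook, and the feature values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Replace each feature vector with its nearest codebook entry.

    Assumption: nearest-neighbour lookup under L2 distance, as in generic
    vector quantization. The paper's actual lookup may differ.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]
    return indices, [codebook[i] for i in indices]


def fuse(emotion: Vector, identity: Vector) -> List[float]:
    """Hypothetical stand-in for a fusion module: simple concatenation."""
    return list(emotion) + list(identity)


# Toy data: three codebook entries and three per-frame motion features.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.4]]

indices, quantized = quantize(frames, codebook)
fused = fuse([0.2, 0.7], [1.0, 0.0, 0.0])
print(indices)  # index of the selected codebook entry for each frame
print(fused)    # concatenated emotion and identity features
```

In a learned system the codebook entries would come from training on facial motion data and the fusion step would typically be a trained layer. This sketch only shows the data flow implied by the abstract.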

Acknowledgments

Supported by the Beijing Natural Science Foundation (L232102, 4222024), the National Natural Science Foundation of China (62102036), and the R&D Program of Beijing Municipal Education Commission (KM202211232003). Also supported by the Promoting the Classification and Development of Colleges and Universities-Student Innovation and Entrepreneurship Training Programme Project-School of Computer (5112410852).

Author information

Authors and Affiliations

Computer School, Beijing Information Science and Technology University, Beijing, China
Zhenyu Lv, Xuan Wang, Wenfeng Song & Xia Hou

Corresponding author

Correspondence to Wenfeng Song.

Editor information

Editors and Affiliations

Eastern Institute of Technology, Ningbo, China
De-Shuang Huang
Tianjin University of Science and Technology, Tianjin, China
Chuanlei Zhang
Xiamen University, Xiamen, China
Jiayang Guo

Cite this paper

Lv, Z., Wang, X., Song, W., Hou, X. (2024). FusionCraft: Fusing Emotion and Identity in Cross-Modal 3D Facial Animation. In: Huang, DS., Zhang, C., Guo, J. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14871. Springer, Singapore. https://doi.org/10.1007/978-981-97-5609-4_18

DOI: https://doi.org/10.1007/978-981-97-5609-4_18
Published: 31 July 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5608-7
Online ISBN: 978-981-97-5609-4
eBook Packages: Computer Science; Computer Science (R0)
