
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Published: 19 July 2024

Abstract

The generation of stylistic 3D facial animations driven by speech poses a significant challenge, as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for the speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization. In this paper, we propose DiffPoseTalk, a generative framework that combines a diffusion model with a style encoder extracting style embeddings from short reference videos. During inference, we employ classifier-free guidance to steer the generation process according to the speech and style. In particular, our style encoding also covers head poses, whose generation enhances the perceived naturalness of the animation. Additionally, we address the shortage of scanned 3D talking-face data by training our model on 3DMM parameters reconstructed from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and a user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are available at https://diffposetalk.github.io.
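Concretely, classifier-free guidance trains a single denoising network with randomly dropped conditions and, at sampling time, extrapolates from the less-conditioned predictions toward the fully conditioned one. Below is a minimal sketch of one such guided sampling step in PyTorch; the denoiser interface, the argument names (speech_feat, style_emb), and the guidance scales are illustrative assumptions, not the paper's released implementation.

import torch

@torch.no_grad()
def guided_denoise(denoiser, x_t, t, speech_feat, style_emb,
                   w_speech=1.5, w_style=2.0):
    # x_t: noisy 3DMM-parameter sequence at diffusion step t.
    # Unconditional prediction: both conditions dropped (null tokens).
    eps_uncond = denoiser(x_t, t, speech=None, style=None)
    # Speech-conditioned prediction, style still dropped.
    eps_speech = denoiser(x_t, t, speech=speech_feat, style=None)
    # Fully conditioned prediction on both speech and style.
    eps_full = denoiser(x_t, t, speech=speech_feat, style=style_emb)
    # Two-level guidance: push the estimate away from the less-conditioned
    # predictions, weighting the speech and style conditions independently.
    return (eps_uncond
            + w_speech * (eps_speech - eps_uncond)
            + w_style * (eps_full - eps_speech))

During training, such a scheme typically replaces each condition with a learned null token with some probability (commonly around 10%), so that the unconditional, speech-only, and fully conditioned predictions all come from the same network.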

Supplementary Material

ZIP File (papers_980.zip), supplemental material.




    Published In

    ACM Transactions on Graphics, Volume 43, Issue 4
    July 2024, 1774 pages
    EISSN: 1557-7368
    DOI: 10.1145/3675116

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 July 2024
    Published in TOG Volume 43, Issue 4


    Author Tags

    1. speech-driven animation
    2. facial animation
    3. diffusion models

    Qualifiers

    • Research-article


    Cited By

    • ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE. In Proceedings of the 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games (2024), 1-12. DOI: 10.1145/3677388.3696320. Online publication date: 21-Nov-2024.
    • ScenePhotographer: Object-Oriented Photography for Residential Scenes. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7843-7851. DOI: 10.1145/3664647.3680942. Online publication date: 28-Oct-2024.
    • Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance. In ACM SIGGRAPH Conference Papers '24 (2024), 1-13. DOI: 10.1145/3641519.3657413. Online publication date: 13-Jul-2024.
    • Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 27284-27293. DOI: 10.1109/CVPR52733.2024.02577. Online publication date: 16-Jun-2024.
    • FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21263-21273. DOI: 10.1109/CVPR52733.2024.02009. Online publication date: 16-Jun-2024.
    • A highly naturalistic facial expression generation method with embedded vein features based on diffusion model. Measurement Science and Technology 36(1), 015411 (2024). DOI: 10.1088/1361-6501/ad866f. Online publication date: 24-Oct-2024.
