
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Published: 19 July 2024

Abstract

The generation of stylistic 3D facial animations driven by speech poses a significant challenge, as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for the speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization. In this paper, we propose DiffPoseTalk, a generative framework that combines a diffusion model with a style encoder extracting style embeddings from short reference videos. During inference, we employ classifier-free guidance to steer the generation process according to the speech and style. In particular, our style encoding also covers head poses, whose generation enhances the perceived naturalness of the animation. Additionally, we address the shortage of scanned 3D talking-face data by training our model on 3DMM parameters reconstructed from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and a user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are available at https://diffposetalk.github.io.
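Concretely, classifier-free guidance trains a single denoising network with randomly dropped conditions and, at sampling time, extrapolates from the less-conditioned predictions toward the fully conditioned one. Below is a minimal sketch of one such guided sampling step in PyTorch; the denoiser interface, the argument names (speech_feat, style_emb), and the guidance scales are illustrative assumptions, not the paper's released implementation.

import torch

@torch.no_grad()
def guided_denoise(denoiser, x_t, t, speech_feat, style_emb,
                   w_speech=1.5, w_style=2.0):
    # x_t: noisy 3DMM-parameter sequence at diffusion step t.
    # Unconditional prediction: both conditions dropped (null tokens).
    eps_uncond = denoiser(x_t, t, speech=None, style=None)
    # Speech-conditioned prediction, style still dropped.
    eps_speech = denoiser(x_t, t, speech=speech_feat, style=None)
    # Fully conditioned prediction on both speech and style.
    eps_full = denoiser(x_t, t, speech=speech_feat, style=style_emb)
    # Two-level guidance: push the estimate away from the less-conditioned
    # predictions, weighting the speech and style conditions independently.
    return (eps_uncond
            + w_speech * (eps_speech - eps_uncond)
            + w_style * (eps_full - eps_speech))

During training, such a scheme typically replaces each condition with a learned null token with some probability (commonly around 10%), so that the unconditional, speech-only, and fully conditioned predictions all come from the same network.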

Supplementary Material

ZIP File (papers_980.zip), supplemental material.




    Published In

    ACM Transactions on Graphics, Volume 43, Issue 4
    July 2024, 1774 pages
    EISSN: 1557-7368
    DOI: 10.1145/3675116

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 July 2024
    Published in TOG Volume 43, Issue 4


    Author Tags

    1. speech-driven animation
    2. facial animation
    3. diffusion models

    Qualifiers

    • Research-article


    Cited By

    • ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE. In Proceedings of the 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games (2024), 1-12. DOI: 10.1145/3677388.3696320. Online publication date: 21-Nov-2024.
    • ScenePhotographer: Object-Oriented Photography for Residential Scenes. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7843-7851. DOI: 10.1145/3664647.3680942. Online publication date: 28-Oct-2024.
    • Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance. In ACM SIGGRAPH Conference Papers '24 (2024), 1-13. DOI: 10.1145/3641519.3657413. Online publication date: 13-Jul-2024.
    • Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 27284-27293. DOI: 10.1109/CVPR52733.2024.02577. Online publication date: 16-Jun-2024.
    • FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21263-21273. DOI: 10.1109/CVPR52733.2024.02009. Online publication date: 16-Jun-2024.
    • A highly naturalistic facial expression generation method with embedded vein features based on diffusion model. Measurement Science and Technology 36(1), 015411 (2024). DOI: 10.1088/1361-6501/ad866f. Online publication date: 24-Oct-2024.
