Research Article · DOI: 10.1145/3550469.3555399

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

Published: 30 November 2022

Abstract

We present VideoReTalking, a new system for editing the faces in a real-world talking-head video according to an input audio track, producing a high-quality, lip-synced output video even when the source video carries a different emotion. Our system decomposes this objective into three sequential tasks: (1) face video generation with a canonical expression; (2) audio-driven lip synchronization; and (3) face enhancement for improving photo-realism. Given a talking-head video, we first modify the expression of each frame according to the same expression template using an expression-editing network, resulting in a video with a canonical expression. This video, together with the given audio, is then fed into the lip-sync network to generate a lip-synced video. Finally, we improve the photo-realism of the synthesized faces through an identity-aware face enhancement network and post-processing. We use learning-based approaches for all three steps, and the modules run as a sequential pipeline without any user intervention. Furthermore, our system is a generic approach that does not need to be retrained for a specific person. Evaluations on two widely used datasets and on in-the-wild examples demonstrate the superiority of our framework over other state-of-the-art methods in terms of lip-sync accuracy and visual quality.
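
To make the three-stage design concrete, below is a minimal, hypothetical sketch of how such a sequential pipeline could be orchestrated in Python. The module names (expression_editor, lip_sync_net, face_enhancer) are illustrative placeholders for the three networks described in the abstract, not the authors' released API.

from typing import Callable, List, Sequence

import numpy as np


def retalk_video(
    frames: Sequence[np.ndarray],      # input talking-head frames (H x W x 3)
    audio_features: np.ndarray,        # per-frame audio features (e.g. mel-spectrogram windows)
    expression_editor: Callable,       # stage 1: canonical-expression editing network (placeholder)
    lip_sync_net: Callable,            # stage 2: audio-driven lip-sync network (placeholder)
    face_enhancer: Callable,           # stage 3: identity-aware face enhancement network (placeholder)
) -> List[np.ndarray]:
    """Sequential pipeline: expression editing -> lip-sync -> enhancement."""
    # (1) Drive every frame toward the same canonical expression template so the
    #     lip-sync stage starts from a consistent, neutral mouth/expression state.
    canonical = [expression_editor(frame) for frame in frames]

    # (2) Generate lip motion conditioned on the input audio.
    synced = lip_sync_net(canonical, audio_features)

    # (3) Restore identity details and photo-realism, then (as post-processing)
    #     blend the enhanced face region back into the original frame.
    return [face_enhancer(frame) for frame in synced]

Because each stage consumes only the previous stage's output, the whole pipeline can run without user intervention and, as noted in the abstract, without retraining for a specific person.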

Supplemental Material

MP4 File: Presentation
ZIP File: Appendix and Demo Video
ZIP File: Appendix


      Published In

      SA '22: SIGGRAPH Asia 2022 Conference Papers
      November 2022
      482 pages
      ISBN: 9781450394703
      DOI: 10.1145/3550469
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 November 2022


      Author Tags

      1. Audio-driven Generation
      2. Facial Animation
      3. Video Synthesis

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Conference

      SA '22: SIGGRAPH Asia 2022
      December 6–9, 2022
      Daegu, Republic of Korea

      Acceptance Rates

      Overall Acceptance Rate 178 of 869 submissions, 20%


      Article Metrics

      • Downloads (last 12 months): 157
      • Downloads (last 6 weeks): 8
      Reflects downloads up to 17 Feb 2025

      Cited By
      • (2025) Speech-Driven Facial Generation. Computer Science and Application 15(1), 199–208. DOI: 10.12677/csa.2025.151020
      • (2024) Video and Audio Deepfake Datasets and Open Issues in Deepfake Technology: Being Ahead of the Curve. Forensic Sciences 4(3), 289–377. DOI: 10.3390/forensicsci4030021. Online: 13-Jul-2024
      • (2024) VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization. Electronics 13(18), 3657. DOI: 10.3390/electronics13183657. Online: 14-Sep-2024
      • (2024) Digital Avatars. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 8780–8783. DOI: 10.24963/ijcai.2024/1031. Online: 3-Aug-2024
      • (2024) PersonaTalk: Bring Attention to Your Persona in Visual Dubbing. SIGGRAPH Asia 2024 Conference Papers, 1–9. DOI: 10.1145/3680528.3687618. Online: 3-Dec-2024
      • (2024) EmoSpaceTime: Decoupling Emotion and Content through Contrastive Learning for Expressive 3D Speech Animation. Proceedings of the 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games, 1–12. DOI: 10.1145/3677388.3696336. Online: 21-Nov-2024
      • (2024) ListenFormer: Responsive Listening Head Generation with Non-autoregressive Transformers. Proceedings of the 32nd ACM International Conference on Multimedia, 7094–7103. DOI: 10.1145/3664647.3681182. Online: 28-Oct-2024
      • (2024) SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing. Proceedings of the 32nd ACM International Conference on Multimedia, 3170–3179. DOI: 10.1145/3664647.3681108. Online: 28-Oct-2024
      • (2024) CMFF-Face: Attention-Based Cross-Modal Feature Fusion for High-Quality Audio-Driven Talking Face Generation. Proceedings of the 2024 International Conference on Multimedia Retrieval, 101–110. DOI: 10.1145/3652583.3658055. Online: 30-May-2024
      • (2024) F2Key: Dynamically Converting Your Face into a Private Key Based on COTS Headphones for Reliable Voice Interaction. Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services, 127–140. DOI: 10.1145/3643832.3661860. Online: 3-Jun-2024
