
Synthesizing Obama: learning lip sync from audio

Published: 20 July 2017

Abstract

Given audio of President Barack Obama, we synthesize a high-quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high-quality mouth texture and composite it, with proper 3D pose matching, to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.
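
As a rough illustration of the abstract's central step (regressing mouth shape from audio), the sketch below trains a small LSTM to map per-frame audio features to a low-dimensional mouth-shape vector. This is a minimal sketch under assumed settings, not the authors' implementation: the feature dimensions, the PCA-coefficient target, and the use of PyTorch are all illustrative choices.

# Minimal sketch (not the paper's implementation): an LSTM that maps a
# sequence of per-frame audio features to per-frame mouth-shape vectors.
# All dimensions below are assumptions made for illustration.
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    def __init__(self, audio_dim=28, hidden_dim=60, mouth_dim=20):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, mouth_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim), e.g. MFCC-style features per video frame
        hidden, _ = self.lstm(audio_feats)
        return self.proj(hidden)           # (batch, time, mouth_dim)

model = AudioToMouth()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Dummy batch standing in for aligned (audio features, mouth shape) training pairs.
audio = torch.randn(4, 100, 28)    # 4 clips, 100 frames, 28-D audio features
mouth = torch.randn(4, 100, 20)    # target mouth-shape coefficients per frame

pred = model(audio)
loss = loss_fn(pred, mouth)
loss.backward()
optimizer.step()

In the full pipeline described above, the predicted per-frame mouth shape would then drive mouth-texture synthesis and compositing into the target video with 3D pose matching; those later stages are not sketched here.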

Supplementary Material

ZIP File (a95-suwajanakorn.zip)
Supplemental files.



Information & Contributors

Information

Published In

ACM Transactions on Graphics, Volume 36, Issue 4
August 2017
2155 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3072959
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2017
Published in TOG Volume 36, Issue 4


Author Tags

  1. LSTM
  2. RNN
  3. audio
  4. audiovisual speech
  5. big data
  6. face synthesis
  7. lip sync
  8. uncanny valley
  9. videos

Qualifiers

  • Research-article

Funding Sources

  • Samsung
  • Google
  • Intel
  • University of Washington Animation Research Labs


Article Metrics

  • Downloads (last 12 months): 409
  • Downloads (last 6 weeks): 46
Reflects downloads up to 17 Jan 2025.


Cited By

  • (2025) Multi-Level Feature Dynamic Fusion Neural Radiance Fields for Audio-Driven Talking Head Generation. Applied Sciences 15(1), 479. DOI: 10.3390/app15010479. Online publication date: 6-Jan-2025.
  • (2025) 3D Facial Tracking and User Authentication Through Lightweight Single-Ear Biosensors. IEEE Transactions on Mobile Computing 24(2), 749-762. DOI: 10.1109/TMC.2024.3470339. Online publication date: 1-Feb-2025.
  • (2025) VPT: Video portraits transformer for realistic talking face generation. Neural Networks 184, 107122. DOI: 10.1016/j.neunet.2025.107122. Online publication date: Apr-2025.
  • (2025) Deepfakes in digital media forensics: Generation, AI-based detection and challenges. Journal of Information Security and Applications 88, 103935. DOI: 10.1016/j.jisa.2024.103935. Online publication date: Feb-2025.
  • (2025) Enhanced deepfake detection with DenseNet and Cross-ViT. Expert Systems with Applications 267, 126150. DOI: 10.1016/j.eswa.2024.126150. Online publication date: Apr-2025.
  • (2024) Deepfake forensics: a survey of digital forensic methods for multimodal deepfake identification on social media. PeerJ Computer Science 10, e2037. DOI: 10.7717/peerj-cs.2037. Online publication date: 27-May-2024.
  • (2024) Research Paper on Introduction of Deepfake. International Journal of Advanced Research in Science, Communication and Technology, 195-200. DOI: 10.48175/IJARSCT-17631. Online publication date: 24-Apr-2024.
  • (2024) Consumer Engagement. In Navigating the World of Deepfake Technology, 397-421. DOI: 10.4018/979-8-3693-5298-4.ch020. Online publication date: 26-Jul-2024.
  • (2024) Audio-Driven Facial Animation with Deep Learning: A Survey. Information 15(11), 675. DOI: 10.3390/info15110675. Online publication date: 28-Oct-2024.
  • (2024) VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization. Electronics 13(18), 3657. DOI: 10.3390/electronics13183657. Online publication date: 14-Sep-2024.