
Facial Parameter Splicing: A Novel Approach to Efficient Talking Face Generation

Published: 01 January 2024

Abstract

In recent years, talking face generation has become a popular research area owing to its applications in many fields. However, most current models have high computational demands, which limits their practicality. To address this, some researchers build phoneme-face indexes to generate talking videos quickly and efficiently; but when the training video is too short, mappings cannot be created for all phonemes. To overcome this limitation, we introduce a large-scale phoneme-face dictionary to complete the feature mapping, design a novel method for fast phoneme-face index search, and train a generative adversarial network (GAN) to generate video from phoneme-face sequences. Based on the large-scale dictionary and fast search algorithm, our method completes the phoneme-face mapping with less than 10 seconds of training video of the target person, reducing both the preprocessing and training time for talking-video generation.
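The dictionary-completion idea in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: all function names and data shapes below are hypothetical, and face parameters are stood in for by short numeric lists. It shows the core mechanism only: phonemes observed in the short target-person clip keep their person-specific face parameters, and phonemes the clip never covers are filled in from a large-scale phoneme-face dictionary.

```python
# Hypothetical sketch of completing a phoneme->face-parameter mapping from a
# short clip using a large-scale dictionary. Names and data layouts are
# illustrative assumptions, not the paper's actual API.

def build_person_index(clip_phonemes, clip_face_params):
    """Map each phoneme seen in the short training clip to its face parameters."""
    index = {}
    for ph, params in zip(clip_phonemes, clip_face_params):
        index.setdefault(ph, params)  # keep the first occurrence per phoneme
    return index

def complete_index(person_index, large_dictionary):
    """Fill phonemes absent from the short clip from the large-scale dictionary."""
    completed = dict(person_index)
    for ph, params in large_dictionary.items():
        completed.setdefault(ph, params)  # person-specific entries take priority
    return completed

def phonemes_to_face_sequence(phonemes, index):
    """Look up a face-parameter sequence for an input phoneme sequence."""
    return [index[ph] for ph in phonemes]

# Toy usage: the short clip covers only two phonemes; the dictionary supplies "K".
person = build_person_index(["AA", "B"], [[0.1], [0.2]])
full = complete_index(person, {"AA": [0.9], "B": [0.8], "K": [0.3]})
seq = phonemes_to_face_sequence(["AA", "K", "B"], full)
# seq == [[0.1], [0.3], [0.2]]: "AA" and "B" stay person-specific, "K" is borrowed
```

In the paper the looked-up face-parameter sequence would then drive the GAN to render video frames; that rendering stage is omitted here.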


Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023
745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. Computer Vision
  2. Generative Adversarial Network
  3. Multi-modal Processing
  4. Phoneme
  5. Talking Face

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

MMAsia '23: ACM Multimedia Asia
December 6 - 8, 2023
Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate: 59 of 204 submissions, 29%

