
Facial Parameter Splicing: A Novel Approach to Efficient Talking Face Generation

Published: 01 January 2024

Abstract

In recent years, talking face generation has become a popular research area owing to its applications in many fields. However, most current models have high computational demands, which limits their practicality. To address this, some researchers build phoneme-face indexes to generate talking videos quickly and efficiently; but when the training video is too short, mappings cannot be created for all phonemes. To overcome this limitation, we introduce a large-scale phoneme-face dictionary to complete the feature mapping, design a novel method for fast phoneme-face index search, and train a generative adversarial network (GAN) to generate video from phoneme-face sequences. Based on the large-scale dictionary and fast search algorithm, our method completes the phoneme-face mapping with less than 10 seconds of training video of the target person, reducing both the preprocessing and training time for talking-video generation.
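The dictionary-completion idea in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: all function names and data shapes below are hypothetical, and face parameters are stood in for by short numeric lists. It shows the core mechanism only: phonemes observed in the short target-person clip keep their person-specific face parameters, and phonemes the clip never covers are filled in from a large-scale phoneme-face dictionary.

```python
# Hypothetical sketch of completing a phoneme->face-parameter mapping from a
# short clip using a large-scale dictionary. Names and data layouts are
# illustrative assumptions, not the paper's actual API.

def build_person_index(clip_phonemes, clip_face_params):
    """Map each phoneme seen in the short training clip to its face parameters."""
    index = {}
    for ph, params in zip(clip_phonemes, clip_face_params):
        index.setdefault(ph, params)  # keep the first occurrence per phoneme
    return index

def complete_index(person_index, large_dictionary):
    """Fill phonemes absent from the short clip from the large-scale dictionary."""
    completed = dict(person_index)
    for ph, params in large_dictionary.items():
        completed.setdefault(ph, params)  # person-specific entries take priority
    return completed

def phonemes_to_face_sequence(phonemes, index):
    """Look up a face-parameter sequence for an input phoneme sequence."""
    return [index[ph] for ph in phonemes]

# Toy usage: the short clip covers only two phonemes; the dictionary supplies "K".
person = build_person_index(["AA", "B"], [[0.1], [0.2]])
full = complete_index(person, {"AA": [0.9], "B": [0.8], "K": [0.3]})
seq = phonemes_to_face_sequence(["AA", "K", "B"], full)
# seq == [[0.1], [0.3], [0.2]]: "AA" and "B" stay person-specific, "K" is borrowed
```

In the paper the looked-up face-parameter sequence would then drive the GAN to render video frames; that rendering stage is omitted here.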


Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023
745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. Computer Vision
  2. Generative Adversarial Network
  3. Multi-modal Processing
  4. Phoneme
  5. Talking Face

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

MMAsia '23: ACM Multimedia Asia
December 6 - 8, 2023
Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate: 59 of 204 submissions, 29%

