ABSTRACT
Virtual humans created with deep learning techniques are now widely used in fields such as personal assistance, intelligent customer service, and online education. One of their applications is the human-computer interaction system, which integrates multi-modal technologies including speech recognition, dialogue systems, speech synthesis, and virtual digital human video synthesis. In this paper, we first design a framework for a virtual-human-based human-computer interaction system; next, we classify talking-head video synthesis models according to the depth of virtual human generation; finally, we systematically review the technical development of talking-head video generation over the past five years, highlighting seminal work.
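The framework described above chains the multi-modal modules in sequence: speech recognition, dialogue, speech synthesis, and talking-head video synthesis. Below is a minimal, hypothetical Python sketch of one interaction turn under that assumption; all names (`InteractionTurn`, `recognize_speech`, `generate_reply`, `synthesize_speech`, `render_talking_head`) are illustrative placeholders, not interfaces defined in the paper.

```python
# Hypothetical sketch of one turn of the multi-modal interaction loop.
# Each module is a placeholder to be backed by a real ASR, dialogue,
# TTS, or talking-head model; none of these names come from the paper.

from dataclasses import dataclass


@dataclass
class InteractionTurn:
    user_audio: bytes         # raw audio captured from the user
    reply_text: str = ""      # text produced by the dialogue system
    reply_audio: bytes = b""  # synthesized speech for the reply
    reply_video: bytes = b""  # talking-head video driven by the reply audio


def recognize_speech(audio: bytes) -> str:
    """Speech recognition: transcribe the user's utterance (placeholder)."""
    raise NotImplementedError


def generate_reply(user_text: str) -> str:
    """Dialogue system: produce a textual response (placeholder)."""
    raise NotImplementedError


def synthesize_speech(text: str) -> bytes:
    """Speech synthesis: convert the response text to audio (placeholder)."""
    raise NotImplementedError


def render_talking_head(audio: bytes) -> bytes:
    """Talking-head synthesis: generate lip-synced video from audio (placeholder)."""
    raise NotImplementedError


def run_turn(user_audio: bytes) -> InteractionTurn:
    """One interaction turn: ASR -> dialogue -> TTS -> talking-head video."""
    turn = InteractionTurn(user_audio=user_audio)
    user_text = recognize_speech(turn.user_audio)
    turn.reply_text = generate_reply(user_text)
    turn.reply_audio = synthesize_speech(turn.reply_text)
    turn.reply_video = render_talking_head(turn.reply_audio)
    return turn
```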
Index Terms
- Virtual Human
- Talking-Head Generation