ABSTRACT
Virtual humans created with deep learning techniques are now widely used in fields such as personal assistance, intelligent customer service, and online education. One of their applications is the human-computer interaction system, which integrates multi-modal technologies including speech recognition, dialogue systems, speech synthesis, and virtual digital human video synthesis. In this paper, we first design a framework for a virtual-human-based human-computer interaction system; next, we classify talking-head video synthesis models according to the depth of virtual human generation; finally, we systematically review the technical development of talking-head video generation over the past five years, highlighting seminal work.
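The framework described above chains the multi-modal modules in sequence: speech recognition, dialogue, speech synthesis, and talking-head video synthesis. Below is a minimal, hypothetical Python sketch of one interaction turn under that assumption; all names (`InteractionTurn`, `recognize_speech`, `generate_reply`, `synthesize_speech`, `render_talking_head`) are illustrative placeholders, not interfaces defined in the paper.

```python
# Hypothetical sketch of one turn of the multi-modal interaction loop.
# Each module is a placeholder to be backed by a real ASR, dialogue,
# TTS, or talking-head model; none of these names come from the paper.

from dataclasses import dataclass


@dataclass
class InteractionTurn:
    user_audio: bytes         # raw audio captured from the user
    reply_text: str = ""      # text produced by the dialogue system
    reply_audio: bytes = b""  # synthesized speech for the reply
    reply_video: bytes = b""  # talking-head video driven by the reply audio


def recognize_speech(audio: bytes) -> str:
    """Speech recognition: transcribe the user's utterance (placeholder)."""
    raise NotImplementedError


def generate_reply(user_text: str) -> str:
    """Dialogue system: produce a textual response (placeholder)."""
    raise NotImplementedError


def synthesize_speech(text: str) -> bytes:
    """Speech synthesis: convert the response text to audio (placeholder)."""
    raise NotImplementedError


def render_talking_head(audio: bytes) -> bytes:
    """Talking-head synthesis: generate lip-synced video from audio (placeholder)."""
    raise NotImplementedError


def run_turn(user_audio: bytes) -> InteractionTurn:
    """One interaction turn: ASR -> dialogue -> TTS -> talking-head video."""
    turn = InteractionTurn(user_audio=user_audio)
    user_text = recognize_speech(turn.user_audio)
    turn.reply_text = generate_reply(user_text)
    turn.reply_audio = synthesize_speech(turn.reply_text)
    turn.reply_video = render_talking_head(turn.reply_audio)
    return turn
```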
Index Terms
- Virtual Human
- Talking-Head Generation