DOI: 10.1145/3590003.3590004

Virtual Human Talking-Head Generation


ABSTRACT

Virtual humans created by computers with deep learning technology are widely used in a variety of fields, including personal assistance, intelligent customer service, and online education. One application of virtual humans is the human-computer interaction system, which integrates multi-modal technologies such as speech recognition, dialogue systems, speech synthesis, and virtual-human video synthesis. In this paper, we first design a framework for a human-computer interaction system based on a virtual human; next, we classify talking-head video synthesis models according to the depth of the generated virtual human; finally, we systematically review technical developments in talking-head video generation over the last five years, highlighting seminal work.
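To make the framework concrete, here is a minimal sketch of the interaction loop the abstract describes, with one stage per modality. All names and interfaces below (recognize, respond, synthesize_speech, render_talking_head) are hypothetical stubs chosen for illustration; they are not the paper's implementation or any particular library's API.

```python
# A minimal sketch of the multi-modal interaction loop described in the
# abstract: speech recognition -> dialogue -> speech synthesis -> talking-head
# video. Every function below is a hypothetical stub, not the paper's method.

from dataclasses import dataclass


@dataclass
class Frame:
    """One rendered video frame of the virtual human (placeholder)."""
    index: int


def recognize(user_audio: bytes) -> str:
    """Speech recognition: user audio -> transcript (stub)."""
    return "hello"


def respond(transcript: str) -> str:
    """Dialogue system: transcript -> reply text (stub)."""
    return f"You said: {transcript}"


def synthesize_speech(reply_text: str) -> bytes:
    """Speech synthesis: reply text -> reply audio (stub)."""
    return reply_text.encode("utf-8")


def render_talking_head(reply_audio: bytes) -> list[Frame]:
    """Talking-head synthesis: reply audio -> lip-synced frames (stub)."""
    return [Frame(i) for i in range(len(reply_audio))]


def interaction_turn(user_audio: bytes) -> list[Frame]:
    """One conversational turn through all four pipeline stages."""
    transcript = recognize(user_audio)
    reply = respond(transcript)
    audio = synthesize_speech(reply)
    return render_talking_head(audio)


if __name__ == "__main__":
    frames = interaction_turn(b"\x00\x01")  # stand-in for microphone input
    print(f"rendered {len(frames)} frames")
```

The key design point the abstract implies is that each stage is swappable: any ASR, dialogue, or TTS model can feed the talking-head generator as long as the audio handed to the final stage drives the lip synchronization.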


Published in

            CACML '23: Proceedings of the 2023 2nd Asia Conference on Algorithms, Computing and Machine Learning
            March 2023
            598 pages
ISBN: 9781450399449
DOI: 10.1145/3590003

            Copyright © 2023 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 29 May 2023

            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

CACML '23 paper acceptance rate: 93 of 241 submissions (39%). Overall acceptance rate: 93 of 241 submissions (39%).