
SATFace: Subject Agnostic Talking Face Generation with Natural Head Movement

Published in Neural Processing Letters

Abstract

Talking face generation is widely used in education, entertainment, shopping, and other social practices. Existing methods focus on matching the speaker's mouth shape to the speech content, but research on automatically extracting latent head-motion features from speech is still lacking, so the generated videos lack naturalness. This paper proposes SATFace, a subject-agnostic talking face generation method with natural head movement. To model the complicated and critical features of a talking face (identity, background, mouth shape, head posture, etc.), we build SATFace on an encoder-decoder network architecture. We then design a long short-time feature learning network that better exploits the global and local information in the audio to generate plausible head movement. In addition, we propose a modular training process that improves the effectiveness and efficiency of learning both explicit and implicit features. Experimental comparisons show that SATFace improves the cumulative probability of blur detection by at least about 9.8% and synchronization confidence by about 8.2% over mainstream methods. Mean opinion scores show that SATFace has advantages in lip-sync quality, naturalness of head movement, and video realness.
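The full text is behind the subscription wall, so only the abstract's description of the "long short-time feature learning network" is available here. As a rough, hypothetical sketch of that idea, the PyTorch snippet below pairs a recurrent branch (long-time, global audio context) with a convolutional branch (short-time, local detail) and decodes the fused features into per-frame head-pose angles. Every module name, dimension, and the MFCC input format is an assumption made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LongShortTimeAudioEncoder(nn.Module):
    """Hypothetical long short-time feature learner: an LSTM summarizes the
    global (long-time) context of the audio sequence, while a 1-D convolution
    captures local (short-time) detail; the two streams are concatenated
    per frame. Dimensions are illustrative, not from the paper."""

    def __init__(self, n_mfcc=28, local_dim=64, global_dim=64):
        super().__init__()
        # Local branch: small receptive field over neighboring audio frames.
        self.local = nn.Conv1d(n_mfcc, local_dim, kernel_size=5, padding=2)
        # Global branch: recurrent pass over the whole utterance.
        self.global_rnn = nn.LSTM(n_mfcc, global_dim, batch_first=True)

    def forward(self, mfcc):  # mfcc: (batch, frames, n_mfcc)
        # Conv1d expects (batch, channels, frames), so transpose in and out.
        local = self.local(mfcc.transpose(1, 2)).transpose(1, 2)
        global_feats, _ = self.global_rnn(mfcc)
        return torch.cat([local, global_feats], dim=-1)  # (batch, frames, local_dim + global_dim)

class HeadPoseDecoder(nn.Module):
    """Maps fused audio features to per-frame head pose (yaw, pitch, roll)."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, feats):
        return self.mlp(feats)

# Usage: 100 frames of 28-dim MFCC features -> per-frame head-pose angles.
encoder, decoder = LongShortTimeAudioEncoder(), HeadPoseDecoder()
pose = decoder(encoder(torch.randn(2, 100, 28)))  # shape: (2, 100, 3)
```

Splitting the audio encoder into a local and a global branch is one plausible reading of "long short-time feature learning": the convolution reacts to phoneme-level energy changes, while the LSTM carries sentence-level prosody that plausibly drives slower head movement.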


Data Availability

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.


Author information

Corresponding author

Correspondence to Jian Chen.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, S., Qiao, K., Shi, S. et al. SATFace: Subject Agnostic Talking Face Generation with Natural Head Movement. Neural Process Lett 55, 7529–7542 (2023). https://doi.org/10.1007/s11063-023-11272-7

