Abstract
Talking face generation is widely used in education, entertainment, shopping, and other social settings. Existing methods focus on matching the speaker’s mouth shape to the speech content, but research on automatically extracting latent head-motion features from speech remains scarce, resulting in a lack of naturalness. This paper proposes SATFace, a subject-agnostic talking face generation method with natural head movement. To model the complicated and critical features of a talking face (identity, background, mouth shape, head posture, etc.), we construct SATFace with an encoder-decoder as the primary network architecture. We then design a long short-time feature learning network that better exploits the global and local information in the audio to generate plausible head movement. In addition, a modular training process is proposed to improve the effectiveness and efficiency of learning both explicit and implicit features. Experimental comparisons show that SATFace improves on mainstream methods by at least 9.8% in cumulative probability of blur detection and 8.2% in synchronization confidence. Mean opinion scores show that SATFace has advantages in lip-sync quality, naturalness of head movement, and video realness.
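The core idea behind the long short-time feature learning network is that lip shape depends on local (short-time) audio detail while head motion follows longer-range prosody. The paper's actual architecture is not reproduced here; as a minimal illustrative sketch (not the authors' network), the same long/short split can be expressed by aggregating per-frame audio features over two temporal windows:

```python
import numpy as np

def long_short_time_features(mel, short_win=5, long_win=50):
    """Toy long/short-time aggregation over per-frame audio features.

    mel: (T, D) array of frame-level features (e.g. MFCCs).
    Returns a (T, 2*D) array: for each frame, the mean over a short
    local window (mouth-shape detail) concatenated with the mean over
    a long window (global prosody, as a proxy for head-motion cues).
    Window sizes are illustrative, not values from the paper.
    """
    T, D = mel.shape
    out = np.empty((T, 2 * D))
    for t in range(T):
        s0, s1 = max(0, t - short_win), min(T, t + short_win + 1)
        l0, l1 = max(0, t - long_win), min(T, t + long_win + 1)
        out[t, :D] = mel[s0:s1].mean(axis=0)  # local (short-time) context
        out[t, D:] = mel[l0:l1].mean(axis=0)  # global (long-time) context
    return out

# Example: 100 frames of 13-dimensional MFCC-like features
feats = long_short_time_features(np.random.randn(100, 13))
print(feats.shape)  # (100, 26)
```

In the paper itself, learned sequence models play the role of these fixed windows, but the sketch conveys why two time scales are needed: a single short window cannot capture the slow dynamics that make head movement look natural.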
Data Availability
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
About this article
Cite this article
Yang, S., Qiao, K., Shi, S. et al. SATFace: Subject Agnostic Talking Face Generation with Natural Head Movement. Neural Process Lett 55, 7529–7542 (2023). https://doi.org/10.1007/s11063-023-11272-7