Abstract
Talking face generation is widely used in education, entertainment, shopping, and other social settings. Existing methods focus on matching the speaker’s mouth shape to the speech content, but research on automatically extracting latent head-motion features from speech remains scarce, resulting in a lack of naturalness. This paper proposes SATFace, a subject-agnostic talking face generation method with natural head movement. To model the complicated and critical features of a talking face (identity, background, mouth shape, head posture, etc.), we construct SATFace with an encoder-decoder as the primary network architecture. We then design a long short-time feature learning network that better exploits the global and local information in the audio to generate plausible head movement. In addition, a modular training process is proposed to improve the effectiveness and efficiency of learning both explicit and implicit features. Experimental comparisons show that SATFace improves on mainstream methods by at least 9.8% in cumulative probability of blur detection and 8.2% in synchronization confidence. Mean opinion scores show that SATFace has advantages in lip-sync quality, naturalness of head movement, and video realness.
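The core idea behind the long short-time feature learning network is that lip shape depends on local (short-time) audio detail while head motion follows longer-range prosody. The paper's actual architecture is not reproduced here; as a minimal illustrative sketch (not the authors' network), the same long/short split can be expressed by aggregating per-frame audio features over two temporal windows:

```python
import numpy as np

def long_short_time_features(mel, short_win=5, long_win=50):
    """Toy long/short-time aggregation over per-frame audio features.

    mel: (T, D) array of frame-level features (e.g. MFCCs).
    Returns a (T, 2*D) array: for each frame, the mean over a short
    local window (mouth-shape detail) concatenated with the mean over
    a long window (global prosody, as a proxy for head-motion cues).
    Window sizes are illustrative, not values from the paper.
    """
    T, D = mel.shape
    out = np.empty((T, 2 * D))
    for t in range(T):
        s0, s1 = max(0, t - short_win), min(T, t + short_win + 1)
        l0, l1 = max(0, t - long_win), min(T, t + long_win + 1)
        out[t, :D] = mel[s0:s1].mean(axis=0)  # local (short-time) context
        out[t, D:] = mel[l0:l1].mean(axis=0)  # global (long-time) context
    return out

# Example: 100 frames of 13-dimensional MFCC-like features
feats = long_short_time_features(np.random.randn(100, 13))
print(feats.shape)  # (100, 26)
```

In the paper itself, learned sequence models play the role of these fixed windows, but the sketch conveys why two time scales are needed: a single short window cannot capture the slow dynamics that make head movement look natural.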
Data Availability
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
About this article
Cite this article
Yang, S., Qiao, K., Shi, S. et al. SATFace: Subject Agnostic Talking Face Generation with Natural Head Movement. Neural Process Lett 55, 7529–7542 (2023). https://doi.org/10.1007/s11063-023-11272-7