Talking-head video generation with long short-term contextual semantics

Published in Applied Intelligence

Abstract

One-shot talking-head video generation uses a face-appearance source image and a series of motions extracted from driving frames to produce a coherent video. Most existing methods rely solely on the source image to generate frames over long time intervals, which leads to detail loss and distorted images due to semantic mismatch. Short-term semantics extracted from previously generated frames, which are temporally consistent, can compensate for the mismatches of long-term semantics. In this paper, we propose a talking-head generation method that exploits long short-term contextual semantics. First, the cross-entropy between a real frame and a frame generated with long short-term semantics is modeled mathematically. Then, a novel semi-autoregressive GAN is proposed to efficiently avoid semantic mismatch by combining complementary long-term semantics with autoregressively extracted short-term semantics. Moreover, a short-term semantics enhancement module is proposed to suppress noise in the autoregressive pipeline and reinforce the fusion of long- and short-term semantics. Extensive experiments demonstrate that our method generates detailed and refined frames and outperforms existing methods, particularly under large motion changes.
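To make the pipeline described in the abstract concrete, the following PyTorch-style sketch illustrates a semi-autoregressive rollout in which the source image supplies long-term semantics for every frame while the previously generated frame supplies short-term semantics for the next one. It is a minimal illustration under our own assumptions, not the authors' implementation; all module and function names (e.g. `ShortTermEnhancer`, `generate_video`) are hypothetical.

```python
# Minimal sketch (not the authors' code) of a semi-autoregressive talking-head
# generation loop with long- and short-term semantics. All names are
# hypothetical placeholders for the components described in the abstract.
import torch
import torch.nn as nn


class ShortTermEnhancer(nn.Module):
    """Hypothetical enhancement module: suppresses autoregressive noise and
    fuses short-term features with the long-term (source-image) features."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, long_feat: torch.Tensor, short_feat: torch.Tensor) -> torch.Tensor:
        # Residual fusion: denoised short-term detail added onto the long-term base.
        return long_feat + self.fuse(torch.cat([long_feat, short_feat], dim=1))


def generate_video(encoder, generator, enhancer, source_img, driving_motions):
    """Semi-autoregressive rollout: long-term semantics come from the fixed
    source image, short-term semantics from the previous *generated* frame."""
    long_feat = encoder(source_img)        # long-term semantics (computed once)
    prev_frame = source_img                # bootstrap the autoregression
    frames = []
    for motion in driving_motions:         # one driving-motion descriptor per frame
        short_feat = encoder(prev_frame)   # short-term semantics (updated each step)
        fused = enhancer(long_feat, short_feat)
        frame = generator(fused, motion)   # decode the fused features under the motion
        frames.append(frame)
        prev_frame = frame.detach()        # feed the new frame back without gradients
    return torch.stack(frames, dim=1)      # (B, T, C, H, W)
```

Detaching the previous frame before feeding it back is only one plausible way to keep such a rollout stable during training; the paper's actual training procedure is not described in this excerpt.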

Data Availability

The data supporting the findings of this study are available in public repositories.

Code Availability

Any code used in this research is available upon request.

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Zhao Jing, Hongxia Bie and Jiali Wang. The first draft of the manuscript was written by Zhao Jing and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hongxia Bie.

Ethics declarations

Competing interests

The authors declare no competing interests relevant to the content of this article.

Consent

Informed consent was obtained from all participants involved in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jing, Z., Bie, H., Wang, J. et al. Talking-head video generation with long short-term contextual semantics. Appl Intell 55, 120 (2025). https://doi.org/10.1007/s10489-024-06010-y

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10489-024-06010-y

Keywords