Abstract
One-shot talking-head video generation takes a source image supplying facial appearance and a series of motions extracted from driving frames to produce a coherent video. Most existing methods rely solely on the source image to generate frames over long time intervals, which leads to detail loss and distorted images due to semantic mismatch. Short-term semantics extracted from previously generated frames, which are temporally consistent, can complement the mismatches of long-term semantics. In this paper, we propose a talking-head generation method that exploits long short-term contextual semantics. First, the cross-entropy between a real frame and a frame generated with long short-term semantics is mathematically modeled. Then, a novel semi-autoregressive GAN is proposed that efficiently avoids semantic mismatch by fusing complementary long-term semantics with autoregressively extracted short-term semantics. Moreover, a short-term semantics enhancement module is proposed to suppress noise in the autoregressive pipeline and to reinforce the fusion of the long short-term semantics. Extensive experiments demonstrate that our method generates detailed and refined frames and outperforms existing methods, particularly under large motion changes.
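To make the semi-autoregressive idea concrete, the sketch below shows one way such a generation loop could be organized. This is a minimal, hypothetical PyTorch-style illustration under our own assumptions, not the authors' implementation: the module names (long_enc, short_enc, enhancer, decoder) are placeholders standing in for the paper's long-term encoder, autoregressive short-term encoder, short-term semantics enhancement module, and generator.

```python
import torch
import torch.nn as nn


class SemiAutoregressiveGenerator(nn.Module):
    """Hypothetical sketch: fuses long-term semantics from the source
    image with short-term semantics from previously generated frames."""

    def __init__(self, long_enc: nn.Module, short_enc: nn.Module,
                 enhancer: nn.Module, decoder: nn.Module):
        super().__init__()
        self.long_enc = long_enc    # long-term semantics from the source image
        self.short_enc = short_enc  # short-term semantics from prior outputs
        self.enhancer = enhancer    # suppresses noise in the autoregressive loop
        self.decoder = decoder      # renders the next frame

    def forward(self, source: torch.Tensor, motions: list) -> torch.Tensor:
        # Long-term semantics are extracted once from the source image
        # and reused for every frame in the clip.
        long_sem = self.long_enc(source)
        frames, prev = [], source
        for motion in motions:
            # Short-term semantics come autoregressively from the last
            # generated frame; the enhancer damps accumulated noise.
            short_sem = self.enhancer(self.short_enc(prev))
            # Fuse the complementary long- and short-term semantics
            # together with the driving motion to render the next frame.
            frame = self.decoder(long_sem, short_sem, motion)
            frames.append(frame)
            prev = frame.detach()  # cut the graph to keep training tractable
        return torch.stack(frames, dim=1)  # (batch, time, channels, H, W)
```

The key design point this sketch tries to capture is that the first frame is conditioned only on the source image, while every later frame additionally sees the previous output, so short-term detail propagates without the generator drifting away from the long-term identity.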
Data Availability
The data supporting the findings of this study are available in public repositories.
Code Availability
The code used in this research is available upon request.
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Zhao Jing, Hongxia Bie and Jiali Wang. The first draft of the manuscript was written by Zhao Jing and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests relevant to the content of this article.
Consent
Informed consent was obtained from all participants involved in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jing, Z., Bie, H., Wang, J. et al. Talking-head video generation with long short-term contextual semantics. Appl Intell 55, 120 (2025). https://doi.org/10.1007/s10489-024-06010-y