Abstract
One-shot talking-head video generation takes a source image supplying facial appearance and a series of motions extracted from driving frames to produce a coherent video. Most existing methods rely solely on the source image to generate frames over long time intervals, which leads to detail loss and distorted images due to semantic mismatch. Short-term semantics extracted from previously generated frames, which are temporally consistent, can complement the mismatches of long-term semantics. In this paper, we propose a talking-head generation method that exploits long short-term contextual semantics. First, the cross-entropy between a real frame and a frame generated with long short-term semantics is mathematically modeled. Then, a novel semi-autoregressive GAN is proposed that efficiently avoids semantic mismatch by fusing complementary long-term semantics with autoregressively extracted short-term semantics. Moreover, a short-term semantics enhancement module is proposed to suppress noise in the autoregressive pipeline and to reinforce the fusion of the long short-term semantics. Extensive experiments demonstrate that our method generates detailed and refined frames and outperforms existing methods, particularly under large motion changes.
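To make the semi-autoregressive idea concrete, the sketch below shows one way such a generation loop could be organized. This is a minimal, hypothetical PyTorch-style illustration under our own assumptions, not the authors' implementation: the module names (long_enc, short_enc, enhancer, decoder) are placeholders standing in for the paper's long-term encoder, autoregressive short-term encoder, short-term semantics enhancement module, and generator.

```python
import torch
import torch.nn as nn


class SemiAutoregressiveGenerator(nn.Module):
    """Hypothetical sketch: fuses long-term semantics from the source
    image with short-term semantics from previously generated frames."""

    def __init__(self, long_enc: nn.Module, short_enc: nn.Module,
                 enhancer: nn.Module, decoder: nn.Module):
        super().__init__()
        self.long_enc = long_enc    # long-term semantics from the source image
        self.short_enc = short_enc  # short-term semantics from prior outputs
        self.enhancer = enhancer    # suppresses noise in the autoregressive loop
        self.decoder = decoder      # renders the next frame

    def forward(self, source: torch.Tensor, motions: list) -> torch.Tensor:
        # Long-term semantics are extracted once from the source image
        # and reused for every frame in the clip.
        long_sem = self.long_enc(source)
        frames, prev = [], source
        for motion in motions:
            # Short-term semantics come autoregressively from the last
            # generated frame; the enhancer damps accumulated noise.
            short_sem = self.enhancer(self.short_enc(prev))
            # Fuse the complementary long- and short-term semantics
            # together with the driving motion to render the next frame.
            frame = self.decoder(long_sem, short_sem, motion)
            frames.append(frame)
            prev = frame.detach()  # cut the graph to keep training tractable
        return torch.stack(frames, dim=1)  # (batch, time, channels, H, W)
```

The key design point this sketch tries to capture is that the first frame is conditioned only on the source image, while every later frame additionally sees the previous output, so short-term detail propagates without the generator drifting away from the long-term identity.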
Data Availability
The data supporting the findings of this study are available in public repositories.
Code Availability
The code used in this research is available upon request.
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Zhao Jing, Hongxia Bie and Jiali Wang. The first draft of the manuscript was written by Zhao Jing and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests relevant to the content of this article.
Consent
Informed consent was obtained from all participants involved in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jing, Z., Bie, H., Wang, J. et al. Talking-head video generation with long short-term contextual semantics. Appl Intell 55, 120 (2025). https://doi.org/10.1007/s10489-024-06010-y