Research Article

Progressive Transformer Machine for Natural Character Reenactment

Published: 17 February 2023

Abstract

Character reenactment aims to control a target person’s full-head movement using a monocular driving sequence of the driving character. Current algorithms employ convolutional neural networks within generative adversarial networks, extracting historical and geometric information to generate video frames iteratively. However, convolutional neural networks capture only local information within limited receptive fields and ignore the global dependencies that are crucial for face synthesis, which leads to unnatural video frames. In this work, we design a progressive transformer module that introduces multi-head self-attention with convolutional refinement to capture global and local dependencies simultaneously. Specifically, we utilize a non-overlapping window-based multi-head self-attention mechanism with a hierarchical architecture to obtain larger receptive fields on low-resolution feature maps and thus extract global information. To better model local dependencies, we introduce a convolution operation that further refines the attention weights in the multi-head self-attention mechanism. Finally, we use several stacked progressive transformer modules with down-sampling to encode the appearance information of previously generated frames and the parameterized 3D face information of the current frame, and several stacked progressive transformer modules with up-sampling to generate video frames iteratively. In this way, the model captures global and local information, producing video frames that are globally natural while preserving sharp outlines and rich detail. Extensive experiments on several standard benchmarks show that the proposed method outperforms current leading algorithms.
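To make the attention design concrete, the following is a minimal PyTorch-style sketch (not the authors' code) of a non-overlapping window-based multi-head self-attention block whose attention weights are refined by a convolution, as described in the abstract. The window size, head count, depthwise refinement kernel, and the exact point at which the refinement is applied are illustrative assumptions inferred only from the abstract.

```python
# Illustrative sketch of window-based multi-head self-attention with
# convolutional refinement of the attention weights. All hyperparameters
# and the residual placement of the refinement are assumptions.
import torch
import torch.nn as nn


class WindowAttentionWithConvRefine(nn.Module):
    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Depthwise conv that refines each head's (N x N) attention map.
        self.refine = nn.Conv2d(num_heads, num_heads, kernel_size=3,
                                padding=1, groups=num_heads)

    def forward(self, x):  # x: (B, H, W, C); H, W divisible by window_size
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition into non-overlapping windows: (B * num_windows, ws*ws, C).
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        Bn, N, _ = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (Bn, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (Bn, heads, N, N)
        attn = attn + self.refine(attn)                # convolutional refinement
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        out = self.proj(out)
        # Merge windows back to (B, H, W, C).
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


if __name__ == "__main__":
    block = WindowAttentionWithConvRefine(dim=64, window_size=8, num_heads=4)
    feat = torch.randn(1, 32, 32, 64)
    print(block(feat).shape)  # torch.Size([1, 32, 32, 64])
```

In a hierarchical encoder/decoder of the kind the abstract describes, blocks like this would be stacked with down-sampling (encoder) or up-sampling (decoder) between stages, so that the windows cover progressively larger portions of the face at lower resolutions.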




    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2s
April 2023, 545 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3572861
• Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 February 2023
    Online AM: 25 August 2022
    Accepted: 19 August 2022
    Revised: 25 July 2022
    Received: 24 May 2022
    Published in TOMM Volume 19, Issue 2s

    Author Tags

    1. Character reenactment
    2. full-head reenactment
    3. neural rendering
    4. video render
    5. 3DMM
    6. transformer

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Science and Technology Project of Guangdong Province
    • Guangzhou Science and Technology Plan Project
    • Guangdong Provincial Key Laboratory of Human Digital Twin

Article Metrics

    • Downloads (Last 12 months): 55
    • Downloads (Last 6 weeks): 6

    Reflects downloads up to 02 Mar 2025

Cited By

    • (2025) Unambiguous granularity distillation for asymmetric image retrieval. Neural Networks. DOI: 10.1016/j.neunet.2025.107303. Online publication date: Feb-2025.
    • (2025) Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation. International Journal of Computer Vision. DOI: 10.1007/s11263-025-02358-x. Online publication date: 4-Feb-2025.
    • (2024) Domain-aware Multimodal Dialog Systems with Distribution-based User Characteristic Modeling. ACM Transactions on Multimedia Computing, Communications, and Applications 21, 2, 1–22. DOI: 10.1145/3704811. Online publication date: 19-Nov-2024.
    • (2024) Multi-Modal Driven Pose-Controllable Talking Head Generation. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 12, 1–23. DOI: 10.1145/3673901. Online publication date: 10-Aug-2024.
    • (2024) Bridging the Domain Gap in Scene Flow Estimation via Hierarchical Smoothness Refinement. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8, 1–21. DOI: 10.1145/3661823. Online publication date: 12-Jun-2024.
    • (2024) Efficient Brain Tumor Segmentation with Lightweight Separable Spatial Convolutional Network. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1–19. DOI: 10.1145/3653715. Online publication date: 16-May-2024.
    • (2024) A Novel Framework for Joint Learning of City Region Partition and Representation. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1–23. DOI: 10.1145/3652857. Online publication date: 16-May-2024.
    • (2024) Pedestrian Attribute Recognition via Spatio-temporal Relationship Learning for Visual Surveillance. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 6, 1–15. DOI: 10.1145/3632624. Online publication date: 8-Mar-2024.
    • (2024) Learning Adaptive Spatial Coherent Correlations for Speech-Preserving Facial Expression Manipulation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7267–7276. DOI: 10.1109/CVPR52733.2024.00694. Online publication date: 16-Jun-2024.
    • (2024) A pure MLP-Mixer-based GAN framework for guided image translation. Pattern Recognition, 110894. DOI: 10.1016/j.patcog.2024.110894. Online publication date: Aug-2024.
