Research Article

Progressive Transformer Machine for Natural Character Reenactment

Published: 17 February 2023

Abstract

Character reenactment aims to control a target person’s full-head movement using a monocular driving sequence of the driving character. Current algorithms employ convolutional neural networks within generative adversarial networks, extracting historical and geometric information to generate video frames iteratively. However, convolutional neural networks capture only local information within limited receptive fields and ignore the global dependencies that are crucial for face synthesis, which leads to unnatural video frames. In this work, we design a progressive transformer module that introduces multi-head self-attention with convolutional refinement to capture global and local dependencies simultaneously. Specifically, we utilize a non-overlapping window-based multi-head self-attention mechanism with a hierarchical architecture to obtain larger receptive fields on low-resolution feature maps and thus extract global information. To better model local dependencies, we introduce a convolution operation that further refines the attention weights in the multi-head self-attention mechanism. Finally, we use several stacked progressive transformer modules with down-sampling to encode the appearance information of previously generated frames and the parameterized 3D face information of the current frame, and several stacked progressive transformer modules with up-sampling to generate video frames iteratively. In this way, the model captures global and local information, producing video frames that are globally natural while preserving sharp outlines and rich detail. Extensive experiments on several standard benchmarks show that the proposed method outperforms current leading algorithms.
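To make the attention design concrete, the following is a minimal PyTorch-style sketch (not the authors' code) of a non-overlapping window-based multi-head self-attention block whose attention weights are refined by a convolution, as described in the abstract. The window size, head count, depthwise refinement kernel, and the exact point at which the refinement is applied are illustrative assumptions inferred only from the abstract.

```python
# Illustrative sketch of window-based multi-head self-attention with
# convolutional refinement of the attention weights. All hyperparameters
# and the residual placement of the refinement are assumptions.
import torch
import torch.nn as nn


class WindowAttentionWithConvRefine(nn.Module):
    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Depthwise conv that refines each head's (N x N) attention map.
        self.refine = nn.Conv2d(num_heads, num_heads, kernel_size=3,
                                padding=1, groups=num_heads)

    def forward(self, x):  # x: (B, H, W, C); H, W divisible by window_size
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition into non-overlapping windows: (B * num_windows, ws*ws, C).
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        Bn, N, _ = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (Bn, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (Bn, heads, N, N)
        attn = attn + self.refine(attn)                # convolutional refinement
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        out = self.proj(out)
        # Merge windows back to (B, H, W, C).
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


if __name__ == "__main__":
    block = WindowAttentionWithConvRefine(dim=64, window_size=8, num_heads=4)
    feat = torch.randn(1, 32, 32, 64)
    print(block(feat).shape)  # torch.Size([1, 32, 32, 64])
```

In a hierarchical encoder/decoder of the kind the abstract describes, blocks like this would be stacked with down-sampling (encoder) or up-sampling (decoder) between stages, so that the windows cover progressively larger portions of the face at lower resolutions.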




    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2s
April 2023, 545 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3572861
• Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 February 2023
    Online AM: 25 August 2022
    Accepted: 19 August 2022
    Revised: 25 July 2022
    Received: 24 May 2022
    Published in TOMM Volume 19, Issue 2s

    Author Tags

    1. Character reenactment
    2. full-head reenactment
    3. neural rendering
    4. video render
    5. 3DMM
    6. transformer

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Science and Technology Project of Guangdong Province
    • Guangzhou Science and Technology Plan Project
    • Guangdong Provincial Key Laboratory of Human Digital Twin

Article Metrics

    • Downloads (Last 12 months): 55
    • Downloads (Last 6 weeks): 6

    Reflects downloads up to 02 Mar 2025

Cited By

    • (2025) Unambiguous granularity distillation for asymmetric image retrieval. Neural Networks. DOI: 10.1016/j.neunet.2025.107303. Online publication date: Feb-2025.
    • (2025) Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation. International Journal of Computer Vision. DOI: 10.1007/s11263-025-02358-x. Online publication date: 4-Feb-2025.
    • (2024) Domain-aware Multimodal Dialog Systems with Distribution-based User Characteristic Modeling. ACM Transactions on Multimedia Computing, Communications, and Applications 21, 2, 1–22. DOI: 10.1145/3704811. Online publication date: 19-Nov-2024.
    • (2024) Multi-Modal Driven Pose-Controllable Talking Head Generation. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 12, 1–23. DOI: 10.1145/3673901. Online publication date: 10-Aug-2024.
    • (2024) Bridging the Domain Gap in Scene Flow Estimation via Hierarchical Smoothness Refinement. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8, 1–21. DOI: 10.1145/3661823. Online publication date: 12-Jun-2024.
    • (2024) Efficient Brain Tumor Segmentation with Lightweight Separable Spatial Convolutional Network. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1–19. DOI: 10.1145/3653715. Online publication date: 16-May-2024.
    • (2024) A Novel Framework for Joint Learning of City Region Partition and Representation. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1–23. DOI: 10.1145/3652857. Online publication date: 16-May-2024.
    • (2024) Pedestrian Attribute Recognition via Spatio-temporal Relationship Learning for Visual Surveillance. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 6, 1–15. DOI: 10.1145/3632624. Online publication date: 8-Mar-2024.
    • (2024) Learning Adaptive Spatial Coherent Correlations for Speech-Preserving Facial Expression Manipulation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7267–7276. DOI: 10.1109/CVPR52733.2024.00694. Online publication date: 16-Jun-2024.
    • (2024) A pure MLP-Mixer-based GAN framework for guided image translation. Pattern Recognition, 110894. DOI: 10.1016/j.patcog.2024.110894. Online publication date: Aug-2024.
