Abstract
Virtual Reality (VR) bears the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, which in turn requires efficient and accurate labeling of the headset-mounted camera (HMC) images captured while the user is wearing a VR headset. This is challenging due to the oblique camera views and the difference in image modality between the HMC captures and the avatar's renders. In this work, we first show that the domain gap between the avatar and the HMC images is one of the primary sources of difficulty: a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain gap is re-introduced. Building on this finding, we propose a system split into two parts: an iterative refinement module that takes in-domain inputs, and a generic avatar-guided image-to-image domain-transfer module conditioned on the current expression estimate. The two modules reinforce each other: domain transfer becomes easier when close-to-groundtruth examples are shown, and better domain-gap removal in turn improves the registration. Our system obviates the need for costly offline optimization and produces online registrations of higher quality than direct regression methods. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over direct regression and offline registration baselines. To stimulate further research in this direction, we make our large-scale dataset and code publicly available.
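The alternation described above lends itself to a short sketch. Below is a minimal PyTorch-style illustration of the two-module loop, assuming hypothetical submodules: `domain_transfer` (the avatar-guided image-to-image network), `refiner` (the in-domain expression regressor), and `renderer` (a callable producing an avatar render from an expression code). The additive update and the fixed iteration count are illustrative simplifications, not the authors' published interface.

```python
import torch
import torch.nn as nn

class AvatarRegistration(nn.Module):
    """Sketch of the mutually reinforcing loop from the abstract.

    domain_transfer: maps a raw HMC image into the avatar's rendering
        domain, conditioned on a render of the current expression estimate.
    refiner: consumes only in-domain inputs (transferred image plus
        avatar render) and predicts an update to the expression code.
    renderer: callable producing an avatar render from an expression code.
    All three names are assumptions for illustration.
    """

    def __init__(self, domain_transfer: nn.Module, refiner: nn.Module,
                 renderer, num_iters: int = 3):
        super().__init__()
        self.F = domain_transfer
        self.R = refiner
        self.render = renderer
        self.num_iters = num_iters

    def forward(self, hmc_image: torch.Tensor,
                expr: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_iters):
            # A better estimate yields a conditioning render closer to
            # groundtruth, which makes domain transfer easier ...
            avatar_render = self.render(expr)
            in_domain = self.F(hmc_image, avatar_render)
            # ... and a cleaner in-domain image makes refinement easier.
            expr = expr + self.R(in_domain, avatar_render)
        return expr
```

Because the refiner only ever sees domain-consistent inputs, it sidesteps the modality gap identified above, while the transfer network is never asked to translate far from its conditioning render.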
Notes
1. In this work we differentiate between unseen identities for avatar generation vs. unseen identities for HMC driving. We always assume an avatar for a new identity is already available through prior works, and evaluate the performance of expression estimation methods on unseen HMC images of that identity.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Patel, C., Bai, S., Wang, TL., Saragih, J., Wei, SE. (2025). Fast Registration of Photorealistic Avatars for VR Facial Animation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15120. Springer, Cham. https://doi.org/10.1007/978-3-031-73033-7_23
DOI: https://doi.org/10.1007/978-3-031-73033-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73032-0
Online ISBN: 978-3-031-73033-7
eBook Packages: Computer Science, Computer Science (R0)