Fast Registration of Photorealistic Avatars for VR Facial Animation

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15120)

Abstract

Virtual Reality (VR) bears the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, which in turn requires efficient and accurate acquisition of expression labels for headset-mounted camera (HMC) images captured while the user is wearing a VR headset. This is challenging due to oblique camera views and differences in image modality. In this work, we first show that the domain gap between the avatar and HMC images is one of the primary sources of difficulty: a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain gap is re-introduced. Building on this finding, we propose a system split into two parts: an iterative refinement module that operates on in-domain inputs, and a generic avatar-guided image-to-image domain-transfer module conditioned on the current estimate. The two modules reinforce each other: domain transfer becomes easier when close-to-ground-truth examples are shown, and better domain-gap removal in turn improves the registration. Our system obviates the need for costly offline optimization and produces online registrations of higher quality than direct regression methods. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over direct regression and offline-optimization baselines. To stimulate further research in this direction, we make our large-scale dataset and code publicly available.
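
To make the two-module loop concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: DomainTransfer, Refiner, register, and the toy renderer below are hypothetical placeholders that only mirror the structure described in the abstract (render the avatar under the current expression estimate, transfer the HMC image into the avatar's rendering domain conditioned on that render, then refine the estimate from the now in-domain pair).

import torch
import torch.nn as nn

class DomainTransfer(nn.Module):
    """Placeholder for the avatar-guided domain-transfer module (hypothetical).

    Maps an HMC image toward the avatar's rendering domain, conditioned on a
    render of the avatar under the current expression estimate.
    """
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, hmc_image, current_render):
        return self.net(torch.cat([hmc_image, current_render], dim=1))

class Refiner(nn.Module):
    """Placeholder for the iterative refinement module (hypothetical).

    Predicts an update to the expression code from a pair of in-domain images:
    the domain-transferred HMC image and the current avatar render.
    """
    def __init__(self, channels: int = 3, code_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * channels, 8, kernel_size=3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, code_dim),
        )

    def forward(self, transferred, current_render):
        return self.encoder(torch.cat([transferred, current_render], dim=1))

def register(hmc_image, render_avatar, code_dim=16, num_iters=3):
    """Alternate domain transfer and refinement for a few iterations.

    `render_avatar` is assumed to map an expression code to an avatar image
    in the HMC camera view.
    """
    transfer, refiner = DomainTransfer(), Refiner(code_dim=code_dim)
    code = torch.zeros(hmc_image.shape[0], code_dim)  # neutral initialization
    for _ in range(num_iters):
        render = render_avatar(code)              # current estimate, in the avatar domain
        in_domain = transfer(hmc_image, render)   # remove the HMC-to-avatar domain gap
        code = code + refiner(in_domain, render)  # refine the expression estimate
    return code

# Toy usage with a stand-in "renderer" (a fixed linear map to a 32x32 image):
renderer = nn.Sequential(nn.Linear(16, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))
expression_code = register(torch.randn(1, 3, 32, 32), renderer)

The alternation is the point of the sketch, not the layers: each pass hands the transfer module a closer-to-ground-truth conditioning render, and each better transfer hands the refiner cleaner in-domain inputs, mirroring the mutual reinforcement described above.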


Notes

  1. In this work we differentiate between unseen identities for avatar generation vs. unseen identities for HMC driving. We always assume an avatar for a new identity is already available through prior works, and evaluate the performance of expression estimation methods on unseen HMC images of that identity.


Author information

Correspondence to Chaitanya Patel.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 15482 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Patel, C., Bai, S., Wang, TL., Saragih, J., Wei, SE. (2025). Fast Registration of Photorealistic Avatars for VR Facial Animation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15120. Springer, Cham. https://doi.org/10.1007/978-3-031-73033-7_23

  • DOI: https://doi.org/10.1007/978-3-031-73033-7_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73032-0

  • Online ISBN: 978-3-031-73033-7

  • eBook Packages: Computer Science (R0)
