Abstract
Virtual Reality (VR) bears the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, which in turn requires efficient and accurate labeling of the headset-mounted camera (HMC) images captured while the user is wearing a VR headset. This is challenging due to the oblique camera views and the difference in image modality between the HMC captures and the avatar's renders. In this work, we first show that the domain gap between the avatar and the HMC images is one of the primary sources of difficulty: a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain gap is re-introduced. Building on this finding, we propose a system split into two parts: an iterative refinement module that takes in-domain inputs, and a generic avatar-guided image-to-image domain-transfer module conditioned on the current expression estimate. The two modules reinforce each other: domain transfer becomes easier when close-to-groundtruth examples are shown, and better domain-gap removal in turn improves the registration. Our system obviates the need for costly offline optimization and produces online registrations of higher quality than direct regression methods. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over direct regression and offline registration baselines. To stimulate further research in this direction, we make our large-scale dataset and code publicly available.
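The alternation described above lends itself to a short sketch. Below is a minimal PyTorch-style illustration of the two-module loop, assuming hypothetical submodules: `domain_transfer` (the avatar-guided image-to-image network), `refiner` (the in-domain expression regressor), and `renderer` (a callable producing an avatar render from an expression code). The additive update and the fixed iteration count are illustrative simplifications, not the authors' published interface.

```python
import torch
import torch.nn as nn

class AvatarRegistration(nn.Module):
    """Sketch of the mutually reinforcing loop from the abstract.

    domain_transfer: maps a raw HMC image into the avatar's rendering
        domain, conditioned on a render of the current expression estimate.
    refiner: consumes only in-domain inputs (transferred image plus
        avatar render) and predicts an update to the expression code.
    renderer: callable producing an avatar render from an expression code.
    All three names are assumptions for illustration.
    """

    def __init__(self, domain_transfer: nn.Module, refiner: nn.Module,
                 renderer, num_iters: int = 3):
        super().__init__()
        self.F = domain_transfer
        self.R = refiner
        self.render = renderer
        self.num_iters = num_iters

    def forward(self, hmc_image: torch.Tensor,
                expr: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_iters):
            # A better estimate yields a conditioning render closer to
            # groundtruth, which makes domain transfer easier ...
            avatar_render = self.render(expr)
            in_domain = self.F(hmc_image, avatar_render)
            # ... and a cleaner in-domain image makes refinement easier.
            expr = expr + self.R(in_domain, avatar_render)
        return expr
```

Because the refiner only ever sees domain-consistent inputs, it sidesteps the modality gap identified above, while the transfer network is never asked to translate far from its conditioning render.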
Notes
1. In this work we differentiate between unseen identities for avatar generation vs. unseen identities for HMC driving. We always assume an avatar for a new identity is already available through prior works, and evaluate the performance of expression estimation methods on unseen HMC images of that identity.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Patel, C., Bai, S., Wang, TL., Saragih, J., Wei, SE. (2025). Fast Registration of Photorealistic Avatars for VR Facial Animation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15120. Springer, Cham. https://doi.org/10.1007/978-3-031-73033-7_23
DOI: https://doi.org/10.1007/978-3-031-73033-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73032-0
Online ISBN: 978-3-031-73033-7
eBook Packages: Computer Science, Computer Science (R0)