Abstract
Facial expressions and hand motions are necessary for expressing our emotions and interacting with the world. Nevertheless, most 3D human avatars modeled from a casually captured video support only body motions, without facial expressions or hand motions. In this work, we present ExAvatar, an expressive whole-body 3D human avatar learned from a short monocular video. We design ExAvatar as a combination of the whole-body parametric mesh model (SMPL-X) and 3D Gaussian Splatting (3DGS). The main challenges are 1) the limited diversity of facial expressions and poses in the video and 2) the absence of 3D observations, such as 3D scans and RGBD images. The limited diversity in the video makes animation with novel facial expressions and poses non-trivial. In addition, the absence of 3D observations causes significant ambiguity in human parts that are not observed in the video, which can result in noticeable artifacts under novel motions. To address these challenges, we introduce a hybrid representation of the mesh and 3D Gaussians. Our hybrid representation treats each 3D Gaussian as a vertex on the surface, with pre-defined connectivity information (i.e., triangle faces) between the Gaussians following the mesh topology of SMPL-X. This makes ExAvatar animatable with novel facial expressions, driven by the facial expression space of SMPL-X. In addition, by using connectivity-based regularizers, we significantly reduce artifacts in novel facial expressions and poses.
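To make the hybrid representation concrete, below is a minimal sketch (in PyTorch, not the authors' code) of one plausible connectivity-based regularizer: each Gaussian center lives on a SMPL-X mesh vertex, and the pre-defined triangle faces supply the neighborhood over which a Laplacian-style smoothness penalty is computed. The names `laplacian_regularizer`, `gaussian_means`, and `faces` are hypothetical; in practice, SMPL-X provides the actual vertex and face arrays.

```python
# Minimal sketch of a connectivity-based (Laplacian-style) regularizer on
# per-vertex 3D Gaussian centers. Assumes PyTorch; names are hypothetical.
import torch

def laplacian_regularizer(gaussian_means: torch.Tensor,
                          faces: torch.Tensor) -> torch.Tensor:
    """Penalize each Gaussian center for drifting from the centroid of its
    mesh neighbors, keeping parts unseen in the video smooth under novel
    poses and expressions.

    gaussian_means: (V, 3) float tensor, one Gaussian center per SMPL-X vertex.
    faces:          (F, 3) long tensor, triangle indices (SMPL-X topology).
    """
    V = gaussian_means.shape[0]
    # Turn each triangle into its three edges; interior edges appear twice,
    # which cancels out when the neighbor sum is divided by the degree.
    edges = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    nbr_sum = torch.zeros_like(gaussian_means)
    degree = torch.zeros(V, dtype=gaussian_means.dtype,
                         device=gaussian_means.device)
    # Accumulate neighbor positions and degrees in both edge directions.
    for a, b in ((edges[:, 0], edges[:, 1]), (edges[:, 1], edges[:, 0])):
        nbr_sum.index_add_(0, a, gaussian_means[b])
        degree.index_add_(0, a, torch.ones_like(a, dtype=gaussian_means.dtype))
    centroid = nbr_sum / degree.clamp(min=1.0).unsqueeze(-1)
    return ((gaussian_means - centroid) ** 2).sum(dim=-1).mean()
```

Under these assumptions, such a term would be added to the photometric training objective with a weight, so that vertices rarely or never observed in the monocular video stay consistent with their mesh neighbors instead of drifting freely.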
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Moon, G., Shiratori, T., Saito, S. (2025). Expressive Whole-Body 3D Gaussian Avatar. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15099. Springer, Cham. https://doi.org/10.1007/978-3-031-72940-9_2
DOI: https://doi.org/10.1007/978-3-031-72940-9_2
Print ISBN: 978-3-031-72939-3
Online ISBN: 978-3-031-72940-9