KeypointNeRF: Generalizing Image-Based Volumetric Avatars Using Relative Spatial Encoding of Keypoints

Mihajlovic, Marko; Bansal, Aayush; Zollhöfer, Michael; Tang, Siyu; Saito, Shunsuke

doi:10.1007/978-3-031-19784-0_11

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13675))

Included in the following conference series:

European Conference on Computer Vision

2814 Accesses

Abstract

Image-based volumetric humans using pixel-aligned features promise generalization to unseen poses and identities. Prior work leverages global spatial encodings and multi-view geometric consistency to reduce spatial ambiguity. However, global encodings often suffer from overfitting to the distribution of the training data, and it is difficult to learn multi-view consistent reconstruction from sparse views. In this work, we investigate common issues with existing spatial encodings and propose a simple yet highly effective approach to modeling high-fidelity volumetric humans from sparse views. One of the key ideas is to encode relative spatial 3D information via sparse 3D keypoints. This approach is robust to the sparsity of viewpoints and cross-dataset domain gap. Our approach outperforms state-of-the-art methods for head reconstruction. On human body reconstruction for unseen subjects, we also achieve performance comparable to prior work that uses a parametric human body model and temporal feature aggregation. Our experiments show that a majority of errors in prior work stem from an inappropriate choice of spatial encoding and thus we suggest a new direction for high-fidelity image-based human modeling.

M. Mihajlovic—The work was primarily done during an internship at Meta.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Multi-view Canonical Pose 3D Human Body Reconstruction Based on Volumetric TSDF

AvatarCap: Animatable Avatar Conditioned Monocular Human Volumetric Capture

Reconstructing 3D Human Avatars from Monocular Images

References

Alldieck, T., Magnor, M., Xu, W., Theobalt, C., Pons-Moll, G.: Video based reconstruction of 3D people models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)
Google Scholar
Alldieck, T., Xu, H., Sminchisescu, C.: imGHUM: implicit generative models of 3D human shape and articulated pose. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Google Scholar
Alldieck, T., Zanfir, M., Sminchisescu, C.: Photorealistic monocular 3D reconstruction of humans wearing clothing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1506–1515 (2022)
Google Scholar
Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: shape completion and animation of people. In: ACM SIGGRAPH 2005 Papers, pp. 408–416 (2005)
Google Scholar
Athar, S., Shu, Z., Samaras, D.: Flame-in-NeRF: neural control of radiance fields for free view face animation. arXiv preprint arXiv:2108.04913 (2021)
Bansal, A., Chen, X., Russell, B., Gupta, A., Ramanan, D.: PixelNet: representation of the pixels, by the pixels, and for the pixels. arXiv preprint arXiv:1702.06506 (2017)
Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187–194 (1999)
Google Scholar
Buehler, M.C., Meka, A., Li, G., Beeler, T., Hilliges, O.: VariTex: variational neural face textures. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
Google Scholar
Cao, C., et al.: Authentic volumetric avatars from a phone scan. ACM Trans. Graph. (TOG) 41, 1–19 (2022)
Google Scholar
Cao, C., Wu, H., Weng, Y., Shao, T., Zhou, K.: Real-time facial animation with image-based dynamic avatars. ACM Trans. Graph. 35(4) (2016)
Google Scholar
Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)
Google Scholar
Chatziagapi, A., Athar, S., Moreno-Noguer, F., Samaras, D.: SIDER: single-image neural optimization for facial geometric detail recovery. arXiv preprint arXiv:2108.05465 (2021)
Chen, A., et al.: MVSNeRF: fast generalizable radiance field reconstruction from multi-view stereo. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2021)
Google Scholar
Chibane, J., Bansal, A., Lazova, V., Pons-Moll, G.: Stereo radiance fields (SRF): learning view synthesis from sparse views of novel scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2021)
Google Scholar
Gafni, G., Thies, J., Zollhofer, M., Niessner, M.: Dynamic neural radiance fields for monocular 4D facial avatar reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Google Scholar
Gao, C., Shih, Y., Lai, W.S., Liang, C.K., Huang, J.B.: Portrait neural radiance fields from a single image. arXiv preprint arXiv:2012.05903 (2020)
Grassal, P.W., Prinzler, M., Leistner, T., Rother, C., Nießner, M., Thies, J.: Neural head avatars from monocular RGB videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18653–18664 (2022)
Google Scholar
Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003)
MATH Google Scholar
He, T., Xu, Y., Saito, S., Soatto, S., Tung, T.: ARCH++: animation-ready clothed human reconstruction revisited. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2021)
Google Scholar
Hu, L., et al.: Avatar digitization from a single image for real-time rendering. ACM Trans. Graph. (TOG) 36(6), 1–14 (2017)
Article Google Scholar
Huang, Z., Xu, Y., Lassner, C., Li, H., Tung, T.: ARCH: animatable reconstruction of clothed humans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Google Scholar
Ichim, A.E., Bouaziz, S., Pauly, M.: Dynamic 3D avatar creation from hand-held video input. ACM Trans. Graph. (TOG) 34(4), 1–14 (2015)
Article Google Scholar
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43
Chapter Google Scholar
Ke, Z., Sun, J., Li, K., Yan, Q., Lau, R.W.: MODNet: real-time trimap-free portrait matting via objective decomposition. In: AAAI (2022)
Google Scholar
Kim, H., Garrido, P., Tewari, A., Xu, W., Thies, J., Nießner, M., Pérez, P., Richardt, C., Zollöfer, M., Theobalt, C.: Deep video portraits. ACM Trans. Graph. (TOG) 37(4), 163 (2018)
Article Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of International Conference on Learning Representations (2015)
Google Scholar
Kwon, Y., Kim, D., Ceylan, D., Fuchs, H.: Neural human performer: learning generalizable radiance fields for human performance rendering. In: Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc. (2021)
Google Scholar
Lombardi, S., Saragih, J., Simon, T., Sheikh, Y.: Deep appearance models for face rendering. ACM Trans. Graph. (TOG) 37(4), 1–13 (2018)
Article Google Scholar
Lombardi, S., Simon, T., Saragih, J., Schwartz, G., Lehrmann, A., Sheikh, Y.: Neural volumes: learning dynamic renderable volumes from images. ACM Trans. Graph. (TOG) (2019)
Google Scholar
Lombardi, S., Simon, T., Schwartz, G., Zollhoefer, M., Sheikh, Y., Saragih, J.: Mixture of volumetric primitives for efficient neural rendering. arXiv preprint arXiv:2103.01954 (2021)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Google Scholar
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graph. (TOG) 34(6), 1–16 (2015)
Article Google Scholar
Martin-Brualla, R., et al.: LookinGood: enhancing performance capture with real-time neural re-rendering. arXiv preprint arXiv:1811.05029 (2018)
Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-based visual hulls. In: ACM SIGGRAPH, pp. 369–374 (2000)
Google Scholar
Meka, A., et al.: Deep relightable textures: volumetric performance capture with neural rendering. ACM Trans. Graph. (TOG) 39(6), 1–21 (2020)
Article Google Scholar
Mihajlovic, M., Saito, S., Bansal, A., Zollhoefer, M., Tang, S.: COAP: compositional articulated occupancy of people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Google Scholar
Mihajlovic, M., Weder, S., Pollefeys, M., Oswald, M.R.: DeepSurfels: learning online appearance fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14524–14535 (2021)
Google Scholar
Mihajlovic, M., Zhang, Y., Black, M.J., Tang, S.: LEAP: learning articulated occupancy of people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Google Scholar
Mildenhall, B., et al.: Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph. (TOG) 38, 1–14 (2019)
Article Google Scholar
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24
Chapter Google Scholar
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Chapter Google Scholar
Peng, S., et al.: Animatable neural radiance fields for modeling dynamic human bodies. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14314–14323 (2021)
Google Scholar
Peng, S., et al.: Neural body: implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Google Scholar
Prokudin, S., Black, M.J., Romero, J.: SMPLpix: neural avatars from 3D human models. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1810–1819 (2021)
Google Scholar
Raj, A., Tanke, J., Hays, J., Vo, M., Stoll, C., Lassner, C.: ANR: articulated neural rendering for virtual avatars. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3722–3731 (2021)
Google Scholar
Raj, A., et al.: PVA: pixel-aligned volumetric avatars. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Google Scholar
Ramon, E., et al.: H3D-net: few-shot high-fidelity 3D head reconstruction. arXiv preprint arXiv:2107.12512 (2021)
Rebain, D., Matthews, M., Yi, K.M., Lagun, D., Tagliasacchi, A.: LOLNeRF: learn from one look. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1558–1567 (2022)
Google Scholar
Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2019)
Google Scholar
Saito, S., Simon, T., Saragih, J., Joo, H.: PIFuHD: multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Google Scholar
Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A.: GRAF: generative radiance fields for 3D-aware image synthesis. In: Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc. (2020)
Google Scholar
Shao, R., Zhang, H., Zhang, H., Cao, Y., Yu, T., Liu, Y.: DoubleField: bridging the neural surface and radiance fields for high-fidelity human rendering. arXiv preprint arXiv:2106.03798 (2021)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Tewari, A., et al.: State of the art on neural rendering. In: Computer Graphics Forum, vol. 39, pp. 701–727. Wiley Online Library (2020)
Google Scholar
Tewari, A., et al.: Advances in neural rendering. arXiv preprint arXiv:2111.05849 (2021)
Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR 2011, pp. 1521–1528. IEEE (2011)
Google Scholar
Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph. 27(3), 97 (2008)
Article Google Scholar
Vlasic, D., et al.: Dynamic shape capture using multi-view photometric stereo. ACM Trans. Graph. 28(5), 174 (2009)
Article Google Scholar
Wang, L., Chen, Z., Yu, T., Ma, C., Li, L., Liu, Y.: FaceVerse: a fine-grained and detail-controllable 3D face morphable model from a hybrid dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20333–20342 (2022)
Google Scholar
Wang, Q., et al.: IBRNet: learning multi-view image-based rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2021)
Google Scholar
Wang, S., Mihajlovic, M., Ma, Q., Geiger, A., Tang, S.: Metaavatar: learning animatable clothed human models from few depth images. Adv. Neural Inf. Process. Syst. 34 (2021)
Google Scholar
Wang, S., Schwartz, K., Geiger, A., Tang, S.: ARAH: animatable volume rendering of articulated human SDFs. In: European conference on computer vision (2022)
Google Scholar
Wang, Z., et al.: Learning compositional radiance fields of dynamic human heads. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Google Scholar
Weng, C.Y., Curless, B., Srinivasan, P.P., Barron, J.T., Kemelmacher-Shlizerman, I.: HumanNeRF: free-viewpoint rendering of moving people from monocular video. arXiv preprint arXiv:2201.04127 (2022)
Xu, H., Alldieck, T., Sminchisescu, C.: H-NeRF: Neural radiance fields for rendering and temporal reconstruction of humans in motion. Adv. Neural Inf. Process. Syst. 34 (2021)
Google Scholar
Xu, H., Bazavan, E.G., Zanfir, A., Freeman, W.T., Sukthankar, R., Sminchisescu, C.: Ghum & ghuml: Generative 3D human shape and articulated pose models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6184–6193 (2020)
Google Scholar
Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: neural radiance fields from one or few images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4578–4587 (2021)
Google Scholar
Yu, T., et al.: DoubleFusion: real-time capture of human performances with inner body shapes from a single depth sensor. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7287–7296 (2018)
Google Scholar
Zhao, H., et al.: High-fidelity human avatars from a single RGB camera. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15904–15913 (2022)
Google Scholar
Zheng, Y., Abrevaya, V.F., Bühler, M.C., Chen, X., Black, M.J., Hilliges, O.: IM Avatar: Implicit morphable head avatars from videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Google Scholar
Zheng, Z., Huang, H., Yu, T., Zhang, H., Guo, Y., Liu, Y.: Structured local radiance fields for human avatar modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15893–15903 (2022)
Google Scholar
Zheng, Z., Yu, T., Liu, Y., Dai, Q.: PaMIR: parametric model-conditioned implicit representation for image-based human reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
Google Scholar

Download references

Acknowledgments

We thank Chen Cao for the help with the in-the-wild iPhone capture. M. M. and S. T. acknowledge the SNF grant 200021 204840.

Author information

Authors and Affiliations

ETH Zürich, Zürich, Switzerland
Marko Mihajlovic & Siyu Tang
Reality Labs Research, Pittsburgh, USA
Marko Mihajlovic, Aayush Bansal, Michael Zollhöfer & Shunsuke Saito

Authors

Marko Mihajlovic
View author publications
You can also search for this author in PubMed Google Scholar
Aayush Bansal
View author publications
You can also search for this author in PubMed Google Scholar
Michael Zollhöfer
View author publications
You can also search for this author in PubMed Google Scholar
Siyu Tang
View author publications
You can also search for this author in PubMed Google Scholar
Shunsuke Saito
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Marko Mihajlovic .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1484 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Mihajlovic, M., Bansal, A., Zollhöfer, M., Tang, S., Saito, S. (2022). KeypointNeRF: Generalizing Image-Based Volumetric Avatars Using Relative Spatial Encoding of Keypoints. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13675. Springer, Cham. https://doi.org/10.1007/978-3-031-19784-0_11

Download citation

DOI: https://doi.org/10.1007/978-3-031-19784-0_11
Published: 31 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19783-3
Online ISBN: 978-3-031-19784-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

KeypointNeRF: Generalizing Image-Based Volumetric Avatars Using Relative Spatial Encoding of Keypoints

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Multi-view Canonical Pose 3D Human Body Reconstruction Based on Volumetric TSDF

AvatarCap: Animatable Avatar Conditioned Monocular Human Volumetric Capture

Reconstructing 3D Human Avatars from Monocular Images

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 1484 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

KeypointNeRF: Generalizing Image-Based Volumetric Avatars Using Relative Spatial Encoding of Keypoints

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Multi-view Canonical Pose 3D Human Body Reconstruction Based on Volumetric TSDF

AvatarCap: Animatable Avatar Conditioned Monocular Human Volumetric Capture

Reconstructing 3D Human Avatars from Monocular Images

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 1484 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation