research-article

Neural3D: Light-weight Neural Portrait Scanning via Context-aware Correspondence Learning

Authors:
Xin Suo

ShanghaiTech University, Shanghai, China

ShanghaiTech University, Shanghai, China
View Profile

,
Minye Wu

ShanghaiTech University, Shanghai, China

ShanghaiTech University, Shanghai, China
View Profile

,
Yanshun Zhang

Dgene, Shanghai, China

Dgene, Shanghai, China
View Profile

,
Yingliang Zhang

Dgene, Shanghai, China

Dgene, Shanghai, China
View Profile

,
Lan Xu

ShanghaiTech University, Shanghai, China

ShanghaiTech University, Shanghai, China
View Profile

,
Qiang Hu

ShanghaiTech University, Shanghai, China

ShanghaiTech University, Shanghai, China
View Profile

,
Jingyi Yu

ShanghaiTech University, Shanghai, China

ShanghaiTech University, Shanghai, China
View Profile

MM '20: Proceedings of the 28th ACM International Conference on MultimediaOctober 2020Pages 3651–3660https://doi.org/10.1145/3394171.3413734

Published:12 October 2020Publication History

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

Pages 3651–3660

ABSTRACT

Reconstructing a human portrait in a realistic and convenient manner is critical for human modeling and understanding. Aiming at light-weight and realistic human portrait reconstruction, in this paper we propose Neural3D: a novel neural human portrait scanning system using only a single RGB camera. In our system, to enable accurate pose estimation,we propose a context-aware correspondence learning approach which jointly models the appearance, spatial and motion information between feature pairs. To enable realistic reconstruction and suppress the geometry error, we further adopt a point-based neural rendering scheme to generate realistic and immersive portrait visualization in arbitrary virtual view-points. By introducing these learning-based technical components into the pure RGB-based human modeling framework, we can achieve both accurate camera pose estimation and realistic free-viewpoint rendering of the reconstructed human portrait. Extensive experiments on a variety of challenging capture scenarios demonstrate the robustness and effectiveness of our approach.

Supplemental Material

3394171.3413734.mp4

mp4

51.3 MB

Download

References

Kara-Ali Aliev, Dmitry Ulyanov, and Victor Lempitsky. 2019. Neural point-based graphics. arXiv preprint arXiv:1906.08240 (2019).Google Scholar
Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. 2018a. Detailed Human Avatars from Monocular Video. In 2018 International Conference on 3D Vision (3DV). 98--109.Google ScholarCross Ref
Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. 2018b. Video Based Reconstruction of 3D People Models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Brett Allen, Brian Curless, and Zoran Popovi?. 2003. The Space of Human Body Shapes: Reconstruction and Parameterization from Range Scans. ACM Trans. Graph., Vol. 22, 3 (July 2003), 587--594. https://doi.org/10.1145/882262.882311Google ScholarDigital Library
artec3d [n.d.]. artec3d. https://www.artec3d.com/. Accessed: 2020-05-24.Google Scholar
Daniel Barath, Jiri Matas, and Jana Noskova. 2019. MAGSAC: marginalizing sample consensus. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10197--10205.Google ScholarCross Ref
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. SURF: Speeded Up Robust Features. In Computer Vision -- ECCV 2006,, Alevs Leonardis, Horst Bischof, and Axel Pinz (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 404--417.Google Scholar
Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, and Gerard Pons-Moll. 2019. Multi-Garment Net: Learning to Dress 3D People from Images. In IEEE International Conference on Computer Vision (ICCV). IEEE.Google ScholarCross Ref
Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. 2017. DSAC - Differentiable RANSAC for Camera Localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google Scholar
Eric Brachmann and Carsten Rother. 2019. Neural-Guided RANSAC: Learning Where to Sample Model Hypotheses. arXiv preprint arXiv:1905.04132 (2019).Google Scholar
Michael Calonder, Vincent Lepetit, Mustafa Ozuysal, Tomasz Trzcinski, Christoph Strecha, and Pascal Fua. 2011. BRIEF: Computing a local binary descriptor very fast. IEEE transactions on pattern analysis and machine intelligence, Vol. 34, 7 (2011), 1281--1298.Google Scholar
Wei Cheng, Lan Xu, Lei Han, Yuanfang Guo, and Lu Fang. 2018. IHuman3D: Intelligent Human Body 3D Reconstruction Using a Single Flying Camera. In Proceedings of the 26th ACM International Conference on Multimedia (Seoul, Republic of Korea) (MM '18). Association for Computing Machinery, New York, NY, USA, 1733--1741. https://doi.org/10.1145/3240508.3240600Google ScholarDigital Library
Kong Man Cheung, Simon Baker, and Takeo Kanade. 2003. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., Vol. 1. I--I.Google ScholarCross Ref
Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. 2015. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (TOG), Vol. 34, 4 (2015), 69.Google ScholarDigital Library
Yan Cui and Didier Stricker. 2011. 3D Shape Scanning with a Kinect. In ACM SIGGRAPH 2011 Posters (Vancouver, British Columbia, Canada) (SIGGRAPH '11). Association for Computing Machinery, New York, NY, USA, Article 57, bibinfonumpages1 pages. https://doi.org/10.1145/2037715.2037780Google ScholarDigital Library
Brian Curless and Marc Levoy. 1996. A Volumetric Method for Building Complex Models from Range Images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '96). ACM, New York, NY, USA, 303--312. https://doi.org/10.1145/237170.237269Google ScholarDigital Library
Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2018. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 224--236.Google ScholarCross Ref
Martin A Fischler and Robert C Bolles. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, Vol. 24, 6 (1981), 381--395.Google ScholarDigital Library
Ke Gong, Xiaodan Liang, Dongyu Zhang, Xiaohui Shen, and Liang Lin. 2017. Look Into Person: Self-Supervised Structure-Sensitive Learning and a New Benchmark for Human Parsing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. 2020. ARCH: Animatable Reconstruction of Clothed Humans. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Google Scholar
Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. 2015. Panoptic Studio: A Massively Multiview System for Social Motion Capture. In Proceedings of the IEEE International Conference on Computer Vision. 3334--3342.Google ScholarDigital Library
Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2018. Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. 2018. End-to-end Recovery of Human Shape and Pose. In Computer Vision and Pattern Regognition (CVPR).Google Scholar
Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. 2019. Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).Google ScholarCross Ref
Verica Lazova, Eldar Insafutdinov, and Gerard Pons-Moll. 2019. 360-Degree Textures of People in Clothing from a Single Image. In International Conference on 3D Vision (3DV).Google ScholarCross Ref
Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. 2019. Contextdesc: Local descriptor augmentation with cross-modality context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2527--2536.Google ScholarCross Ref
Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. 2018. Learning to find good correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2666--2674.Google ScholarCross Ref
Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo Morishima. 2019. SiCloPe: Silhouette-Based Clothed People. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. 2015. DynamicFusion: Reconstruction and Tracking of Non-Rigid Scenes in Real-Time. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google Scholar
Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. 2011. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In Proc. of ISMAR. 127--136.Google ScholarDigital Library
David Nistér. 2004. An efficient solution to the five-point relative pose problem. IEEE transactions on pattern analysis and machine intelligence, Vol. 26, 6 (2004), 0756--777.Google ScholarDigital Library
PhotoScan [n.d.]. PhotoScan. http://www.agisoft.com/. Accessed: 2020-05-23.Google Scholar
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 652--660.Google Scholar
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. Pointnet: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems. 5099--5108.Google Scholar
Rahul Raguram, Ondrej Chum, Marc Pollefeys, Jiri Matas, and Jan-Michael Frahm. 2012. USAC: a universal framework for random sample consensus. IEEE transactions on pattern analysis and machine intelligence, Vol. 35, 8 (2012), 2022--2038.Google Scholar
Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. 2019. PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization. In The IEEE International Conference on Computer Vision (ICCV).Google Scholar
Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. 2020. PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4104--4113.Google ScholarCross Ref
Vincent Sitzmann, Michael Zollhoefer, and Gordon Wetzstein. 2019. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dtextquotesingle Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 1121--1132.Google Scholar
Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred Neural Rendering: Image Synthesis Using Neural Textures. ACM Trans. Graph., Vol. 38, 4, Article 66 (July 2019), bibinfonumpages12 pages. https://doi.org/10.1145/3306346.3323035Google ScholarDigital Library
Justus Thies, Michael Zollhöfer, Christian Theobalt, Marc Stamminger, and Matthias Nießner. 2018. IGNOR: Image-guided neural object rendering. arXiv preprint arXiv:1811.10720 (2018).Google Scholar
Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Commun. ACM, Vol. 59, 2 (2016), 64--73.Google ScholarDigital Library
Jing Tong, Jin Zhou, Ligang Liu, Zhigeng Pan, and Hao Yan. 2012. Scanning 3D Full Human Bodies Using Kinects. IEEE Transactions on Visualization and Computer Graphics, Vol. 18, 4 (2012), 643--650.Google ScholarDigital Library
Treedy's [n.d.]. Treedy's. https://www.treedys.com/. Accessed: 2019-07-25.Google Scholar
Kangkan Wang, Guofeng Zhang, and Shihong Xia. 2017. Templateless Non-Rigid Reconstruction and Motion Tracking With a Single RGB-D Camera. IEEE Transactions on Image Processing, Vol. 26, 12 (Dec 2017), 5966--5979. https://doi.org/10.1109/TIP.2017.2740624Google ScholarDigital Library
Jianxiong Xiao, Andrew Owens, and Antonio Torralba. 2013. Sun3d: A database of big spaces reconstructed using sfm and object labels. In Proceedings of the IEEE International Conference on Computer Vision. 1625--1632.Google ScholarDigital Library
Lan Xu, Wei Cheng, Kaiwen Guo, Lei Han, Yebin Liu, and Lu Fang. 2019 a. FlyFusion: Realtime Dynamic Scene Reconstruction Using a Flying Depth Camera. IEEE Transactions on Visualization and Computer Graphics (2019), 1--1.Google Scholar
Lan Xu, Yebin Liu, Wei Cheng, Kaiwen Guo, Guyue Zhou, Qionghai Dai, and Lu Fang. 2017. FlyCap: Markerless Motion Capture Using Multiple Autonomous Flying Cameras. IEEE Transactions on Visualization and Computer Graphics, Vol. PP, 99 (2017), 1--1. https://doi.org/10.1109/TVCG.2017.2728660Google Scholar
Lan Xu, Zhuo Su, Lei Han, Tao Yu, Yebin Liu, and Lu Fang. 2019 b. UnstructuredFusion: Realtime 4D Geometry and Texture Reconstruction using CommercialRGBD Cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019), 1--1.Google Scholar
Yuanlu Xu, Song-Chun Zhu, and Tony Tung. 2019 c. DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).Google ScholarCross Ref
Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. 2016. Lift: Learned invariant feature transform. In European Conference on Computer Vision. Springer, 467--483.Google ScholarCross Ref
Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. 2019. Free-form image inpainting with gated convolution. In Proceedings of the IEEE International Conference on Computer Vision. 4471--4480.Google ScholarCross Ref
Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, and Yebin Liu. 2018. DoubleFusion: Real-Time Capture of Human Performances With Inner Body Shapes From a Single Depth Sensor. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. 2019. Learning Two-View Correspondences and Geometry Using Order-Aware Network. arXiv preprint arXiv:1908.04964 (2019).Google Scholar

Index Terms

Neural3D: Light-weight Neural Portrait Scanning via Context-aware Correspondence Learning
1. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Virtual reality

Recommendations

Deferred neural lighting: free-viewpoint relighting from unstructured photographs

We present deferred neural lighting, a novel method for free-viewpoint relighting from unstructured photographs of a scene captured with handheld devices. Our method leverages a scene-dependent neural rendering network for relighting a rough geometric ...
Read More
Geometry-Aware Single-Image Full-Body Human Relighting
Computer Vision – ECCV 2022
Abstract
Single-image human relighting aims to relight a target human under new lighting conditions by decomposing the input image into albedo, shape and lighting. Although plausible relighting results can be achieved, previous methods suffer from both the ...
Read More
Neural Light Transport for Relighting and View Synthesis

The light transport (LT) of a scene describes how it appears under different lighting conditions from different viewing directions, and complete knowledge of a scene’s LT enables the synthesis of novel views under arbitrary lighting. In this article, we ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
General Chairs:
Chang Wen Chen
Chinese University of Hong Kong, Shenzhen, China
,
Rita Cucchiara
UNIMORE, Italy
,
Xian-Sheng Hua
Alibaba Group, China
,
Program Chairs:
Guo-Jun Qi
Futurewei Technologies, USA
,
Elisa Ricci
UNITN & Fondazione Bruno Kessler, Italy
,
Zhengyou Zhang
Tencent, China
,
Roger Zimmermann
National University of Singapore, Singapore
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 October 2020
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
correspondence estimation
human scaning
neural networks
neural rendering
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate995of4,171submissions,24%
Upcoming Conference
MM '24

Sponsor:

sigmm

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne , VIC , Australia
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 4
  Total Citations
  View Citations
- 394
  Total Downloads
- Downloads (Last 12 months)35
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Neural3D: Light-weight Neural Portrait Scanning via Context-aware Correspondence Learning

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Deferred neural lighting: free-viewpoint relighting from unstructured photographs

Geometry-Aware Single-Image Full-Body Human Relighting

Neural Light Transport for Relighting and View Synthesis