Displays

Volume 70, December 2021, 102100

RGB-Fusion: Monocular 3D reconstruction with learned depth prediction

https://doi.org/10.1016/j.displa.2021.102100

Highlights

  • Proposed a complete monocular 3D reconstruction pipeline with depth prediction.

  • Integrated the PnP algorithm and ICP algorithm into the pose estimation module.

  • Proposed a depth map refinement strategy based on uncertainty.

  • Achieved high-quality reconstruction results.

Abstract

Generating large-scale, high-quality 3D scene reconstructions from monocular images is an essential technical foundation for augmented reality and robotics. However, inherent shortcomings (e.g., scale ambiguity and dense depth estimation in texture-less areas) make applying monocular 3D reconstruction in real-world practice challenging. In this work, we combine the advantages of deep learning and multi-view geometry to propose RGB-Fusion, which effectively overcomes the inherent limitations of traditional monocular reconstruction. To keep tracking accuracy from being limited by the prediction deficiencies of neural networks, we integrate the PnP (Perspective-n-Point) algorithm into the tracking module. We employ 3D ICP (Iterative Closest Point) matching and 2D feature matching to construct separate error terms and jointly optimize them, reducing the dependence on depth prediction accuracy and improving pose estimation accuracy. The approximate pose predicted by the neural network is employed as the initial optimization value to avoid becoming trapped in local minima. We formulate a depth map refinement strategy based on the uncertainty of the depth values, which naturally yields a refined depth map: low-uncertainty elements can significantly update the current depth value, while high-uncertainty elements are prevented from adversely affecting depth estimation accuracy. Qualitative and quantitative evaluation results for tracking, depth prediction, and 3D reconstruction show that RGB-Fusion outperforms most monocular 3D reconstruction systems.

Introduction

Many researchers have paid great attention to detailed dense 3D reconstruction of indoor scenes in recent years. Simultaneous Localization and Mapping (SLAM), which aims to solve the navigation and map construction problems in unknown environments, has proven to be a practical dense 3D reconstruction approach and is widely employed in augmented reality [1]. Given the high popularity of RGB cameras on consumer devices, some researchers have focused on monocular SLAM methods [2], [3]. These approaches perform feature matching on consecutive adjacent frames, employ stereo matching [4] to recover image depth information, and finally reconstruct the scene. However, the uncertainty of the absolute scale limits the application prospects of these methods: even if all steps are carried out successfully, the ambiguous absolute scale still results in an unacceptable reconstruction. One solution [5] to scale ambiguity is to integrate a monocular camera with an inertial measurement unit (IMU) and estimate the absolute scale of the scene from the IMU measurements. However, when the acceleration is negligible, the low-performance IMUs fitted to consumer devices cannot provide usable data for calculating the scene scale. Other scholars have proposed employing depth cameras [6], [7] for 3D reconstruction, but most depth cameras have an unsatisfactory detection range and are sensitive to lighting conditions, making reconstruction less precise in unevenly lit environments.

Meanwhile, deep learning has caused quite a stir in the field of 3D reconstruction. After training, a neural network can reconstruct 3D objects from a single image [8], [9], a stereo pair [10], [11], or a collection of images [12], [13]. Furthermore, deep learning obtains the absolute scale of the scene from images without relying on other auxiliary information. However, due to the relatively complex spatial relationships involved, end-to-end networks find it difficult to surpass the accuracy of geometry-based 3D reconstruction methods.

After analyzing the characteristics of deep learning and geometry-based approaches, some scholars have suggested integrating them to yield satisfactory results. As shown in [14], end-to-end deep learning 3D reconstruction systems can accurately reconstruct the interior points of a scene but often suffer from blurred edges. In contrast, pipelines based on multi-view geometry achieve high accuracy at the edges of scenes with rich corners but fail to reconstruct the interior of scenes with insufficient texture. Integrating deep learning with multi-view geometry is therefore an exciting idea, but how to combine them efficiently is still worth exploring. Some methods [12], [15] feed the structural model obtained from a geometry-based 3D reconstruction pipeline into a neural network, which is then used to optimize the model. One limitation of these methods is that whenever the system processes new geometric information, the geometric model must be re-input to the network for complex calculations, which consumes considerable computation time.

This paper proposes RGB-Fusion, a new monocular surface reconstruction system that supports large-scale, high-quality reconstruction. Fig. 1 shows an example of our reconstruction results on the fr3/long_office_household sequence of the TUM RGB-D dataset [16]. RGB-Fusion leverages the state-of-the-art DeepV2D algorithm [17] to predict depth and completes the 3D reconstruction through dense SLAM methods. We validated our approach on three public datasets against many strong monocular 3D reconstruction studies, focusing on tracking, depth prediction, and reconstruction accuracy. The results show that RGB-Fusion achieves an average surface reconstruction accuracy of 0.105 m, an average depth prediction accuracy of 31.255%, and an absolute trajectory error (ATE) root-mean-square error (RMSE) [16] of 0.199 m. Overall, our proposed pipeline outperforms most existing monocular 3D reconstruction systems by more than 10%.
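For context, the ATE RMSE metric [16] first aligns the estimated trajectory to the ground truth with a rigid transform (for monocular systems, a similarity transform that also recovers scale is common) and then takes the root mean square of the translational residuals. The following is a minimal Python sketch under these assumptions, with both trajectories given as time-associated N x 3 NumPy arrays; it illustrates the metric and is not the authors' evaluation code:

    import numpy as np

    def align_rigid(est, gt):
        """Closed-form rigid alignment (Horn/Umeyama) of `est` onto `gt`."""
        mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
        H = (est - mu_e).T @ (gt - mu_g)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T          # rotation mapping est into the gt frame
        t = mu_g - R @ mu_e         # translation
        return R, t

    def ate_rmse(est, gt):
        """Absolute trajectory error: RMS of residuals after alignment."""
        R, t = align_rigid(est, gt)
        residuals = gt - (est @ R.T + t)
        return np.sqrt((residuals ** 2).sum(axis=1).mean())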

The key contributions are summarized as follows:

  • (1)

    We integrate dense SLAM and deep learning into the new monocular 3D reconstruction pipeline, which can perform high-quality 3D reconstruction of large-scale scenes using only RGB images.

  • (2)

    We jointly optimize the error terms of the ICP algorithm [18] and the PnP algorithm [19] to estimate the camera pose, and leverage the preliminary pose predicted by the depth module to initialize the least-squares optimization, avoiding local minima (a minimal sketch follows this list).

  • (3)

    We propose a depth refinement strategy that uses the uncertainty of the depth map to improve the accuracy of depth prediction.
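To make contribution (2) concrete, the sketch below jointly minimizes a 3D point-to-point ICP residual and a 2D reprojection (PnP) residual over one camera pose, starting from the network-predicted pose. The axis-angle parameterization, the simple point-to-point ICP term, and all variable names are illustrative assumptions rather than the authors' exact formulation:

    import numpy as np
    from scipy.optimize import least_squares
    from scipy.spatial.transform import Rotation

    def joint_residuals(pose, src3d, dst3d, obj3d, img2d, K, w_icp=1.0, w_pnp=1.0):
        """Stack the 3D ICP term and the 2D reprojection (PnP) term for one pose.
        pose:        6-vector, axis-angle rotation (first 3) plus translation (last 3)
        src3d/dst3d: matched 3D point pairs for the ICP term
        obj3d/img2d: matched 3D points and 2D image features for the PnP term
        K:           3x3 camera intrinsics"""
        R = Rotation.from_rotvec(pose[:3]).as_matrix()
        t = pose[3:]
        r_icp = (src3d @ R.T + t) - dst3d              # point-to-point ICP error
        cam = (obj3d @ R.T + t) @ K.T                  # project 3D points
        r_pnp = cam[:, :2] / cam[:, 2:3] - img2d       # reprojection error
        return np.hstack([w_icp * r_icp.ravel(), w_pnp * r_pnp.ravel()])

    def estimate_pose(pose_init, *matches):
        # The network-predicted pose initializes the least-squares solver,
        # which helps the optimization avoid poor local minima.
        return least_squares(joint_residuals, pose_init, args=matches,
                             loss='huber').x

Weighting the two residual blocks (w_icp, w_pnp) trades off the 3D geometric term against the 2D photometric/feature term; a robust loss such as Huber limits the influence of outlier matches.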

This paper is organized as follows: Section 2 introduces the development of related work. Section 3 elaborates on the techniques and key strategies of our proposed monocular 3D reconstruction framework. Section 4 presents the experiments and results on public datasets. Finally, Section 5 gives our conclusions and the focus of future research.


Related works

This section reviews related work in two fields: single-view depth estimation and monocular 3D reconstruction systems with deep learning.

Methodology

In this section, we illustrate the principle of the monocular 3D reconstruction system RGB-Fusion. As is shown in Fig. 2, our architecture contains three indispensable modules: depth prediction, pose estimation, and surface fusion.

  • Depth prediction: We adopt the network architecture of DeepV2D to obtain the depth map of RGB images and the initial value of pose estimation.

  • Pose estimation: We combine the ICP algorithm and the PnP algorithm to calculate the camera pose estimation, and then employ the approximate pose predicted by the depth module as the initial value for the joint optimization.
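The depth refinement strategy described in the abstract weights each pixel's update by its uncertainty. Below is a minimal sketch of one common realization, an inverse-variance (Kalman-style) per-pixel fusion with a hard gate on unreliable observations; the threshold and the exact update rule are assumptions, not necessarily the paper's formulation:

    import numpy as np

    def refine_depth(depth, var, depth_obs, var_obs, var_max=0.25):
        """Fuse a new depth observation into the current map, pixel-wise.
        depth, var:         current depth map and its per-pixel variance (H x W)
        depth_obs, var_obs: new prediction warped into the same view, with variance
        var_max:            observations above this variance are ignored"""
        mask = var_obs < var_max               # gate out high-uncertainty pixels
        w = var[mask] + var_obs[mask]
        depth, var = depth.copy(), var.copy()
        # Inverse-variance fusion: low-uncertainty observations move the
        # estimate strongly; high-uncertainty ones barely change it.
        depth[mask] = (var_obs[mask] * depth[mask] + var[mask] * depth_obs[mask]) / w
        var[mask] = var[mask] * var_obs[mask] / w
        return depth, var

Low-variance observations dominate the weighted average, while high-variance ones are down-weighted or masked out entirely, matching the behavior described in the abstract.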

Evaluation

We comprehensively evaluated the tracking accuracy, depth prediction accuracy, and reconstruction accuracy of RGB-Fusion from both qualitative and quantitative perspectives. The evaluation was carried out on a desktop PC with an Intel Core i9-10920X CPU at 3.5 GHz, 32 GB of RAM, and an Nvidia TITAN RTX GPU with 24 GB of VRAM. The depth prediction model was trained on the ScanNet dataset [44] and tested on unseen datasets, fully demonstrating the generalization ability of our system.

Conclusion

This paper has proposed a monocular 3D reconstruction system with learned depth prediction that reconstructs indoor scenes with high quality, demonstrating that combining deep learning and SLAM is an effective way to overcome the limitations of traditional 3D reconstruction. The proposed joint optimization method combines the ICP and PnP algorithms to solve the pose estimation problem when depth estimation accuracy is limited, effectively improving the accuracy and robustness of pose estimation.

CRediT authorship contribution statement

ZhiMin Duan: Methodology, Software, Investigation, Formal analysis, Writing – original draft. YingWen Chen: Conceptualization, Resources, Funding acquisition, Supervision, Writing – review & editing, Software. HuJie Yu: Writing – review & editing, Visualization. BoWen Hu: Data curation, Visualization. Chen Chen: Validation, Software.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the National Key Research and Development Program of China (No. 2018YFB0204301).

References (61)

  • R. Chen, S. Han, J. Xu, H. Su, Point-based multi-view stereo network, in: Proceedings of the IEEE/CVF International...
  • A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, A. Bry, End-to-end learning of geometry and...
  • H. Zhou, B. Ummenhofer, T. Brox, DeepTAM: Deep tracking and mapping, in: Proceedings of the European Conference on...
  • J.J. Park, P. Florence, J. Straub, R. Newcombe, S. Lovegrove, DeepSDF: Learning continuous signed distance functions...
  • J.M. Facil et al., Single-view and multi-view depth fusion, IEEE Robot. Autom. Lett. (2017)
  • N. Yang, R. Wang, J. Stuckler, D. Cremers, Deep virtual stereo odometry: Leveraging deep depth prediction for monocular...
  • J. Sturm et al., A benchmark for the evaluation of RGB-D SLAM systems
  • Z. Teed et al., DeepV2D: Video to depth with differentiable structure from motion (2020)
  • P.J. Besl et al., Method for registration of 3-D shapes
  • V. Lepetit et al., EPnP: An accurate O(n) solution to the PnP problem, Int. J. Comput. Vis. (2009)
  • D. Eigen et al., Depth map prediction from a single image using a multi-scale deep network, Adv. Neural Inf. Process. Syst. (2014)
  • F. Liu et al., Learning depth from single monocular images using deep convolutional neural fields, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • I. Laina et al., Deeper depth prediction with fully convolutional residual networks
  • H. Fu et al., Deep ordinal regression network for monocular depth estimation
  • C. Godard, O. Mac Aodha, G.J. Brostow, Unsupervised monocular depth estimation with left-right consistency, in:...
  • W. Chen et al., Single-image depth perception in the wild, Adv. Neural Inf. Process. Syst. (2016)
  • S. Shang et al., Unsupervised learning of depth and pose estimation based on continuous frame window
  • X. Chen, Y. Wang, X. Chen, W. Zeng, S2R-DepthNet: Learning a generalizable depth-specific structural representation,...
  • C.S. Weerasekera et al., Dense monocular reconstruction using surface normals
  • M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, A.J. Davison, CodeSLAM—learning a compact, optimisable...
1. These authors have contributed equally to this work.
