RGB-Fusion: Monocular 3D reconstruction with learned depth prediction
Introduction
Detailed dense 3D reconstruction of indoor scenes has attracted great attention in recent years. Simultaneous Localization and Mapping (SLAM), which addresses navigation and map construction in unknown environments, has proven to be a practical dense 3D reconstruction approach and is widely employed in augmented reality [1]. Because RGB cameras are highly popular on consumer devices, some researchers have focused on monocular SLAM methods [2], [3]. These approaches perform feature matching on consecutive frames, employ stereo matching [4] to recover depth information, and finally reconstruct the scene. However, the uncertainty of the absolute scale limits their application prospects: even when every step succeeds, the ambiguous absolute scale still yields an unacceptable reconstruction. One solution [5] to this scale ambiguity is to pair the monocular camera with an inertial measurement unit (IMU) and estimate the absolute scale of the scene from the IMU measurements. However, when acceleration is negligible, the low-performance IMUs fitted to consumer devices cannot provide usable data for computing the scene scale. Other scholars have proposed employing depth cameras [6], [7] for 3D reconstruction, but most depth cameras have an unsatisfactory detection range and are sensitive to lighting conditions, making reconstruction less precise under uneven illumination.
Meanwhile, deep learning has caused quite a stir in 3D reconstruction. After training, a neural network can reconstruct 3D objects from a single image [8], [9], a stereo pair [10], [11], or a collection of images [12], [13]. Furthermore, deep learning recovers the absolute scale of the scene from images alone, without relying on auxiliary information. However, because the spatial relationships involved are relatively complex, end-to-end networks struggle to match the accuracy of geometry-based 3D reconstruction methods.
After analyzing the characteristics of deep learning and geometry-based approaches, some scholars have suggested integrating the two to obtain satisfactory results. As shown in [14], end-to-end 3D reconstruction systems based on deep learning accurately reconstruct the interior of a scene but tend to produce blurred edges. In contrast, pipelines based on multi-view geometry are highly accurate at scene edges rich in corners but fail to reconstruct interiors with insufficient texture. Integrating deep learning with multi-view geometry is therefore an appealing idea, but how to combine the two efficiently remains worth exploring. Some methods [12], [15] feed the structural model produced by a geometry-based 3D reconstruction pipeline into a neural network, which then optimizes the model. One limitation of these methods is that whenever the system processes new geometric information, the geometric model must be re-fed to the network for expensive computation, which consumes considerable time.
This paper proposes RGB-Fusion, a new monocular surface reconstruction system that supports large-scale, high-quality reconstruction. Fig. 1 shows an example of our reconstruction results on the fr3/long_office_household sequence of the TUM RGB-D dataset [16]. RGB-Fusion leverages the state-of-the-art DeepV2D algorithm [17] to predict depth and completes the 3D reconstruction through dense SLAM methods. We validated our approach against many strong monocular 3D reconstruction studies on three public datasets, focusing on tracking, depth prediction, and reconstruction accuracy. The results demonstrate that RGB-Fusion achieves an average surface reconstruction accuracy of 0.105 m, an average depth prediction accuracy of 31.255%, and an absolute trajectory error (ATE) root-mean-square error (RMSE) of 0.199 m [16]. Overall, our proposed pipeline outperforms most existing monocular 3D reconstruction systems by more than 10%.
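The ATE RMSE quoted above follows the TUM RGB-D benchmark [16]: the estimated trajectory is rigidly aligned to the ground truth (Horn's closed-form method), and the RMSE of the translational residuals is reported. A minimal sketch of that metric, with function names of our own choosing rather than from the paper:

```python
import numpy as np

def align_rigid(est, gt):
    """Closed-form rigid alignment (Horn/Kabsch) of est (Nx3) onto gt (Nx3)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """RMSE of translational error after rigid alignment of the trajectories."""
    R, t = align_rigid(est, gt)
    err = gt - (est @ R.T + t)
    return np.sqrt((err ** 2).sum(axis=1).mean())
```

The alignment step is what makes the metric meaningful for monocular systems: without it, an arbitrary choice of world frame would dominate the error.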
The key contributions are summarized as follows:
- (1)
We integrate dense SLAM and deep learning into the new monocular 3D reconstruction pipeline, which can perform high-quality 3D reconstruction of large-scale scenes using only RGB images.
- (2)
We jointly optimize the ICP algorithm [18] and the PnP algorithm [19] to estimate the camera pose, and we use the preliminary pose predicted by the depth module to initialize the least-squares solver so that it avoids becoming trapped in local minima.
- (3)
We propose a depth refinement strategy that uses the uncertainty of the depth map to improve the accuracy of depth prediction.
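Contribution (3) leaves the exact refinement rule to Section 3; one common realization of uncertainty-driven refinement is per-pixel inverse-variance (Kalman-style) fusion of successive depth estimates. The sketch below is our illustrative reading, not the paper's formulation, and all names are ours:

```python
import numpy as np

def refine_depth(depth_prev, var_prev, depth_new, var_new):
    """Inverse-variance fusion of two per-pixel depth maps.

    Each estimate is weighted by the reciprocal of its variance, so the
    more certain measurement dominates; the fused variance shrinks.
    """
    w_prev = 1.0 / var_prev
    w_new = 1.0 / var_new
    fused = (w_prev * depth_prev + w_new * depth_new) / (w_prev + w_new)
    fused_var = 1.0 / (w_prev + w_new)
    # Fall back to the valid estimate where the other is missing (<= 0)
    fused = np.where(depth_prev <= 0, depth_new, fused)
    fused = np.where(depth_new <= 0, depth_prev, fused)
    return fused, fused_var
```

For example, fusing a depth of 2 m and 4 m with equal variance 1 yields 3 m with variance 0.5, matching the standard product-of-Gaussians update.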
This paper is organized as follows: Section 2 reviews related work. Section 3 elaborates on the techniques and key strategies of our proposed monocular 3D reconstruction framework. Section 4 presents the experiments and results on public datasets. Section 5 concludes and outlines the focus of future research.
Section snippets
Related works
This section reviews related work in two fields: single-view depth estimation and monocular 3D reconstruction systems with deep learning.
Methodology
In this section, we illustrate the principle of the monocular 3D reconstruction system RGB-Fusion. As is shown in Fig. 2, our architecture contains three indispensable modules: depth prediction, pose estimation, and surface fusion.
- •
Depth prediction: We adopt the network architecture of DeepV2D to obtain the depth map of RGB images and the initial value of pose estimation.
- •
Pose estimation: We combine the ICP algorithm and the PnP algorithm to calculate the camera pose estimation and then employ
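The combined ICP + PnP pose estimation described above can be sketched as a single least-squares problem whose residual stacks a 3D point-to-point (ICP) term and a 2D reprojection (PnP) term, minimized by Gauss-Newton starting from the network-predicted pose. This is our illustrative reading under stated assumptions (axis-angle pose, numerical Jacobian, weight `lam` balancing the two terms), not the paper's exact formulation; all names are ours:

```python
import numpy as np

def rodrigues(w):
    """Axis-angle vector -> 3x3 rotation matrix."""
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3)
    k = w / th
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(th) * K + (1.0 - np.cos(th)) * (K @ K)

def joint_residual(x, pts_src, pts_dst, pts_3d, uv, K_cam, lam):
    """Stacked ICP and PnP residuals for pose x = [axis-angle (3), translation (3)]."""
    R, t = rodrigues(x[:3]), x[3:]
    icp = ((pts_src @ R.T + t) - pts_dst).ravel()          # 3D point-to-point term
    cam = pts_3d @ R.T + t                                 # 3D points in camera frame
    proj = cam @ K_cam.T
    pnp = lam * (proj[:, :2] / proj[:, 2:3] - uv).ravel()  # pixel reprojection term
    return np.concatenate([icp, pnp])

def estimate_pose(x0, *args, iters=20):
    """Gauss-Newton on the joint residual, initialized from the predicted pose x0."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        r = joint_residual(x, *args)
        J = np.empty((r.size, 6))
        eps = 1e-6
        for i in range(6):  # forward-difference Jacobian, column by column
            xp = x.copy()
            xp[i] += eps
            J[:, i] = (joint_residual(xp, *args) - r) / eps
        dx = np.linalg.lstsq(J, -r, rcond=None)[0]
        x += dx
        if np.linalg.norm(dx) < 1e-10:
            break
    return rodrigues(x[:3]), x[3:]
```

Initializing `x0` from the depth module's preliminary pose, as contribution (2) describes, keeps the solver in the basin of the correct minimum; from a poor initialization, Gauss-Newton on this nonconvex residual can stall in a local minimum.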
Evaluation
We comprehensively evaluated the tracking accuracy, depth prediction accuracy, and reconstruction accuracy of RGB-Fusion from qualitative and quantitative perspectives. The evaluation was carried out on a desktop PC with an Intel Core i9-10920X CPU at 3.5 GHz, 32 GB of RAM, and an Nvidia TITAN RTX GPU with 24 GB of VRAM. The depth prediction model was trained on the ScanNet dataset [44] and tested on unseen datasets, fully demonstrating the generalization ability of our system. The
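The paper's exact definition of "depth prediction accuracy" appears in Section 4; monocular depth evaluations conventionally report absolute relative error, RMSE, and the δ < 1.25 inlier ratio. A hedged sketch of those standard metrics (function and key names are ours):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth-evaluation metrics over valid (gt > 0) pixels."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)          # mean absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))         # root-mean-square error (metres)
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)                # fraction of pixels within 25%
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

Masking out invalid ground-truth pixels matters in practice, since depth sensors leave holes that would otherwise corrupt the averages.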
Conclusion
This paper has proposed a monocular 3D reconstruction system with learned depth prediction that reconstructs indoor scenes with high quality, proving that combining deep learning and SLAM is an effective way to overcome the limitations of traditional 3D reconstruction. The proposed joint optimization method combines the ICP and PnP algorithms to solve the pose estimation problem when depth estimation accuracy is limited, and it effectively improves the accuracy and robustness of the
CRediT authorship contribution statement
ZhiMin Duan: Methodology, Software, Investigation, Formal analysis, Writing – original draft. YingWen Chen: Conceptualization, Resources, Funding acquisition, Supervision, Writing – review & editing, Software. HuJie Yu: Writing – review & editing, Visualization. BoWen Hu: Data curation, Visualization. Chen Chen: Validation, Software.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work is supported by the National Key Research and Development Program of China (No. 2018YFB0204301).
References (61)
- Autostereoscopic augmented reality visualization for depth perception in endoscopic surgery, Displays (2017)
- A local stereo matching algorithm based on weighted guided image filtering for improving the generation of depth range images, Displays (2017)
- Review of multi-view 3D object recognition methods based on deep learning, Displays (2021)
- DTAM: Dense tracking and mapping in real-time
- Using collaborative sharing on cloud for fast relocalization in keyframe-based SLAM
- Real-time large-scale dense 3D reconstruction with loop closure
- S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et...
- ElasticFusion: Real-time dense SLAM and light source estimation, Int. J. Robot. Res. (2016)
- H. Fan, H. Su, L.J. Guibas, A point set generation network for 3D object reconstruction from a single image, in:...
- G. Gkioxari, J. Malik, J. Johnson, Mesh R-CNN, in: Proceedings of the IEEE/CVF International Conference on Computer...
- Single-view and multi-view depth fusion, IEEE Robot. Autom. Lett.
- A benchmark for the evaluation of RGB-D SLAM systems
- DeepV2D: Video to depth with differentiable structure from motion
- A method for registration of 3-D shapes
- EPnP: An accurate O(n) solution to the PnP problem, Int. J. Comput. Vis.
- Depth map prediction from a single image using a multi-scale deep network, Adv. Neural Inf. Process. Syst.
- Learning depth from single monocular images using deep convolutional neural fields, IEEE Trans. Pattern Anal. Mach. Intell.
- Deeper depth prediction with fully convolutional residual networks
- Deep ordinal regression network for monocular depth estimation
- Single-image depth perception in the wild, Adv. Neural Inf. Process. Syst.
- Unsupervised learning of depth and pose estimation based on continuous frame window
- Dense monocular reconstruction using surface normals
1. These authors have contributed equally to this work.