RGB-Fusion: Monocular 3D reconstruction with learned depth prediction
Introduction
Detailed dense 3D reconstruction of indoor scenes has attracted great attention in recent years. Simultaneous Localization and Mapping (SLAM), which addresses navigation and map construction in unknown environments, has proven to be a practical dense 3D reconstruction approach and is widely employed in augmented reality [1]. Because RGB cameras are highly popular on consumer devices, some researchers have focused on monocular SLAM methods [2], [3]. These approaches perform feature matching on consecutive frames, employ stereo matching [4] to recover depth information, and finally reconstruct the scene. However, the uncertainty of the absolute scale limits their application prospects: even when every step succeeds, the ambiguous absolute scale still yields an unacceptable reconstruction. One solution [5] to this scale ambiguity is to pair the monocular camera with an inertial measurement unit (IMU) and estimate the absolute scale of the scene from the IMU measurements. However, when acceleration is negligible, the low-performance IMUs fitted to consumer devices cannot provide usable data for computing the scene scale. Other scholars have proposed employing depth cameras [6], [7] for 3D reconstruction, but most depth cameras have an unsatisfactory detection range and are sensitive to lighting conditions, making reconstruction less precise under uneven illumination.
Meanwhile, deep learning has caused quite a stir in 3D reconstruction. After training, a neural network can reconstruct 3D objects from a single image [8], [9], a stereo pair [10], [11], or a collection of images [12], [13]. Furthermore, deep learning recovers the absolute scale of the scene from images alone, without relying on auxiliary information. However, because the spatial relationships involved are relatively complex, end-to-end networks struggle to match the accuracy of geometry-based 3D reconstruction methods.
After analyzing the characteristics of deep learning and geometry-based approaches, some scholars have suggested integrating the two to obtain satisfactory results. As shown in [14], end-to-end 3D reconstruction systems based on deep learning accurately reconstruct the interior of a scene but tend to produce blurred edges. In contrast, pipelines based on multi-view geometry are highly accurate at scene edges rich in corners but fail to reconstruct interiors with insufficient texture. Integrating deep learning with multi-view geometry is therefore an appealing idea, but how to combine the two efficiently remains worth exploring. Some methods [12], [15] feed the structural model produced by a geometry-based 3D reconstruction pipeline into a neural network, which then optimizes the model. One limitation of these methods is that whenever the system processes new geometric information, the geometric model must be re-fed to the network for expensive computation, which consumes considerable time.
This paper proposes RGB-Fusion, a new monocular surface reconstruction system that supports large-scale, high-quality reconstruction. Fig. 1 shows an example of our reconstruction results on the fr3/long_office_household sequence of the TUM RGB-D dataset [16]. RGB-Fusion leverages the state-of-the-art DeepV2D algorithm [17] to predict depth and completes the 3D reconstruction through dense SLAM methods. We validated our approach against many strong monocular 3D reconstruction studies on three public datasets, focusing on tracking, depth prediction, and reconstruction accuracy. The results demonstrate that RGB-Fusion achieves an average surface reconstruction accuracy of 0.105 m, an average depth prediction accuracy of 31.255%, and an absolute trajectory error (ATE) root-mean-square error (RMSE) of 0.199 m [16]. Overall, our proposed pipeline outperforms most existing monocular 3D reconstruction systems by more than 10%.
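The ATE RMSE quoted above follows the TUM RGB-D benchmark [16]: the estimated trajectory is rigidly aligned to the ground truth (Horn's closed-form method), and the RMSE of the translational residuals is reported. A minimal sketch of that metric, with function names of our own choosing rather than from the paper:

```python
import numpy as np

def align_rigid(est, gt):
    """Closed-form rigid alignment (Horn/Kabsch) of est (Nx3) onto gt (Nx3)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """RMSE of translational error after rigid alignment of the trajectories."""
    R, t = align_rigid(est, gt)
    err = gt - (est @ R.T + t)
    return np.sqrt((err ** 2).sum(axis=1).mean())
```

The alignment step is what makes the metric meaningful for monocular systems: without it, an arbitrary choice of world frame would dominate the error.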
The key contributions are summarized as follows:
- (1)
We integrate dense SLAM and deep learning into the new monocular 3D reconstruction pipeline, which can perform high-quality 3D reconstruction of large-scale scenes using only RGB images.
- (2)
We jointly optimize the ICP algorithm [18] and the PnP algorithm [19] to estimate the camera pose, and we use the preliminary pose predicted by the depth module to initialize the least-squares solver so that it avoids becoming trapped in local minima.
- (3)
We propose a depth refinement strategy that uses the uncertainty of the depth map to improve the accuracy of depth prediction.
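Contribution (3) leaves the exact refinement rule to Section 3; one common realization of uncertainty-driven refinement is per-pixel inverse-variance (Kalman-style) fusion of successive depth estimates. The sketch below is our illustrative reading, not the paper's formulation, and all names are ours:

```python
import numpy as np

def refine_depth(depth_prev, var_prev, depth_new, var_new):
    """Inverse-variance fusion of two per-pixel depth maps.

    Each estimate is weighted by the reciprocal of its variance, so the
    more certain measurement dominates; the fused variance shrinks.
    """
    w_prev = 1.0 / var_prev
    w_new = 1.0 / var_new
    fused = (w_prev * depth_prev + w_new * depth_new) / (w_prev + w_new)
    fused_var = 1.0 / (w_prev + w_new)
    # Fall back to the valid estimate where the other is missing (<= 0)
    fused = np.where(depth_prev <= 0, depth_new, fused)
    fused = np.where(depth_new <= 0, depth_prev, fused)
    return fused, fused_var
```

For example, fusing a depth of 2 m and 4 m with equal variance 1 yields 3 m with variance 0.5, matching the standard product-of-Gaussians update.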
This paper is organized as follows: Section 2 reviews related work. Section 3 elaborates on the techniques and key strategies of our proposed monocular 3D reconstruction framework. Section 4 presents the experiments and results on public datasets. Section 5 concludes and outlines the focus of future research.
Section snippets
Related works
This section reviews related work in two fields: single-view depth estimation and monocular 3D reconstruction systems with deep learning.
Methodology
In this section, we illustrate the principle of the monocular 3D reconstruction system RGB-Fusion. As is shown in Fig. 2, our architecture contains three indispensable modules: depth prediction, pose estimation, and surface fusion.
- •
Depth prediction: We adopt the network architecture of DeepV2D to obtain the depth map of RGB images and the initial value of pose estimation.
- •
Pose estimation: We combine the ICP algorithm and the PnP algorithm to calculate the camera pose estimation and then employ
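The combined ICP + PnP pose estimation described above can be sketched as a single least-squares problem whose residual stacks a 3D point-to-point (ICP) term and a 2D reprojection (PnP) term, minimized by Gauss-Newton starting from the network-predicted pose. This is our illustrative reading under stated assumptions (axis-angle pose, numerical Jacobian, weight `lam` balancing the two terms), not the paper's exact formulation; all names are ours:

```python
import numpy as np

def rodrigues(w):
    """Axis-angle vector -> 3x3 rotation matrix."""
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3)
    k = w / th
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(th) * K + (1.0 - np.cos(th)) * (K @ K)

def joint_residual(x, pts_src, pts_dst, pts_3d, uv, K_cam, lam):
    """Stacked ICP and PnP residuals for pose x = [axis-angle (3), translation (3)]."""
    R, t = rodrigues(x[:3]), x[3:]
    icp = ((pts_src @ R.T + t) - pts_dst).ravel()          # 3D point-to-point term
    cam = pts_3d @ R.T + t                                 # 3D points in camera frame
    proj = cam @ K_cam.T
    pnp = lam * (proj[:, :2] / proj[:, 2:3] - uv).ravel()  # pixel reprojection term
    return np.concatenate([icp, pnp])

def estimate_pose(x0, *args, iters=20):
    """Gauss-Newton on the joint residual, initialized from the predicted pose x0."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        r = joint_residual(x, *args)
        J = np.empty((r.size, 6))
        eps = 1e-6
        for i in range(6):  # forward-difference Jacobian, column by column
            xp = x.copy()
            xp[i] += eps
            J[:, i] = (joint_residual(xp, *args) - r) / eps
        dx = np.linalg.lstsq(J, -r, rcond=None)[0]
        x += dx
        if np.linalg.norm(dx) < 1e-10:
            break
    return rodrigues(x[:3]), x[3:]
```

Initializing `x0` from the depth module's preliminary pose, as contribution (2) describes, keeps the solver in the basin of the correct minimum; from a poor initialization, Gauss-Newton on this nonconvex residual can stall in a local minimum.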
Evaluation
We comprehensively evaluated the tracking accuracy, depth prediction accuracy, and reconstruction accuracy of RGB-Fusion from qualitative and quantitative perspectives. The evaluation was carried out on a desktop PC with an Intel Core i9-10920X CPU at 3.5 GHz, 32 GB of RAM, and an Nvidia TITAN RTX GPU with 24 GB of VRAM. The depth prediction model was trained on the ScanNet dataset [44] and tested on unseen datasets, fully demonstrating the generalization ability of our system. The
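The paper's exact definition of "depth prediction accuracy" appears in Section 4; monocular depth evaluations conventionally report absolute relative error, RMSE, and the δ < 1.25 inlier ratio. A hedged sketch of those standard metrics (function and key names are ours):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth-evaluation metrics over valid (gt > 0) pixels."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)          # mean absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))         # root-mean-square error (metres)
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)                # fraction of pixels within 25%
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

Masking out invalid ground-truth pixels matters in practice, since depth sensors leave holes that would otherwise corrupt the averages.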
Conclusion
This paper has proposed a monocular 3D reconstruction system with learned depth prediction that reconstructs indoor scenes with high quality, proving that combining deep learning and SLAM is an effective way to overcome the limitations of traditional 3D reconstruction. The proposed joint optimization method combines the ICP and PnP algorithms to solve the pose estimation problem when depth estimation accuracy is limited, and it effectively improves the accuracy and robustness of the
CRediT authorship contribution statement
ZhiMin Duan: Methodology, Software, Investigation, Formal analysis, Writing – original draft. YingWen Chen: Conceptualization, Resources, Funding acquisition, Supervision, Writing – review & editing, Software. HuJie Yu: Writing – review & editing, Visualization. BoWen Hu: Data curation, Visualization. Chen Chen: Validation, Software.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work is supported by the National Key Research and Development Program of China (No. 2018YFB0204301).
References (61)
- Autostereoscopic augmented reality visualization for depth perception in endoscopic surgery, Displays (2017)
- A local stereo matching algorithm based on weighted guided image filtering for improving the generation of depth range images, Displays (2017)
- Review of multi-view 3D object recognition methods based on deep learning, Displays (2021)
- DTAM: Dense tracking and mapping in real-time
- Using collaborative sharing on cloud for fast relocalization in keyframe-based SLAM
- Real-time large-scale dense 3D reconstruction with loop closure
- S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et...
- ElasticFusion: Real-time dense SLAM and light source estimation, Int. J. Robot. Res. (2016)
- H. Fan, H. Su, L.J. Guibas, A point set generation network for 3D object reconstruction from a single image, in:...
- G. Gkioxari, J. Malik, J. Johnson, Mesh R-CNN, in: Proceedings of the IEEE/CVF International Conference on Computer...
- Single-view and multi-view depth fusion, IEEE Robot. Autom. Lett.
- A benchmark for the evaluation of RGB-D SLAM systems
- DeepV2D: Video to depth with differentiable structure from motion
- A method for registration of 3-D shapes
- EPnP: An accurate O(n) solution to the PnP problem, Int. J. Comput. Vis.
- Depth map prediction from a single image using a multi-scale deep network, Adv. Neural Inf. Process. Syst.
- Learning depth from single monocular images using deep convolutional neural fields, IEEE Trans. Pattern Anal. Mach. Intell.
- Deeper depth prediction with fully convolutional residual networks
- Deep ordinal regression network for monocular depth estimation
- Single-image depth perception in the wild, Adv. Neural Inf. Process. Syst.
- Unsupervised learning of depth and pose estimation based on continuous frame window
- Dense monocular reconstruction using surface normals
1. These authors have contributed equally to this work.