3D Reconstruction for Multi-view Objects

https://doi.org/10.1016/j.compeleceng.2022.108567

Abstract

Deep learning-based 3D reconstruction networks have achieved good performance in generating 3D features from 2D features, but they often suffer from feature loss during reconstruction. In this paper we propose a multi-view 3D object reconstruction network named P2VNet. Depth estimation modules between the front and back layers of P2VNet realize a smooth transformation from 2D features to 3D features, which improves the performance of single-view reconstruction. We also propose a multi-scale context-aware fusion module for multi-view fusion, in which additional receptive fields generate richer context-aware features. Furthermore, we replace binary cross-entropy with a 3D focal loss to address the unbalanced occupancy of the voxel grid and the difficulty of reconstructing some grid cells. Our experimental results demonstrate that P2VNet achieves higher accuracy than existing works.

Introduction

The main objective of 3D reconstruction is to accurately recover the three-dimensional structure of objects from their two-dimensional features. The results can be used in robotics, virtual reality, computer-aided design (CAD) [[1], [2]], unmanned aerial vehicles (UAVs) [3], smart city construction [[4], [5]], etc. When matching image features across views, traditional methods such as structure from motion (SFM) [6] and simultaneous localization and mapping (SLAM) [7] find it difficult to establish accurate feature correspondences [8]. These methods match features from 2D images captured from different views and then use triangulation to recover the 3D coordinates of image pixels. However, they require multiple images of an object captured with a well-calibrated camera, which is often impractical. To address this problem, with the advancement of machine learning and artificial intelligence, scholars have in recent years proposed 3D reconstruction networks based on deep learning [9], [10], [11], [12]. Wu et al. [9] proposed 3D ShapeNets, which represents a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid and processes it with a deep convolutional network. Li et al. [10] proposed a weakly supervised method that learns the 3D shape distribution of a class of objects from unoccluded silhouette images. The key to this method is a new multi-projection Generative Adversarial Network (GAN) formulation, which learns a high-dimensional distribution (over voxel grids) from multiple easier-to-obtain low-dimensional training signals. Liu et al. [11] proposed the Variational Shape Learner (VSL), which learns the underlying structure of voxelized 3D shapes without supervision; by using skip connections, VSL can learn and infer a latent hierarchical representation of objects. Choy et al. [12] proposed 3D-R2N2, which accepts the same group of images in different orders but suffers from long-term memory loss. Fan et al. [13] proposed PSGN to reconstruct 3D shapes from a single image; however, it cannot reliably reconstruct a complete high-quality shape from a single image, which makes the results less accurate. Meanwhile, because the LSTM [14] used in 3D-R2N2 has many network parameters, that model is quite time-consuming.

To tackle these limitations, Xie et al. [15] proposed Pix2Vox. This approach eliminates the influence of the order of input images by using multiple parallel encoder-decoder blocks, each of which predicts a coarse voxel grid from its input view. A context-aware fusion module then selects high-quality features from the coarse 3D voxels and fuses them to generate a refined 3D voxel grid. However, in the process of generating 3D voxels from 2D images, this design does not directly model the end-to-end mapping between 2D and 3D features. Moreover, because the context-aware fusion module uses only a 3 × 3 receptive field, object edges are degraded after single-view reconstruction, resulting in poor edges after multi-view fusion. To solve these problems, in this paper we propose P2VNet, a multi-view 3D object reconstruction network based on a deep neural network. Multiple front-and-back-layer depth estimation modules are added to the encoder-decoder to generate corresponding 3D features from 2D features at multiple scales. In the multi-view feature fusion network, P2VNet uses multi-scale receptive fields to achieve higher-quality 3D feature reconstruction. Using a 3D focal loss [16] as the loss function (a minimal sketch is given after the contribution list below), we also address the problems that the proportion of the target object in the voxel grid is unbalanced and that some grid cells are difficult to reconstruct. Our main contributions can be summarized as follows:

  • (1)

    We propose a novel multi-view 3D reconstruction network based on a deep neural network. In the encoder-decoder, multiple front-and-back-layer depth estimation modules are added to realize single-view 3D reconstruction from low resolution to high resolution.

  • (2)

    We propose a multi-scale context-aware fusion module. When object edges are difficult to reconstruct, 3D voxels with higher scores are selected from features fused across different receptive fields, improving the final fusion result.

  • (3)

    Experimental results on the ShapeNet dataset show that the average reconstruction accuracy of P2VNet is 68.2%, which is 9.5% higher than 3D-R2N2 and 1.5% higher than Pix2Vox. The network also shows strong generalization ability when reconstructing 3D objects from unseen categories.
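As a concrete reference for the loss mentioned above, the following is a minimal PyTorch sketch of a voxel-wise focal loss; the class name, the default alpha and gamma values, and the tensor shapes are our own assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class VoxelFocalLoss(nn.Module):
    """Hypothetical 3D focal loss over a voxel occupancy grid.

    Down-weights easy, well-classified voxels so that the sparse occupied
    voxels and hard-to-reconstruct cells dominate the gradient. The alpha
    and gamma defaults are illustrative assumptions.
    """

    def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # pred:   predicted occupancy probabilities in (0, 1), shape (B, D, H, W)
        # target: ground-truth occupancy in {0, 1}, same shape
        eps = 1e-7
        pred = pred.clamp(eps, 1.0 - eps)
        # probability the model assigns to the true class of each voxel
        p_t = target * pred + (1.0 - target) * (1.0 - pred)
        alpha_t = target * self.alpha + (1.0 - target) * (1.0 - self.alpha)
        loss = -alpha_t * (1.0 - p_t) ** self.gamma * torch.log(p_t)
        return loss.mean()
```

In this sketch, `VoxelFocalLoss()(predicted_volume, gt_volume)` would simply take the place of a binary cross-entropy call on the same tensors.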

Section snippets

Related work

A 3D model can be represented as voxels, a depth map, a point cloud, a polygon mesh, or other data types [[17], [18]]. Accordingly, deep learning-based 3D reconstruction methods can be divided into the following three categories.

  • (1)

    Point cloud reconstruction based on deep learning. In 2018, Yao et al. [19] proposed MVSNet, a depth estimation network based on multi-view images. MVSNet designed a variance-based multi-view matching cost criterion (sketched below), which
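The following is a minimal sketch of the variance-based multi-view matching cost mentioned above: per-view feature volumes are aggregated into one cost volume by taking their variance, so that low variance indicates agreement among views at a depth hypothesis. The function name and tensor layout are illustrative assumptions.

```python
import torch


def variance_cost_volume(feature_volumes: torch.Tensor) -> torch.Tensor:
    """Aggregate per-view feature volumes into one cost volume via variance.

    feature_volumes: (N_views, C, D, H, W) warped feature volumes, one per
    input view. Returns a (C, D, H, W) cost volume in which a low value
    means the views agree at that depth hypothesis.
    """
    mean = feature_volumes.mean(dim=0, keepdim=True)
    return ((feature_volumes - mean) ** 2).mean(dim=0)
```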

P2VNet Network

Our proposed P2VNet generates 3D voxels of objects from multiple images with different views using an encoder-decoder network. The encoding network extracts basic 3D features from the images, and the decoding network then turns these features into object voxels.
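To make this pipeline concrete, the sketch below shows one possible single-view branch in PyTorch: a 2D encoder, a lifting stage standing in for the depth estimation module that turns 2D feature maps into a 3D feature volume, and a 3D decoder producing an occupancy grid. All layer sizes, channel counts, and the 32³ output resolution are assumptions made for the sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SingleViewBranch(nn.Module):
    """Illustrative single-view branch: 2D encoder -> 2D-to-3D lifting -> 3D decoder."""

    def __init__(self):
        super().__init__()
        # 2D encoder: image (3, 224, 224) -> feature map (256, 28, 28)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        # lifting stage: reshape 2D channels into a coarse 3D feature volume
        self.lift = nn.Conv2d(256, 8 * 32, 1)  # 8 feature channels x 32 depth slices
        # 3D decoder: upsample the coarse volume to a 32^3 occupancy grid
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat2d = self.encoder(image)            # (B, 256, 28, 28)
        lifted = self.lift(feat2d)              # (B, 256, 28, 28)
        b, _, h, w = lifted.shape
        vol = lifted.view(b, 8, 32, h, w)       # (B, 8, 32, 28, 28)
        # resize spatial dims so the decoder outputs a cubic grid
        vol = torch.nn.functional.interpolate(vol, size=(16, 16, 16))
        return self.decoder(vol)                # (B, 1, 32, 32, 32) occupancy
```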

Fig. 1 shows an encoding network, a depth estimation network, a decoding network, a multi-view fusion network, and a verification network. First, the encoding network extracts the two-dimensional features
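The multi-view fusion network named above can be sketched as follows: per-view coarse volumes are scored with parallel 3D convolutions of different kernel sizes (the multi-scale receptive fields), the scores are softmax-normalized across views, and the volumes are blended by those weights. The kernel sizes, channel counts, and module name are illustrative assumptions rather than the paper's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleContextFusion(nn.Module):
    """Illustrative multi-scale, context-aware fusion of per-view coarse volumes."""

    def __init__(self, channels: int = 9):
        super().__init__()
        # parallel branches with different receptive fields (3x3x3 and 5x5x5)
        self.branch3 = nn.Conv3d(1, channels, 3, padding=1)
        self.branch5 = nn.Conv3d(1, channels, 5, padding=2)
        self.score = nn.Conv3d(2 * channels, 1, 3, padding=1)

    def forward(self, coarse_volumes: torch.Tensor) -> torch.Tensor:
        # coarse_volumes: (B, N_views, D, H, W) per-view occupancy predictions
        b, n, d, h, w = coarse_volumes.shape
        v = coarse_volumes.reshape(b * n, 1, d, h, w)
        context = torch.cat([self.branch3(v), self.branch5(v)], dim=1)
        scores = self.score(context).reshape(b, n, d, h, w)
        weights = F.softmax(scores, dim=1)           # normalize across views
        fused = (weights * coarse_volumes).sum(dim=1)
        return fused                                 # (B, D, H, W) fused volume
```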

Dataset Preparation

We use a subset of the ShapeNet dataset, which includes 13 categories of objects and a total of 50,000 3D models. Fig. 7 shows all view combinations for a model in the watercraft category; Fig. 8(a) shows the corresponding front view of the 3D voxel model, and Fig. 8(b) shows the open side view of the voxel model.

Before the experiments, the dataset is randomly divided into two parts: 80% is used for training and the remaining 20% for testing. In model training, the training set will
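A minimal sketch of the random 80/20 split described above, assuming the per-model voxel files have already been collected into a list; the function name, ratio argument, and fixed seed are illustrative.

```python
import random


def split_dataset(model_paths, train_ratio=0.8, seed=42):
    """Randomly split the ShapeNet models into training and test subsets."""
    paths = list(model_paths)
    random.Random(seed).shuffle(paths)   # deterministic shuffle for reproducibility
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]      # (training set, test set)
```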

Conclusion

In this paper, we propose P2VNet, a deep learning-based network that generates three-dimensional voxels from two-dimensional images. P2VNet adds several depth estimation modules to the encoding and decoding networks and introduces a multi-scale context-aware fusion network, which estimates more accurate 3D voxels. Experimental results show that the average reconstruction accuracy of P2VNet is 68.2%. It was 9.5% more accurate on average than 3D-R2N2 and 1.5% more

Author statement

Jun Yu: Data curation, Programming, Writing - review & editing. Wenbin Yin: Conceptualization, Methodology, Writing. Zhiyi Hu: Supervision, Investigation. Yabin Liu: Validation, Editing. All persons who have made substantial contributions to the work reported in the manuscript have given their written permission to be named.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (29)

  • Z. Ma et al., A review of 3D reconstruction techniques in civil engineering and their applications, Advanced Engineering Informatics (2018)
  • A. Graves et al., Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks (2005)
  • H. Lu et al., Brain intelligence: go beyond artificial intelligence, Mobile Networks and Applications (2018)
  • M.M. Nasralla et al., Computer vision and deep learning-enabled UAVs: proposed use cases for visually impaired people in a smart city
  • M.A. Khan et al., Swarm of UAVs for network management in 6G: a technical review, IEEE Transactions on Network and Service Management (2022)
  • M.M. Nasralla et al., MASEMUL: a simulation tool for movement-aware MANET scheduling strategies for multimedia communications, Wireless Communications and Mobile Computing (2021)
  • O. Özyeşil et al., A survey of structure from motion, Acta Numerica (2017)
  • J. Fuentes-Pacheco et al., Visual simultaneous localization and mapping: a survey, Artificial Intelligence Review (2015)
  • B. Yang et al., Dense 3D object reconstruction from a single depth view, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
  • Z. Wu et al., 3D ShapeNets: a deep representation for volumetric shapes
  • X. Li et al., Synthesizing 3D shapes from silhouette image collections using multi-projection generative adversarial networks
  • S. Liu et al., Learning a hierarchical latent-variable model of 3D shapes
  • C.B. Choy et al., 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction
  • H. Fan et al., A point set generation network for 3D object reconstruction from a single image
