3D Reconstruction for Multi-view Objects

https://doi.org/10.1016/j.compeleceng.2022.108567

Abstract

Deep learning-based 3D reconstruction networks have achieved good performance in generating 3D features from 2D features, but they often suffer from feature loss during reconstruction. In this paper we propose a multi-view 3D object reconstruction network named P2VNet. Depth estimation modules between the front and back layers of P2VNet realize a smooth transformation from 2D features to 3D features, which improves the performance of single-view reconstruction. We also propose a multi-scale context-aware fusion module for multi-view fusion, in which additional receptive fields generate richer context-aware features. Furthermore, we replace binary cross-entropy with a 3D focal loss to address the unbalanced occupancy of the voxel grid and the difficulty of reconstructing some grid cells. Our experimental results demonstrate that P2VNet achieves higher accuracy than existing works.

Introduction

The main objective of 3D reconstruction is to accurately recover the three-dimensional structure of objects from their two-dimensional features. The results can be used in robotics, virtual reality, computer-aided design (CAD) [[1], [2]], unmanned aerial vehicles (UAVs) [3], smart city construction [[4], [5]], etc. When matching image features across views, traditional methods such as structure from motion (SFM) [6] and simultaneous localization and mapping (SLAM) [7] find it difficult to establish accurate feature correspondences [8]. These methods match features from 2D images captured from different views and then use triangulation to recover the 3D coordinates of image pixels. However, they require multiple images of an object captured with a well-calibrated camera, which is often impractical. To address this problem, with the advancement of machine learning and artificial intelligence, scholars have in recent years proposed 3D reconstruction networks based on deep learning [9], [10], [11], [12]. Wu et al. [9] proposed 3D ShapeNets, which represents a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid and processes it with a deep convolutional network. Li et al. [10] proposed a weakly supervised method that learns the 3D shape distribution of a class of objects from unoccluded silhouette images. The key to this method is a new multi-projection Generative Adversarial Network (GAN) formulation, which learns a high-dimensional distribution (over voxel grids) from multiple easier-to-obtain low-dimensional training signals. Liu et al. [11] proposed the Variational Shape Learner (VSL), which learns the underlying structure of voxelized 3D shapes without supervision; by using skip connections, VSL can learn and infer a latent hierarchical representation of objects. Choy et al. [12] proposed 3D-R2N2, which accepts the same group of images in different orders but suffers from long-term memory loss. Fan et al. [13] proposed PSGN to reconstruct 3D shapes from a single image; however, it cannot reliably reconstruct a complete high-quality shape from a single image, which makes the results less accurate. Meanwhile, because the LSTM [14] used in 3D-R2N2 has many network parameters, that model is quite time-consuming.

To tackle these limitations, Xie et al. [15] proposed Pix2Vox. This approach eliminates the influence of the order of input images by using multiple parallel encoder-decoder blocks, each of which predicts a coarse voxel grid from its input view. A context-aware fusion module then selects high-quality features from the coarse 3D voxels and fuses them to generate a refined 3D voxel grid. However, in the process of generating 3D voxels from 2D images, this design does not directly model the end-to-end mapping between 2D and 3D features. Moreover, because the context-aware fusion module uses only a 3 × 3 receptive field, object edges are degraded after single-view reconstruction, resulting in poor edges after multi-view fusion. To solve these problems, in this paper we propose P2VNet, a multi-view 3D object reconstruction network based on a deep neural network. Multiple front-and-back-layer depth estimation modules are added to the encoder-decoder to generate corresponding 3D features from 2D features at multiple scales. In the multi-view feature fusion network, P2VNet uses multi-scale receptive fields to achieve higher-quality 3D feature reconstruction. Using a 3D focal loss [16] as the loss function (a minimal sketch is given after the contribution list below), we also address the problems that the proportion of the target object in the voxel grid is unbalanced and that some grid cells are difficult to reconstruct. Our main contributions can be summarized as follows:

  • (1)

    We propose a novel multi-view 3D reconstruction network based on a deep neural network. In the encoder-decoder, multiple front-and-back-layer depth estimation modules are added to realize single-view 3D reconstruction from low resolution to high resolution.

  • (2)

    We propose a multi-scale context-aware fusion module. When object edges are difficult to reconstruct, 3D voxels with higher scores are selected from features fused across different receptive fields, improving the final fusion result.

  • (3)

    Experimental results on the ShapeNet dataset show that the average reconstruction accuracy of P2VNet is 68.2%, which is 9.5% higher than 3D-R2N2 and 1.5% higher than Pix2Vox. The network also shows strong generalization ability when reconstructing 3D objects from unseen categories.
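As a concrete reference for the loss mentioned above, the following is a minimal PyTorch sketch of a voxel-wise focal loss; the class name, the default alpha and gamma values, and the tensor shapes are our own assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class VoxelFocalLoss(nn.Module):
    """Hypothetical 3D focal loss over a voxel occupancy grid.

    Down-weights easy, well-classified voxels so that the sparse occupied
    voxels and hard-to-reconstruct cells dominate the gradient. The alpha
    and gamma defaults are illustrative assumptions.
    """

    def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # pred:   predicted occupancy probabilities in (0, 1), shape (B, D, H, W)
        # target: ground-truth occupancy in {0, 1}, same shape
        eps = 1e-7
        pred = pred.clamp(eps, 1.0 - eps)
        # probability the model assigns to the true class of each voxel
        p_t = target * pred + (1.0 - target) * (1.0 - pred)
        alpha_t = target * self.alpha + (1.0 - target) * (1.0 - self.alpha)
        loss = -alpha_t * (1.0 - p_t) ** self.gamma * torch.log(p_t)
        return loss.mean()
```

In this sketch, `VoxelFocalLoss()(predicted_volume, gt_volume)` would simply take the place of a binary cross-entropy call on the same tensors.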

Section snippets

Related work

A 3D model can be represented as voxels, a depth map, a point cloud, a polygon mesh, or other data types [[17], [18]]. Accordingly, deep learning-based 3D reconstruction methods can be divided into the following three categories.

  • (1)

    Point cloud reconstruction based on deep learning. In 2018, Yao et al. [19] proposed MVSNet, a depth estimation network based on multi-view images. MVSNet designed a variance-based multi-view matching cost criterion (sketched below), which
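The following is a minimal sketch of the variance-based multi-view matching cost mentioned above: per-view feature volumes are aggregated into one cost volume by taking their variance, so that low variance indicates agreement among views at a depth hypothesis. The function name and tensor layout are illustrative assumptions.

```python
import torch


def variance_cost_volume(feature_volumes: torch.Tensor) -> torch.Tensor:
    """Aggregate per-view feature volumes into one cost volume via variance.

    feature_volumes: (N_views, C, D, H, W) warped feature volumes, one per
    input view. Returns a (C, D, H, W) cost volume in which a low value
    means the views agree at that depth hypothesis.
    """
    mean = feature_volumes.mean(dim=0, keepdim=True)
    return ((feature_volumes - mean) ** 2).mean(dim=0)
```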

P2VNet Network

Our proposed P2VNet generates 3D voxels of objects from multiple images with different views using an encoder-decoder network. The encoding network extracts basic 3D features from the images, and the decoding network then turns these features into object voxels.
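To make this pipeline concrete, the sketch below shows one possible single-view branch in PyTorch: a 2D encoder, a lifting stage standing in for the depth estimation module that turns 2D feature maps into a 3D feature volume, and a 3D decoder producing an occupancy grid. All layer sizes, channel counts, and the 32³ output resolution are assumptions made for the sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SingleViewBranch(nn.Module):
    """Illustrative single-view branch: 2D encoder -> 2D-to-3D lifting -> 3D decoder."""

    def __init__(self):
        super().__init__()
        # 2D encoder: image (3, 224, 224) -> feature map (256, 28, 28)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        # lifting stage: reshape 2D channels into a coarse 3D feature volume
        self.lift = nn.Conv2d(256, 8 * 32, 1)  # 8 feature channels x 32 depth slices
        # 3D decoder: upsample the coarse volume to a 32^3 occupancy grid
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat2d = self.encoder(image)            # (B, 256, 28, 28)
        lifted = self.lift(feat2d)              # (B, 256, 28, 28)
        b, _, h, w = lifted.shape
        vol = lifted.view(b, 8, 32, h, w)       # (B, 8, 32, 28, 28)
        # resize spatial dims so the decoder outputs a cubic grid
        vol = torch.nn.functional.interpolate(vol, size=(16, 16, 16))
        return self.decoder(vol)                # (B, 1, 32, 32, 32) occupancy
```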

Fig. 1 shows an encoding network, a depth estimation network, a decoding network, a multi-view fusion network, and a verification network. First, the encoding network extracts the two-dimensional features
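The multi-view fusion network named above can be sketched as follows: per-view coarse volumes are scored with parallel 3D convolutions of different kernel sizes (the multi-scale receptive fields), the scores are softmax-normalized across views, and the volumes are blended by those weights. The kernel sizes, channel counts, and module name are illustrative assumptions rather than the paper's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleContextFusion(nn.Module):
    """Illustrative multi-scale, context-aware fusion of per-view coarse volumes."""

    def __init__(self, channels: int = 9):
        super().__init__()
        # parallel branches with different receptive fields (3x3x3 and 5x5x5)
        self.branch3 = nn.Conv3d(1, channels, 3, padding=1)
        self.branch5 = nn.Conv3d(1, channels, 5, padding=2)
        self.score = nn.Conv3d(2 * channels, 1, 3, padding=1)

    def forward(self, coarse_volumes: torch.Tensor) -> torch.Tensor:
        # coarse_volumes: (B, N_views, D, H, W) per-view occupancy predictions
        b, n, d, h, w = coarse_volumes.shape
        v = coarse_volumes.reshape(b * n, 1, d, h, w)
        context = torch.cat([self.branch3(v), self.branch5(v)], dim=1)
        scores = self.score(context).reshape(b, n, d, h, w)
        weights = F.softmax(scores, dim=1)           # normalize across views
        fused = (weights * coarse_volumes).sum(dim=1)
        return fused                                 # (B, D, H, W) fused volume
```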

Dataset Preparation

We use a subset of the ShapeNet dataset, which includes 13 categories of objects and a total of 50,000 3D models. Fig. 7 shows all view combinations for a model in the watercraft category; Fig. 8(a) shows the corresponding front view of the 3D voxel model, and Fig. 8(b) shows the open side view of the voxel model.

Before the experiments, the dataset is randomly divided into two parts: 80% is used for training and the remaining 20% for testing. In model training, the training set will
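A minimal sketch of the random 80/20 split described above, assuming the per-model voxel files have already been collected into a list; the function name, ratio argument, and fixed seed are illustrative.

```python
import random


def split_dataset(model_paths, train_ratio=0.8, seed=42):
    """Randomly split the ShapeNet models into training and test subsets."""
    paths = list(model_paths)
    random.Random(seed).shuffle(paths)   # deterministic shuffle for reproducibility
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]      # (training set, test set)
```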

Conclusion

In this paper, we propose P2VNet, a deep learning-based network that generates three-dimensional voxels from two-dimensional images. P2VNet adds several depth estimation modules to the encoding and decoding networks and introduces a multi-scale context-aware fusion network, which estimates more accurate 3D voxels. Experimental results show that the average reconstruction accuracy of P2VNet is 68.2%. It was 9.5% more accurate on average than 3D-R2N2 and 1.5% more

Author statement

Jun Yu: Data curation, Programming, Writing - review & editing. Wenbin Yin: Conceptualization, Methodology, Writing. Zhiyi Hu: Supervision, Investigation. Yabin Liu: Validation, Editing. All persons who have made substantial contributions to the work reported in the manuscript have given their written permission to be named.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (29)

  • Z. Ma et al., A review of 3D reconstruction techniques in civil engineering and their applications, Advanced Engineering Informatics (2018)
  • A. Graves et al., Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks (2005)
  • H. Lu et al., Brain intelligence: go beyond artificial intelligence, Mobile Networks and Applications (2018)
  • M.M. Nasralla et al., Computer vision and deep learning-enabled UAVs: proposed use cases for visually impaired people in a smart city
  • M.A. Khan et al., Swarm of UAVs for network management in 6G: a technical review, IEEE Transactions on Network and Service Management (2022)
  • M.M. Nasralla et al., MASEMUL: a simulation tool for movement-aware MANET scheduling strategies for multimedia communications, Wireless Communications and Mobile Computing (2021)
  • O. Özyeşil et al., A survey of structure from motion, Acta Numerica (2017)
  • J. Fuentes-Pacheco et al., Visual simultaneous localization and mapping: a survey, Artificial Intelligence Review (2015)
  • B. Yang et al., Dense 3D object reconstruction from a single depth view, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
  • Z. Wu et al., 3D ShapeNets: a deep representation for volumetric shapes
  • X. Li et al., Synthesizing 3D shapes from silhouette image collections using multi-projection generative adversarial networks
  • S. Liu et al., Learning a hierarchical latent-variable model of 3D shapes
  • C.B. Choy et al., 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction
  • H. Fan et al., A point set generation network for 3D object reconstruction from a single image
