Modeling large-scale indoor scenes with rigid fragments using RGB-D cameras

https://doi.org/10.1016/j.cviu.2016.11.008

Highlights

  • A two-stage strategy for real-time dense 3D reconstruction of large-scale scenes.

  • The first stage generates local 3D models from short subsequences of RGB-D images.

  • The second stage builds a global model from local ones while minimizing deformations.

  • Modifications of the global model reduce to re-positioning the planar patches.

  • Our structured 3D scene representation allows efficient, yet accurate computations.

Abstract

Hand-held consumer depth cameras have become a commodity tool for constructing 3D models of indoor environments in real time. Recently, many methods have been proposed to fuse low-quality depth images into a single dense and high-fidelity 3D model. Nonetheless, dealing with large-scale scenes remains a challenging problem. In particular, the accumulation of small errors due to imperfect camera localization becomes critical at large scale and results in dramatic deformations of the built 3D model. These deformations have to be corrected whenever possible (for example, when a loop exists). To facilitate such correction, we use a structured 3D representation where points are clustered into the several planar patches that compose the scene. We then propose a two-stage framework to build a detailed, large-scale 3D model in real time. The first stage (the local mapping) generates local structured 3D models with rigidity constraints from short subsequences of RGB-D images. The second stage (the global mapping) aggregates all local 3D models into a single global model in a geometrically consistent manner. Thanks to our structured 3D representation, minimizing deformations of the global model reduces to re-positioning the planar patches of the local models. This allows efficient, yet accurate computations. Our experiments using real data confirm the effectiveness of our proposed method.

Introduction

3D reconstruction of indoor scenes using consumer depth cameras has attracted an ever-growing interest in the last decade. Many applications such as robot navigation, virtual and augmented reality or digitization of cultural heritage can benefit from highly detailed large-scale 3D models of indoor scenes built using cheap RGB-D sensors. In general, the 3D reconstruction process consists of (1) generating multiple 2.5D views of the target scene, (2) registering (i.e., aligning) all different views into a common coordinate system, (3) fusing all measurements into a single mathematical 3D representation and (4) correcting possible deformations.

To obtain depth measurements of a target scene (i.e., 2.5D views), many strategies exist, which can be classified into either passive sensing or active sensing. A popular example of passive sensing is stereo vision (Lazaros et al., 2008). On the other hand, structured light (Rusinkiewicz et al., 2002) and time-of-flight (Hansard et al., 2012) are the most popular techniques for active sensing. Consumer depth cameras such as the Microsoft Kinect or the Asus Xtion Pro use such active-sensing techniques and produce low-quality depth images at video rate and at low cost. These sensors have raised much interest in the computer vision community, in particular for the task of automatic 3D modeling.

The video frame rate provided by consumer depth cameras brings several advantages for 3D modeling. One distinct advantage is that it simplifies the registration problem, because the transformation between two successive frames can be assumed to be sufficiently small. As a consequence, well-known standard registration algorithms such as variants of the Iterative Closest Point (ICP) (Besl and McKay, 1992) work efficiently. Moreover, by accumulating the many (noisy) depth measurements available for each point in the scene, it is possible to compensate for the low quality of a single depth image and construct detailed 3D models. A well-known successful system for 3D modeling using RGB-D cameras is KinectFusion (Newcombe et al., 2011). In this system, a linearized version of Generalized ICP (GICP) (Segal et al., 2009) is used in a frame-to-global-model registration framework to align successive depth images, which are accumulated into a volumetric Truncated Signed Distance Function (TSDF) (Curless and Levoy, 1996) using a running average. Using rather simple, well-established tools, impressive 3D models can be obtained at video frame rate, which demonstrates the potential of RGB-D cameras for fine 3D modeling.
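To make the fusion step concrete, the sketch below shows a running-average TSDF update in the spirit of KinectFusion. It is a minimal illustration, not the implementation used in this paper: the dense voxel grid anchored at the origin, the truncation distance, the unit per-frame weight and the weight cap are all illustrative assumptions.

    import numpy as np

    def update_tsdf(tsdf, weights, depth, K, T_wc, voxel_size, trunc=0.03, max_weight=64.0):
        """Fuse one depth image into a TSDF volume with a running (weighted) average.

        tsdf, weights : (X, Y, Z) arrays holding the current distance field and its weights.
        depth         : (H, W) depth image in meters (0 where invalid).
        K             : 3x3 camera intrinsics; T_wc : 4x4 world-to-camera transform.
        """
        X, Y, Z = tsdf.shape
        H, W = depth.shape
        # World coordinates of every voxel center (illustrative: volume anchored at the origin).
        ii, jj, kk = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij")
        pts_w = np.stack([ii, jj, kk], axis=-1).reshape(-1, 3) * voxel_size
        # Bring voxel centers into the camera frame and project them onto the image.
        pts_c = (T_wc[:3, :3] @ pts_w.T + T_wc[:3, 3:4]).T
        z = pts_c[:, 2]
        z_safe = np.where(z > 1e-6, z, 1.0)
        u = np.round(pts_c[:, 0] * K[0, 0] / z_safe + K[0, 2]).astype(int)
        v = np.round(pts_c[:, 1] * K[1, 1] / z_safe + K[1, 2]).astype(int)
        valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        d = np.zeros_like(z)
        d[valid] = depth[v[valid], u[valid]]
        valid &= d > 0
        # Truncated signed distance along the viewing ray, clipped to [-1, 1].
        sdf = d - z
        valid &= sdf > -trunc
        f = np.clip(sdf / trunc, -1.0, 1.0)
        # Running average: F <- (W*F + w*f) / (W + w), then W <- min(W + w, W_max).
        t_flat, w_flat = tsdf.reshape(-1), weights.reshape(-1)
        t_flat[valid] = (w_flat[valid] * t_flat[valid] + f[valid]) / (w_flat[valid] + 1.0)
        w_flat[valid] = np.minimum(w_flat[valid] + 1.0, max_weight)
        return tsdf, weights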

Another interesting property of consumer depth cameras is that they can be hand-held, which makes reconstructing large-scale scenes rather easy. With this new possibility, however, new challenges also arise: how to deal with error propagation and with a large amount of data? In other words, how to minimize deformations of the produced 3D model while keeping fine details, even at large scale?

In the last several years, a large body of research has been reported on large-scale 3D reconstruction (Chen et al., 2013; Henry et al., 2013; Meilland and Comport, 2013; Nießner et al., 2013; Roth and Vona, 2012; Thomas and Sugimoto, 2014; Whelan et al., 2012; Zeng et al., 2013; Zhou and Koltun, 2013; Zhou et al., 2013). Notable works employ hash tables for efficient storage of volumetric data (Nießner et al., 2013), patch volumes to close loops on the fly (Henry et al., 2013) and non-rigid registration of sub-models to reduce deformations (Zhou et al., 2013). Though recent works have considerably improved the scale and the quality of 3D reconstruction using consumer depth cameras, existing methods still offer little flexibility for manipulating the constructed 3D models. This is because most existing methods work at the pixel level with unstructured 3D representations (voxels with a volumetric TSDF, or 3D vertices with meshes). Modifying the whole 3D model then becomes difficult and computationally expensive, which precludes closing loops or correcting deformations on the fly. Introducing a structured 3D representation overcomes this problem. Indeed, in the work by Henry et al. (2013), patch volumes are used to manipulate structured 3D models, which allows simpler modifications of the 3D model. In this work, we push forward in this direction by taking advantage of a parametric 3D surface representation with bump images (Thomas and Sugimoto, 2013) that allows easy manipulation of the 3D model while maintaining fine details, real-time performance and efficient storage. Our experimental evaluation demonstrates that by using this structured 3D representation, loops can be efficiently closed on the fly and overall deformations are kept to a minimum, even for large-scale scenes.

The main contributions of this paper are: (1) the creation of a semi-global model to locally fuse depth measurements and (2) the introduction of identity constraints between multiple instances of the same part of the scene viewed at different times in the RGB-D image sequence, which enables us to run a fragment registration algorithm to efficiently close loops. Overall, we propose a method that segments the input sequence of RGB-D images both in time and in space and that is able to build high-fidelity 3D models of indoor scenes in real time. After briefly recalling the parametric 3D surface representation with bump images in Section 3, we introduce our two-stage strategy in Section 4. We demonstrate the effectiveness of our proposed method through comparative evaluation in Section 5 before concluding in Section 6. We note that a part of this work appeared in Thomas and Sugimoto (2014).
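The role of such identity constraints can be illustrated with a toy pose-graph optimization in 2D: when two fragments are known to contain the same planar patch, a constraint pulling their relative pose towards the identity closes the loop and redistributes the accumulated drift. This is only a sketch of the general idea (the paper performs fragment registration on full 3D poses within a graph optimization framework); the toy trajectory, the weights and the solver choice below are our own assumptions.

    import numpy as np
    from scipy.optimize import least_squares

    def wrap(a):
        # Wrap an angle to (-pi, pi].
        return (a + np.pi) % (2.0 * np.pi) - np.pi

    def relative_pose(pa, pb):
        # 2D pose of b expressed in the frame of a; poses are (x, y, theta).
        c, s = np.cos(pa[2]), np.sin(pa[2])
        dx, dy = pb[0] - pa[0], pb[1] - pa[1]
        return np.array([c * dx + s * dy, -s * dx + c * dy, wrap(pb[2] - pa[2])])

    def residuals(x, edges):
        poses = x.reshape(-1, 3)
        res = [poses[0]]  # gauge prior: softly anchor the first fragment at the origin
        for i, j, meas, w in edges:
            err = relative_pose(poses[i], poses[j]) - meas
            err[2] = wrap(err[2])
            res.append(w * err)
        return np.concatenate(res)

    # Odometry-style constraints: each fragment moves 1 m forward then turns 90 degrees,
    # so fragment 4 should land exactly on fragment 0 after going around the square.
    step = np.array([1.0, 0.0, np.pi / 2.0])
    edges = [(i, i + 1, step, 1.0) for i in range(4)]
    # Identity constraint: fragments 0 and 4 contain the same planar patch, so their
    # relative pose is pulled towards the identity (zero) transform with a high weight.
    edges.append((0, 4, np.zeros(3), 10.0))

    # Drifted initial guess: simulates the accumulated pose error before loop closure.
    x0 = np.array([[0.00, 0.00, 0.00],
                   [1.05, 0.02, np.pi / 2 + 0.03],
                   [1.10, 1.05, np.pi + 0.05],
                   [0.05, 1.15, -np.pi / 2 + 0.08],
                   [0.15, 0.10, 0.12]]).ravel()

    sol = least_squares(residuals, x0, args=(edges,))
    print(sol.x.reshape(-1, 3))  # corrected fragment poses with the loop closed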

Section snippets

Related work

In the last few years, much work has been reported on fusing input RGB-D data into a single global 3D model.

Weise et al. (2009) proposed to build a global 3D model using surface elements called Surfels (Pfister et al., 2000). They also proposed to handle loop closure by enforcing the global model as-rigidly-as-possible using a topology graph where each vertex is a Surfel and edges connect neighboring Surfels. However, maintaining such a topology graph for large-scale scenes is not

Parametric 3D scene representation

We construct 3D models of indoor scenes using a structured 3D scene representation based on planar patches augmented with bump images (Thomas and Sugimoto, 2013). At run time, the scene is segmented into multiple planar patches, and the detailed geometry of the scene around each planar patch is recorded into its bump image. Each bump image (i.e., the local geometry) is assumed to be accurate because standard RGB-D SLAM algorithms work well locally (deformation problems arise at large scale).
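As an illustration of why this representation makes the model easy to manipulate, the sketch below organizes a patch as a local plane frame plus an attribute grid storing signed heights. The field names, the nearest-cell integration and the fixed-resolution grid are illustrative assumptions rather than the exact data layout of Thomas and Sugimoto (2013); the 0.4 cm cell size matches the resolution used in our experiments.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class PlanarPatch:
        """A planar patch augmented with a bump (height) image.

        origin, u_axis, v_axis and normal define the local patch frame; 'bump' stores,
        for every cell of a regular grid on the plane, the signed distance of the
        surface to the plane, so fine geometry survives while the patch itself can
        be re-positioned by simply updating its frame.
        """
        origin: np.ndarray          # (3,) a point on the plane
        u_axis: np.ndarray          # (3,) first in-plane direction (unit)
        v_axis: np.ndarray          # (3,) second in-plane direction (unit)
        normal: np.ndarray          # (3,) plane normal (unit)
        size: tuple                 # (rows, cols) of the attribute images
        res: float = 0.004          # cell size in meters (0.4 cm, as in the experiments)
        bump: np.ndarray = None     # (rows, cols) signed heights above the plane
        mask: np.ndarray = None     # (rows, cols) flags marking cells that hold valid data

        def __post_init__(self):
            if self.bump is None:
                self.bump = np.zeros(self.size, dtype=np.float32)
            if self.mask is None:
                self.mask = np.zeros(self.size, dtype=bool)

        def integrate_point(self, p):
            """Record a 3D point into the bump image (nearest-cell, last-write-wins sketch)."""
            d = p - self.origin
            r = int(round(np.dot(d, self.v_axis) / self.res))
            c = int(round(np.dot(d, self.u_axis) / self.res))
            if 0 <= r < self.size[0] and 0 <= c < self.size[1]:
                self.bump[r, c] = np.dot(d, self.normal)
                self.mask[r, c] = True

        def to_points(self):
            """Unproject the bump image back to 3D, e.g. after the patch is re-positioned."""
            rr, cc = np.nonzero(self.mask)
            return (self.origin
                    + np.outer(cc * self.res, self.u_axis)
                    + np.outer(rr * self.res, self.v_axis)
                    + np.outer(self.bump[rr, cc], self.normal))

        def transform(self, R, t):
            """Re-position the patch rigidly: only the frame changes, the bump image is untouched."""
            self.origin = R @ self.origin + t
            self.u_axis, self.v_axis, self.normal = R @ self.u_axis, R @ self.v_axis, R @ self.normal

Because the fine geometry lives in the attribute images while the pose lives in the patch frame, re-positioning a patch (e.g., after loop closure) amounts to a constant-time frame update that leaves the recorded details untouched.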

A two-stage strategy for 3D modeling

We employ the above-mentioned 3D scene representation for constructing large-scale 3D models. With this representation, the target scene is structured into multiple planar patches with attributes. To achieve successful 3D reconstruction using such a representation, one key assumption is that each planar patch is correctly reconstructed (i.e., no deformations). If this assumption is satisfied, then drift errors in the camera pose estimation can be corrected by just repositioning different planar

Experiments

We evaluated our proposed method in several situations using real data. All scenes were captured at 30 fps and their dimensions are presented in Table 1. Note that the datasets Office, Library, Library-2 and Library-3 were captured by ourselves using a Microsoft Kinect for Windows v1 sensor. We used a resolution of 0.4 cm for the attribute images in all cases. The CPU we used was an Intel Xeon processor running at 3.47 GHz and the GPU was an NVIDIA GeForce GTX 580. Our method runs at about 28 fps with a live

Conclusion

We proposed a two-stage strategy, local mapping and global mapping, to build detailed large-scale 3D models with minimal deformations in real time from RGB-D image sequences. The local mapping creates accurate structured local 3D models from short subsequences, while the global mapping organizes all the local 3D models into a global model in an undeformed way using fragment registration in the graph optimization framework. Introducing rigidity and identity constraints facilitates repositioning

References

  • Q.-Y. Zhou et al. Dense scene reconstruction with points of interest. ACM Trans. Graph. (2013)

  • 3D Scene Dataset.

  • P.J. Besl et al. A method for registration of 3-D shapes. IEEE Trans. PAMI (1992)

  • G. Blais et al. Registering multiview range data to create 3D computer objects. IEEE Trans. PAMI (1995)

  • R. Kümmerle et al. g2o: a general framework for graph optimization. Proceedings of ICRA (2011)

  • J. Chen et al. Scalable real-time volumetric surface reconstruction. ACM Trans. Graph. (2013)

  • B. Curless et al. A volumetric method for building complex models from range images. Proceedings of SIGGRAPH (1996)

  • M. Hansard et al. Time-of-Flight Cameras: Principles, Methods and Applications. SpringerBriefs in Computer Science (2012)

  • P. Henry et al. Patch volumes: segmentation-based consistent mapping with RGB-D cameras. Proceedings of 3DV (2013)

  • P. Henry et al. RGB-D mapping: using Kinect-style depth cameras for dense 3D modeling of indoor environments. Int. J. Rob. Res. (2012)

  • M. Keller et al. Real-time 3D reconstruction in dynamic scenes using point-based fusion. Proceedings of 3DV (2013)

  • N. Lazaros et al. Review of stereo vision algorithms: from software to hardware. Int. J. Optomech. (2008)

  • D.G. Lowe. Object recognition from local scale-invariant features. Proceedings of ICCV (1999)