Large-scale, dense city reconstruction from user-contributed photos

https://doi.org/10.1016/j.cviu.2011.07.010

Abstract

The goal of our work is to incrementally reconstruct terrestrial city models from standard digital camera images contributed by multiple users. Hence, the Wiki principle, well known from textual knowledge databases, is transferred to 3D computer vision. Many state-of-the-art computer vision methods must be applied and adapted to the changing requirements of such a system. We describe the utilized 3D vision methods in detail and show results obtained from the current image databases Vienna and Graz, acquired by in-house participants. The reconstructions are all maintained in a global database and comprise thousands of photographs.

Highlights

► We present incremental large scale scene reconstruction from unordered images. ► We apply the Wiki principle for detailed 3D reconstruction of cities. ► We present a simple, fast and accurate camera calibration method. ► Our approach is fast and scalable and delivers photo-realistic 3D models of a scene.

Introduction

Recently, 3D vision methods have achieved increased robustness and demonstrated photorealistic reconstruction results. Therefore, multi-view modeling is increasingly targeted at large-scale reconstruction of environments. In particular, efforts like Google Earth or Microsoft Virtual Earth aim at the systematic creation of virtual models explicitly from aerial images. In this paper we focus on the uncoordinated generation of digital copies of urban habitats from community-supplied terrestrial images. One well-known problem in computer vision is the creation of panoramas, and fully automated tools like AutoStitch [3] are already utilized by a broad audience. We intend to bring image-based reconstruction to the same level of robustness and applicability for the public. Our proposed approach is designed to work on unorganized but pre-calibrated image datasets. Taking advantage of recent progress in image matching and structure-from-motion (SfM), we present an end-to-end workflow for image-based scene reconstruction.

In particular, our idea is to apply the famous and effective Wiki principle, well known from textual knowledge databases (e.g. Wikipedia), to the objective of creating photorealistic 3D city models. As input we rely on images from low-cost digital consumer cameras taken by multiple users. Being integrated in most of today's mobile phones, digital cameras are nowadays available at any time and everywhere. Furthermore, photogrammetric evaluations [16] have also shown that mobile phone cameras provide sufficiently high accuracy for many photogrammetric tasks. We expect that this kind of device can be used for detailed and accurate city modeling in the future. Recent advances in wide-baseline image matching and structure-from-motion have made it possible to reconstruct a scene even from diverse and uncontrolled photo collections taken by different people under varying weather and illumination conditions. Such a system was presented by Snavely et al. [41], demonstrating fully automatic 3D reconstruction from community photo collections downloaded from photo-sharing websites (e.g. www.flickr.com). Goesele et al. [14] further demonstrated the applicability of multi-view stereo (MVS) techniques on such inhomogeneous and diverse datasets.

Community photo collections normally comprise thousands or even millions of images of famous and important landmarks. However, it turns out that humans have a tendency towards capturing a landmark from just a few prominent viewpoints. These locations exhibit a very high image density, whereas photos from ordinary streets or even whole cities might be entirely missing. In contrast, a Wiki-based reconstruction approach implies a more structured image acquisition strategy, since photos are intentionally captured for the purpose of 3D modeling. Therefore, larger and more complete 3D models are eventually obtained. Fig. 1a and b shows reconstructions of Notre Dame Cathedral computed from community photo collections. Even though thousands of images of Notre Dame exist on the web, only the front facade of the landmark can be reconstructed from them (due to the lack of images from other perspectives). In contrast, a Wiki-based, structured image acquisition strategy leads to more complete reconstruction results, as depicted in Fig. 1c.

We expect that a user-contributed system will result in a rapid creation of virtual copies of urban environments. In the first instance, we aim at dense textured models of quality similar to the results presented in [36]. Naturally, these raw models need to be post-processed in subsequent steps to allow an efficient visualization. By adding more and more images, the reconstructed models can be incrementally maintained and gradually refined. The final city models can then be used for many applications, ranging from tourism and cultural heritage over city planning to emergency support. An overview of our proposed reconstruction pipeline is depicted in Fig. 2.

In this paper we argue that a Wiki-based approach for city modeling has several advantages over 3D reconstruction from Internet photo collections. First of all, a regular and structured image acquisition strategy in general leads to a broader coverage of a landmark. Secondly, a Wiki-based system provides feedback to the user, who can decide which images are needed for better scene coverage. Furthermore, such an approach allows direct access to the camera, so a precise calibration of the camera intrinsics can be done before reconstruction. To this end we propose an accurate self-calibration method based on coded markers that is both fast and simple, and can therefore be directly employed by end users. By using calibrated cameras for 3D reconstruction, accurate and robust reconstruction results are obtained. Furthermore, a regular, Wiki-based image acquisition policy enables direct dense modeling techniques, so that view selection optimization for MVS [14] is not necessarily required. In this paper we further propose an incremental structure-from-motion algorithm that can run online; hence there is no requirement to see all images in advance. Finally, we demonstrate dense 3D reconstruction based on variational methods and present a numerical scheme for GPU-based depth image fusion with guaranteed convergence properties.
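To make the depth image fusion step concrete, the following sketch fuses several per-view depth maps with a per-pixel median. This is only an illustrative stand-in for the variational, GPU-based scheme with convergence guarantees mentioned above; all function and parameter names are our own.

```python
import warnings
import numpy as np

def fuse_depth_maps(depth_maps, min_views=2):
    """Fuse per-view depth maps into one robust depth estimate.

    depth_maps: (N, H, W) array; 0 marks missing/invalid depth.
    A per-pixel median over the valid views stands in for the
    variational fusion used in the paper; pixels observed by fewer
    than `min_views` cameras are left at 0 (unknown).
    """
    stack = np.asarray(depth_maps, dtype=float)
    valid = stack > 0                      # per-view validity masks
    count = valid.sum(axis=0)              # number of views per pixel
    masked = np.where(valid, stack, np.nan)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", RuntimeWarning)  # all-NaN pixels
        fused = np.nanmedian(masked, axis=0)             # robust to outliers
    fused = np.where(count >= min_views, fused, 0.0)
    return np.nan_to_num(fused)
```

The median makes the fusion robust: a single grossly wrong depth map (e.g. from a mismatched view) does not corrupt pixels seen correctly by the majority of views.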

Section snippets

Related work

Multi-view reconstruction has matured during the past decades [18] and led to fully automatic reconstruction systems based on video data and still images. Pollefeys et al. presented in [35] a large-scale 3D reconstruction system for video streams which operates in real time and delivers detailed 3D models in the form of textured polygonal meshes. The fusion of the structure-from-motion output with data from an Inertial Navigation System (INS) and Global Positioning System (GPS) allows the

System overview

From the perspective of a user, our proposed reconstruction system consists of two processing steps. First, the user has to place some calibration targets on an approximately planar surface (e.g. the floor) and then take several photographs from different viewpoints at constant focal length (e.g. at minimal zoom). The calibration target consists of simple markers printed on several sheets of paper. Thereafter, the images are submitted to the reconstruction system where the intrinsic calibration

Camera calibration

In our reconstruction system we separate the intrinsic camera calibration from the structure-from-motion computation. With a simple-to-use and flexible calibration procedure we determine the radial distortion and the camera intrinsics K,

    K = | αx  sθ  u0 |
        |  0  αy  v0 |
        |  0   0   1 |

where (u0, v0) are the coordinates of the principal point, αx and αy the scale factors in the x- and y-direction, and sθ the skew factor (see [18] for details). We assume a simplified camera model, where the aspect ratio is one (i.e. α = αx = αy)
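Assembling K from these parameters and using it to project camera-frame points can be sketched in a few lines of numpy. This is only an illustration of the simplified model above (aspect ratio one); the function names are our own and the radial distortion handling of the actual calibration procedure is omitted.

```python
import numpy as np

def intrinsic_matrix(alpha, u0, v0, skew=0.0):
    """Build K for the simplified model with aspect ratio 1 (alpha_x = alpha_y)."""
    return np.array([[alpha, skew, u0],
                     [0.0,   alpha, v0],
                     [0.0,   0.0,  1.0]])

def project(K, X):
    """Project a 3D point X (in camera coordinates) to pixel coordinates."""
    x = K @ np.asarray(X, dtype=float)
    return x[:2] / x[2]   # perspective division
```

For instance, a point on the optical axis projects exactly onto the principal point (u0, v0), which is a quick sanity check for any calibration result.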

Structure from motion computation

For the structure-from-motion (SfM) computation we start from an expanding set of calibrated images. Each input image is resampled according to the lens distortion obtained from the calibration procedure. Since the internal camera parameters are known, a Euclidean SfM algorithm is employed to compute the camera orientations and the sparse point reconstruction of individual scenes. Overall, our SfM algorithm consists of three major processing steps. Firstly, salient features are extracted in
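One core building block of any Euclidean SfM pipeline is triangulating a sparse 3D point from its observations in two calibrated views. The following is a minimal linear (DLT) sketch under that assumption, not the paper's full pipeline, which additionally involves robust matching, RANSAC, and bundle adjustment; the function name is our own.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two calibrated views.

    P1, P2: 3x4 projection matrices K[R|t]; x1, x2: observed pixel
    coordinates of the same scene point. Each observation contributes
    two linear constraints x * (P row 3) - (P row 1/2) = 0; the 3D point
    is the null vector of the stacked 4x4 system, found via SVD.
    """
    A = np.array([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]              # homogeneous solution, defined up to scale
    return X[:3] / X[3]     # dehomogenize
```

In practice such algebraic triangulation only initializes a point; bundle adjustment then refines points and cameras jointly by minimizing reprojection error.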

Dense depth estimation

Obtaining the initial 3D structure and camera locations is an essential aspect of image-based modeling, but dense geometry is required for a faithful virtual representation of the captured scene. Since we face a large number of images with potentially associated depth maps, we focus on simple but fast dense depth estimation procedures.

Plane-sweep approaches to dense stereo [49], [7] enable an efficient, GPU-accelerated procedure to create the depth maps. In order to obtain reasonable depth
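For a rectified two-view pair, a plane sweep reduces to testing fronto-parallel disparity hypotheses, shifting one image against the other and keeping the lowest-cost hypothesis per pixel. The CPU sketch below illustrates this special case with winner-takes-all selection over box-aggregated absolute differences; the cited GPU approaches handle arbitrary camera poses by warping via plane-induced homographies, and all names here are our own.

```python
import numpy as np

def plane_sweep_disparity(left, right, max_disp, radius=1):
    """Fronto-parallel plane sweep on a rectified stereo pair.

    For each disparity hypothesis d, compare left(x) against right(x - d),
    aggregate absolute differences over a (2*radius+1)^2 window, and let
    each pixel keep the cheapest hypothesis (winner takes all).
    """
    H, W = left.shape
    k = 2 * radius + 1
    costs = np.full((max_disp + 1, H, W), np.inf)
    for d in range(max_disp + 1):
        diff = np.full((H, W), np.inf)     # columns x < d have no match
        diff[:, d:] = np.abs(left[:, d:] - right[:, :W - d])
        # box-filter aggregation over the window
        padded = np.pad(diff, radius, mode="edge")
        agg = np.zeros((H, W))
        for dy in range(k):
            for dx in range(k):
                agg += padded[dy:dy + H, dx:dx + W]
        costs[d] = agg
    return np.argmin(costs, axis=0)
```

This winner-takes-all scheme is fast but noisy; the depth maps produced this way are typically filtered or fused across many views, as in the fusion stage of the pipeline, before meshing.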

Results

In order to test our reconstruction approach at large scale, we acquired several thousand photographs of urban environments over a period of time. Some scenes are fully connected, others are widely separated and cannot be visually linked. In particular, we have a database Vienna consisting of 2640 images and a larger database of 7181 street-side images from Graz. In total, four different compact digital cameras were used to generate the databases. The images were captured at different days

Future work

Currently, images are contributed by persons associated with this project and with basic knowledge in 3D computer vision. Future work needs to increase the robustness of the structure-from-motion methods to allow the public to participate in the creation of the visual database. In particular, it needs to be assured that low-quality or defective images do not degrade the 3D models in the generated database. Another problem is repetitive structures, which are difficult to handle in our current

Acknowledgments

Much of this work was done while Christopher Zach was at the VRVis research center, Graz and Vienna, Austria (http://www.vrvis.at). This work is partly funded by the Vienna Science and Technology Fund (WWTF), and the EU Integrated Project “IPCity” FP6-2004-IST-4-27571.

References (55)

  • Y.D. Dong, An LS-free splitting method for composite mappings, Appl. Math. Lett. (2005)
  • H. Bay, T. Tuytelaars, L. Van Gool, SURF: speeded up robust features, in: European Conference on Computer Vision...
  • C. Beder, R. Steffen, Determining an initial image pair for fixing the scale of a 3D reconstruction from an image...
  • M. Brown et al., Automatic panoramic image stitching using invariant features, Int. J. Comput. Vis. (2007)
  • M. Brown et al., Unsupervised 3D object recognition and reconstruction in unordered datasets
  • J.-F. Cai, S. Osher, Z. Shen, Convergence of the linearized Bregman iteration for l1-norm minimization, Technical...
  • O. Chum, J. Philbin, J. Sivic, M. Isard, A. Zisserman, Total recall: automatic query expansion with a generative...
  • N. Cornelis, L. Van Gool, Real-time connectivity constrained depth map computation using programmable graphics...
  • B. Curless, M. Levoy, A volumetric method for building complex models from range images, in: Proceedings of SIGGRAPH...
  • E. Esser, Applications of Lagrangian-based alternating direction methods and connections to split Bregman, Technical...
  • O.D. Faugeras et al., Motion and structure from motion in a piecewise planar environment, Int. J. Pattern Recogn. Artif. Intell. (1988)
  • M. Fischler et al., Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM (1981)
  • R. Glowinski et al., Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics (1989)
  • M. Goesele, N. Snavely, B. Curless, H. Hoppe, S.M. Seitz, Multi-view stereo for community photo collections, in: IEEE...
  • M. Grabner, H. Grabner, H. Bischof, Fast approximated SIFT, in: Asian Conference on Computer Vision (ACCV), 2006, pp....
  • A. Gruen, D. Akca, Mobile photogrammetry, in: DGPF Tagungsband 16/2007 – Dreilaendertagung SGPBF, DGPF und OVG,...
  • R.M. Haralick, C. Lee, K. Ottenberg, M. Nölle, Analysis and solutions of the three point perspective pose estimation...
  • R. Hartley et al., Multiple View Geometry in Computer Vision (2000)
  • A. Hornung, L. Kobbelt, Robust reconstruction of watertight 3D models from non-uniformly sampled point clouds without...
  • H. Jegou, H. Harzallah, C. Schmid, A contextual dissimilarity measure for accurate and efficient image search, in: IEEE...
  • G. Kamberov, G. Kamberova, O. Chum, S. Obdrzalek, D. Martinec, J. Kostkova, T. Pajdla, J. Matas, R. Sara, 3D geometry...
  • M. Kazhdan, M. Bolitho, H. Hoppe, Poisson surface reconstruction, in: Symposium on Geometry Processing, 2006, pp....
  • V. Lempitsky, Y. Boykov, Global optimization for shape fitting, in: IEEE Conference on Computer Vision and Pattern...
  • X. Li, C. Wu, C. Zach, S. Lazebnik, J.-M. Frahm, Modeling and recognition of landmark image collections using iconic...
  • M.I.A. Lourakis, A.A. Argyros, The design and implementation of a generic sparse bundle adjustment software package...
  • D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
  • E. Malis et al., Camera self-calibration from unknown planar structures enforcing the multiview constraints between collineations, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) (2002)