Divide and conquer: A hierarchical approach to large-scale structure-from-motion

https://doi.org/10.1016/j.cviu.2017.02.006

Highlights

  • A divide and conquer based large-scale 3D reconstruction pipeline is proposed.

  • The method partitions a large dataset into smaller well constrained clusters.

  • Each cluster is independently reconstructed using global SfM techniques.

  • A registration method using epipolar geometry is proposed to merge reconstructions.

Abstract

In this paper we present a novel pipeline for large-scale SfM. We first organise the images into a hierarchical tree built using agglomerative clustering. The SfM problem is then solved by reconstructing smaller image sets and merging them into a common frame of reference as we move up the tree in a bottom-up fashion. Such an approach drastically reduces the computational load for matching image pairs without sacrificing accuracy. It also makes the resulting sequence of bundle adjustment problems well-conditioned at all stages of reconstruction. We use motion averaging followed by global bundle adjustment for reconstruction of each individual cluster. Our 3D registration or alignment of partial reconstructions based on epipolar relationships is both robust and reliable and works well even when the available camera-point relationships are poorly conditioned. The overall result is a robust, accurate and efficient pipeline for large-scale SfM. We present extensive results that demonstrate these attributes of our pipeline on a number of large-scale, real-world datasets and compare with the state-of-the-art.

Introduction

In structure-from-motion (SfM) we recover both 3D geometry of points and camera motion parameters from correspondences across images. Despite the impressive performance of contemporary pipelines (Snavely, Seitz, Szeliski, 2006, Wu, 2013), there are some significant issues in solving large-scale SfM as both data size and problem complexity grow substantially. While the SfM literature is voluminous, we confine ourselves here to a discussion of the challenges in solving really large-scale problems (more than 1000 images). In this paper we present an efficient and robust pipeline for solving large-scale SfM problems in a hierarchical manner. Our method combines the benefits of both local and global approaches to SfM which form two prototypical extremes for SfM methods. We demonstrate that our method is more robust with respect to failures when compared to the state-of-the-art and offers at least one order of magnitude speed-up without sacrificing accuracy.

Many of the successful large-scale approaches treat SfM as one of minimising the reprojection error using bundle adjustment (Triggs et al., 2000) and its efficient modern variants (Agarwal, Snavely, Seitz, Szeliski, 2010, Havlena, Torii, Pajdla, 2010, Snavely, Snavely, Seitz, Szeliski, 2006, Snavely, Seitz, Szeliski, 2008, Wu, Agarwal, Curless, Seitz, 2011). As large-scale optimisations do not easily converge, these methods use a variety of heuristics and approximations to achieve a valid solution. In Agarwal et al. (2009), a high level of parallelism is utilised in the pre-processing stages, following which the SfM optimisation is based on a reduced skeletal set. This approach is further refined in Agarwal et al. (2010) by more efficient optimisation. Parallelism is also exploited in Wu et al. (2011). A similar approach of skeletal graph construction out of ‘iconic’ images is presented in Frahm et al. (2010). In sequential approaches such as Snavely et al. (2006) and Wu (2013), the difficulty of large-scale SfM is handled by incrementally adding one image at a time to a solution based on bundle adjustment seeded with a few images. To reduce the matching time, Wu (2013) (henceforth VSFM) proposes preemptive matching to reduce the number of pairs to be matched. Wu (2013) also introduces the concept of delay in optimisation where all cameras and 3D points are optimised only after a certain number of new cameras are incorporated in iterative bundle adjustment. This heuristic introduces some global view in an otherwise sequential method giving it additional robustness with respect to failures when compared to purely incremental methods (Snavely, 2008).

Despite their success, current bundle adjustment approaches suffer from some significant drawbacks. Their computational cost is very large, and they have to carefully avoid getting trapped in poor local minima. Sequential methods reduce computational time but cannot recover from earlier decisions, so methods such as Wu (2013) occasionally fail, leaving the final reconstruction broken into multiple parts. They are also susceptible to drift and the accumulation of errors (Moulon, Monasse, Marlet, 2013, Wu, 2013) and are difficult to parallelise because of their inherently sequential nature.

An alternate solution is provided by motion averaging wherein the geometry of all camera motions is estimated by using pairwise camera motion estimates (Govindu, 2004). Recent work has shown that the motion averaging approaches have met with much success and can efficiently and robustly solve 3D rotations (Chatterjee and Govindu, 2013) and translations (Wilson and Snavely, 2014). But for large SfM datasets, global methods (Jiang, Cui, Tan, 2013, Moulon, Monasse, Marlet, 2013) need to be exceedingly robust to achieve acceptable performance. However, such solutions can effectively provide a good initial estimate for a batch approach to bundle adjustment (henceforth BBA), thereby reducing computational time (Crandall et al., 2013). We also remark that in all these approaches, the largest computational load is actually incurred in the pre-processing step of matching camera pairs. This is true even for methods that employ preemptive matching (Wu, 2013).
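The relationship between pairwise and global camera rotations that motion averaging exploits can be illustrated with a small sketch. It assumes the convention R_ij = R_j R_i^T for the relative rotation between cameras i and j (our assumption for illustration, not taken from the text above), and it simply chains noise-free pairwise rotations outward from camera 0. This is not the robust averaging of Chatterjee and Govindu (2013), which uses all edges and tolerates outliers; the function names are hypothetical.

```python
import numpy as np
from collections import deque

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def relative_rotation(R_i, R_j):
    """Assumed convention: R_ij = R_j R_i^T."""
    return R_j @ R_i.T

def propagate_rotations(n, rel):
    """Chain pairwise rotations breadth-first from camera 0 (fixed to identity).

    rel maps an edge (i, j) to R_ij. The result is defined only up to a
    global rotation, which is why camera 0 is pinned to the identity.
    """
    adj = {k: [] for k in range(n)}
    for (i, j), R_ij in rel.items():
        adj[i].append((j, R_ij))
        adj[j].append((i, R_ij.T))  # R_ji is the inverse of R_ij
    R = {0: np.eye(3)}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j, R_ij in adj[i]:
            if j not in R:
                R[j] = R_ij @ R[i]
                queue.append(j)
    return [R[k] for k in range(n)]

# Toy example: four cameras with known rotations and a connected view graph.
true_R = [rodrigues([0, 0, 1], 0.3), rodrigues([1, 0, 0], -0.5),
          rodrigues([1, 2, 3], 1.1), rodrigues([0, 1, 0], 0.8)]
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
rel = {(i, j): relative_rotation(true_R[i], true_R[j]) for i, j in edges}
est = propagate_rotations(4, rel)
max_err = max(np.abs(est[k] - true_R[k] @ true_R[0].T).max() for k in range(4))
```

With noisy pairwise estimates, chaining along a single tree accumulates error, which is why averaging over all available edges is preferred in practice.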

Farenzena et al. (2009), Toldo et al. (2015) and Gherardi et al. (2010) propose to merge smaller reconstructions in a bottom up dendrogram. Motivated by their work we adopt a hierarchical divide-and-conquer strategy that is designed to mitigate some of the above problems. In essence, our approach partitions the full image dataset into smaller sets that can each be independently reconstructed using global motion averaging followed by batch mode bundle adjustment. Subsequently, by utilising available geometric relationships between cameras across the individual partitions, we solve a series of global registration problems as we move up a hierarchical tree that correctly and accurately place individual 3D reconstructed components into a single global frame of reference. In what follows we show that this new approach is not only more robust with respect to failures in reconstruction but also gives significant improvements over state-of-the-art techniques in terms of computational speed. The main contributions of our paper are:

  1. We use hierarchical agglomerative clustering to partition a large dataset into smaller well constrained base clusters. The bottom-up agglomerative clustering starts with individual images as clusters, and pairs of clusters are merged as we move up the hierarchy until base clusters of a pre-defined number of images are formed. The merging is based on the robustness of the epipolar geometry estimation and the number of epipolar inliers of the images across the clusters. Thus, by definition of agglomerative clustering, each cluster (at any level) represents as tightly constrained an SfM problem as possible, given the problem size. Unlike in Farenzena et al. (2009), Gherardi et al. (2010) and Toldo et al. (2015) wherein the process of 3D reconstruction is initiated at the leaf nodes of the agglomerative tree using pairwise bundle adjustment, we initiate the 3D reconstruction at a node higher up in the tree representing a base cluster of a suitable size. This provides a more global view and yet presents a small enough SfM problem that can be solved reliably and robustly. Our experiments clearly demonstrate that this global view is crucial for a hierarchical strategy to be successful for large-scale SfM problems.

  2. We propose a reconstruction method for base clusters based on global motion averaging followed by batch mode bundle adjustment. We first use the pairwise epipolar relationships of the images within a base cluster to obtain initial estimates of pairwise rotations and translation directions of the cameras and subsequently apply global rotation (Chatterjee and Govindu, 2013) and translation (Wilson and Snavely, 2014) averaging to obtain consistent estimates of the camera poses. We use the camera poses and the feature correspondence tracks to compute an initial estimate of the 3D structure using triangulation. The camera poses and the 3D structure thus obtained are finally refined using a few iterations of batch mode bundle adjustment. Our experiments clearly demonstrate that for the typical sizes chosen for base clusters, the global approach to SfM is more reliable and robust as compared to both sequential approaches like Wu (2013) which have a limited global view and other hierarchical approaches like Farenzena et al. (2009), Gherardi et al. (2010), Ni and Dellaert (2012), Toldo et al. (2015) and Zephyr (2016) which are purely local in nature.

  3. We propose a method for registering the point clouds corresponding to the independently reconstructed base clusters as well as intermediate clusters using pairwise epipolar geometry relationships (Bhowmick et al., 2014). The epipolar based registration technique proposed in this paper is more robust than the standard techniques for registering point clouds using 3D-3D or 3D-2D correspondences. Registration methods based on 3D point correspondences do not use all available information (image correspondences) and may fail when the point clouds do not have a sufficient number of 3D points in common. 3D-2D based methods, such as a sequential bundle adjustment (Snavely, Wu, 2013), often result in broken reconstructions when the number of points available is inadequate for re-sectioning or when common 3D points are removed at the outlier rejection stage (Wu, 2013). Our experiments clearly demonstrate that the proposed registration algorithm using pairwise epipolar geometry alleviates this problem.

  4. We propose an effective method for merging and refining the 3D reconstructions as we move up the hierarchical agglomerative tree. After epipolar registration of clusters we selectively apply a step of batch mode bundle adjustment on the combined clusters. We show that often some clusters that could not be merged effectively lower down in the hierarchy can be merged reliably as we move up the hierarchical tree as larger parts of the scene get accumulated in the reconstruction. Similarly, points that are removed as outliers during reconstruction at the early stages can often be reclaimed by re-sectioning at the later stages as we move up the hierarchical tree.
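The first contribution can be sketched in a few lines. This is a minimal, unoptimised sketch of bottom-up clustering with a cap on cluster size. It assumes the merge score between two clusters is the total epipolar inlier count over the image pairs that connect them; the exact criterion used in the paper is not reproduced here, and `base_clusters`, `inliers` and `max_size` are hypothetical names.

```python
from itertools import combinations

def base_clusters(num_images, inliers, max_size):
    """Greedy agglomerative clustering that stops at a size cap.

    inliers maps an image pair (i, j) with i < j to its epipolar inlier count.
    Returns the base clusters as sorted lists of image indices.
    """
    clusters = [frozenset([i]) for i in range(num_images)]

    def linkage(a, b):
        # Assumed score: sum of inlier counts on edges running between a and b.
        return sum(inliers.get((min(i, j), max(i, j)), 0) for i in a for j in b)

    while True:
        best, best_score = None, 0
        for a, b in combinations(clusters, 2):
            if len(a) + len(b) > max_size:
                continue  # merging would exceed the base cluster size
            score = linkage(a, b)
            if score > best_score:
                best, best_score = (a, b), score
        if best is None:
            # No connected pair can be merged without breaking the size cap.
            return sorted(sorted(c) for c in clusters)
        a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]

# Toy match graph: two tightly connected triplets joined by one weak edge.
toy_inliers = {(0, 1): 100, (1, 2): 80, (0, 2): 60,
               (3, 4): 90, (4, 5): 70, (2, 3): 10}
toy_clusters = base_clusters(6, toy_inliers, max_size=3)
```

In this toy graph the weak edge between images 2 and 3 is never used for merging, so the two strongly connected triplets end up as separate base clusters. Continuing to merge beyond the size cap would give the upper levels of the tree.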

The two other methods that are closest to our approach are the hierarchical SfM of Farenzena et al. (2009), Gherardi et al. (2010), Toldo et al. (2015) and Zephyr (2016) and the divide and conquer method that we had proposed in Bhowmick et al. (2014).

Farenzena et al. (2009), Gherardi et al. (2010) and Toldo et al. (2015) initiate the reconstruction process using pairwise bundle adjustment at the leaf nodes. Because of the purely local nature of the approach without any global look-ahead, the method often fails to grow the reconstructions by merging or re-sectioning. Their use of reprojection errors of common 3D points for merging appears to be unsuitable for very large datasets. Results in Farenzena et al. (2009), Gherardi et al. (2010) and Toldo et al. (2015) are limited to reconstructions with a maximum of 477 images, and our experiments with the code available for both 3DF Samantha (Toldo et al., 2015) and the trial version of 3DF Zephyr (Zephyr, 2016) failed to produce results for some of the larger datasets that we consider (see the experimental results in Figs. 4, 5 and 6).

In Bhowmick et al. (2014) we had proposed a divide and conquer based pipeline wherein we used the normalised cut (Shi and Malik, 2000) for obtaining independent base clusters and used VSFM (Wu, 2013) for solving the within base cluster SfM problems. The individual reconstructions were merged using a preliminary version of the epipolar registration technique proposed in this paper. As a result, the method of Bhowmick et al. (2014) was not hierarchical. We find hierarchical agglomerative clustering to be more suitable than normalised cut because normalised cut gives poorer control over the size of each component, whereas hierarchical clustering produces balanced and well conditioned base clusters resulting in more robust global reconstruction. Also, the cut edges of the multi-way normalised cut partitions are sometimes too weak for epipolar registration, especially when the base clusters’ sizes are poorly balanced. In addition, our experiments demonstrate that for the typical base cluster sizes that we choose for reconstruction, our global reconstruction framework is more robust than VSFM.

The rest of the paper is organised as follows. In Section 2 we give an outline of our hierarchical approach. In Section 3 we provide the details of construction of the agglomerative tree and clustering. In Section 4 we describe our method of global 3D reconstruction based on motion averaging and batch mode bundle adjustment. In Section 5 we describe our technique of epipolar registration and in Section 6 we present our method of hierarchical merging and refinement along with re-sectioning. In Section 7 we present experimental results on several large datasets and finally, in Section 8, we conclude the paper.

Section snippets

Overview of our hierarchical pipeline

Consider an image dataset that covers a large physical region to be reconstructed (Fig. 1(a)). Cameras that are far apart or look at different directions (global) share no geometric relationships whereas cameras that look at common 3D structures (local) are coupled together. The global connectivity of images arises out of the aggregation of local relationships. In other words, any real-world SfM dataset inherently contains local-to-global relationships between cameras. In our approach, we seek

Agglomerative clustering

The match graph constructed using SIFT matching is further refined using pairwise epipolar geometry. For each pair of connected images i and j in the original match graph, we estimate the epipolar geometry using a RANSAC loop based on the five point algorithm (Nister, 2004). The five point algorithm requires initial focal length estimates for essential matrix computation. In all our experiments, we initialize the focal lengths from the image EXIF information. In case such information is not

Reconstruction of independent base clusters

We consider each individual base cluster as an independent SfM problem and solve these using a batch SfM approach. In what follows, we provide more details of these individual steps.

Throughout this paper, we shall use a camera-centric motion model wherein any rigid motion can be modelled as a 3D rotation followed by a 3D translation. Consequently, if P is the coordinate of a 3D point described in a world coordinate system, then the coordinate of this point with respect to a camera whose rotation

Epipolar registration

We register independently reconstructed base clusters as well as intermediate clusters in a single frame of reference using pairwise epipolar geometry relationships only. Let $\Omega$ be the set of reconstructions and let $A \in \Omega$ and $B \in \Omega$ be two independently reconstructed groups of cameras. Let $C_{AB}$ be the set of connecting edges between $A$ and $B$. We first compute

Hierarchical merging, refinement and re-sectioning

The individual reconstructions are merged in common frames of reference as we move up the agglomerative tree. We use the pairwise cluster registration technique using the epipolar relationships of images across clusters as discussed in Section 5.
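The bottom-up control flow of this merging can be sketched as a post-order traversal of the cluster tree. This is only a structural sketch: `Node`, `merge_up`, `reconstruct` and `register` are hypothetical names, the two callbacks stand in for base cluster reconstruction and pairwise epipolar registration, and the selective refinement step is omitted.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A vertex of the cluster tree; leaves here are base clusters."""
    images: List[int] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

def merge_up(node, reconstruct, register):
    """Reconstruct base clusters, then register children pairwise at each vertex."""
    if not node.children:
        return reconstruct(node.images)
    parts = [merge_up(child, reconstruct, register) for child in node.children]
    merged = parts[0]
    for part in parts[1:]:
        merged = register(merged, part)
    return merged

# Toy stand-ins: a "reconstruction" is just the set of images it contains.
call_log = []

def toy_reconstruct(images):
    call_log.append(("reconstruct", tuple(images)))
    return frozenset(images)

def toy_register(a, b):
    call_log.append(("register", len(a), len(b)))
    return a | b

tree = Node(children=[Node(children=[Node([0, 1]), Node([2, 3])]), Node([4, 5])])
result = merge_up(tree, toy_reconstruct, toy_register)
```

Because sibling subtrees do not depend on each other, the recursive calls on the children could in principle run in parallel.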

Registering two clusters at a higher level of the agglomeration tree using epipolar relationships may result in high reprojection errors for common 3D points. If the average reprojection error at a vertex in the agglomerative tree is higher than the

Results

We first present results on the effectiveness of each block of our pipeline and then present results on entire 3D reconstructions on several large datasets (see Table 1).

Conclusions

We have presented a hierarchical pipeline where we agglomerate local relationships between images looking at common structures into global aggregations. The smaller local problems are well conditioned and can be solved reliably using motion averaging followed by bundle adjustment. The independent local reconstructions are merged successively as we move up the hierarchy, and are selectively refined using batch mode global bundle adjustment. Thus, as we move up the tree the successive partial

Acknowledgement

The authors affiliated with the Indian Institute of Technology Delhi (BB, SP and SB) acknowledge the support of the Department of Science and Technology, Government of India under the Indian Digital Heritage programme.

References (31)

  • R. Toldo et al. Hierarchical structure-and-motion recovery from uncalibrated images. Comput. Vision Image Understanding (2015)

  • S. Agarwal et al. Bundle adjustment in the large. Proceedings of the European Conference on Computer Vision (2010)

  • S. Agarwal et al. Building Rome in a day. Proceedings of the International Conference on Computer Vision (2009)

  • B. Bhowmick et al. Divide and conquer: Efficient large-scale structure from motion using graph partitioning. Proceedings of Asian Conference on Computer Vision (2014)

  • A. Chatterjee et al. Efficient and robust large-scale rotation averaging. Proceedings of IEEE International Conference on Computer Vision (2013)

  • D.J. Crandall et al. SfM with MRFs: discrete-continuous optimization for large-scale reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. (2013)

  • M. Farenzena et al. Structure-and-motion pipeline on a hierarchical cluster tree. Proceedings of IEEE International Conference on Computer Vision Workshop on 3-D Digital Imaging and Modeling (2009)

  • J. Frahm et al. Building Rome on a cloudless day. Proceedings of the European Conference on Computer Vision: Part IV (2010)

  • Y. Furukawa et al. Accurate, dense, and robust multi-view stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. (2010)

  • R. Gherardi et al. Improving the efficiency of hierarchical structure-and-motion. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2010)

  • V.M. Govindu. Lie-algebraic averaging for globally consistent motion estimation. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2004)

  • R. Hartley et al. Multiple View Geometry in Computer Vision (2004)

  • M. Havlena et al. Efficient structure from motion by graph optimization. Proceedings of the European Conference on Computer Vision (2010)

  • N. Jiang et al. A global linear method for camera pose registration. Proceedings of IEEE International Conference on Computer Vision (2013)

  • P. Moulon et al. Global fusion of relative motions for robust, accurate and scalable structure from motion. Proceedings of IEEE International Conference on Computer Vision (2013)