1 Introduction

The typical structure-from-motion (SfM) pipeline consists of (i) establishing sparse correspondence between local regions in different images of a (mostly) rigid scene, (ii) exploiting constraints induced by epipolar geometry to obtain initial estimates of the relative pose (position and orientation) between pairs or triplets of views from which the images were captured, where each relative position is determined up to an arbitrary scale, (iii) reconciling all estimates and their scales to arrive at a consistent estimate up to a single global scale, and finally (iv) performing bundle adjustment to refine the estimates of pose as well as the position of the sparse points in three-dimensional (3D) space that gave rise to the local regions in (i), also known as feature points.

As in any cascade method, the overall solution is sensitive to failures in the early stages. While significant effort has gone into designing better descriptors for use in stage (i) of the pipeline, sparse correspondence is intrinsically local and therefore subject to ambiguity. This forces subsequent stages (ii), (iii) to deal with inevitable correspondence failures, often by solving combinatorial matching problems. Stages (ii) and (iv) are well established and are the subject of textbooks. Thus, we home in on the weak link of the pipeline (iii) to develop global alignment methods that are robust to failure of the correspondence stage. Towards this end, we propose a novel efficient approach based on convex optimization that comes with provable recovery guarantees.

Errors in the correspondence stage (i) usually come in two distinct flavors. First, localization error due to quantization artifacts and sensor noise, which can be modeled as independent and identically distributed (i.i.d.) additive perturbations drawn from a normal density with zero mean and constant covariance. Second, mismatches due to gross violations of the assumptions underlying local correspondence: co-visibility, constant illumination, and rigidity. The latter can also be modeled as an additive (non-i.i.d.) perturbation with unknown distribution. Sparse correspondences subject only to the first source of error, often referred to as “noise,” are called inliers, whereas those subject to both are outliers. Following a classical robust statistical approach, we forgo modeling the distribution of outliers, and indeed allow them to behave in an adversarial manner. We seek algorithms with provable guarantees despite such behavior, while simultaneously being efficiently solvable with low-complexity numerical methods.

1.1 Related Work and Contributions

There is a vast literature on sparse matching (i), epipolar geometry (ii) and bundle adjustment (iv) for which we refer the reader to standard Computer Vision textbooks. Stage (iii) can be separated into two parts: global rotation estimation and location recovery. For simplicity, we assume that the intrinsic calibration parameters of all cameras are known.

There are many efficient and stable algorithms for estimating global camera rotations [1, 4, 6–8, 10–12, 17, 18, 20, 22]. Empirically, [24] demonstrates that a combination of filtering, factorization, and local refinement can accurately estimate 3D rotations. Theoretically, [25] proves that rotations can be exactly and stably recovered for a synthetic model by a least unsquared deviation approach on a semidefinite relaxation. Alternatively, in many applications, such as location services from mobile platforms, augmented reality, and robotics, orientation can be estimated far more reliably than location and scale, due to the relatively small gyrometer bias compared to the doubly-integrated accelerometer bias and global orientation references provided by gravity and magnetic field.

We concentrate on the location recovery problem from relative directions based on known camera rotations. There have been many different approaches to this problem, such as least squares [1, 3, 10, 17], second-order cone programs and \(l_\infty \) methods [15–18, 21], spectral methods [3], similarity transformations for pair alignment [22], Lie-algebraic averaging [11], Markov random fields [5], and several others [13, 20, 22, 23]. Unfortunately, many location recovery algorithms either lack robustness to mismatches, at times produce collapsed solutions [20], or suffer from convergence to local minima, in sum causing large errors in (or even complete degradation of) the recovered locations.

Recent advances have addressed some of these limitations: 1dSfM [26] focuses on removing outliers by examining inconsistencies along one-dimensional projections, before attempting to recover camera locations. This method, however, does not reason about self-consistent outliers, which can occur due to repetitive structures, commonly found in man-made scenes. Also, Jiang et al. [14] introduced a method to filter outlier epipolar geometries based on inconsistent triplets of views. Özyeşil and Singer propose a convex program called Least Unsquared Deviations (LUD) and empirically demonstrate its robustness to outliers [19]. While these methods exhibit good empirical performance, they lack theoretical guarantees in terms of robustness to outliers.

Summary of contributions. In this paper, we propose a novel framework for location recovery from pairwise direction observations. This framework, called ShapeFit, is based on convex optimization and can be proven to recover locations exactly in the presence of adversarial corruptions, under rather broad technical assumptions. We introduce two efficient numerical implementations, ShapeFit and ShapeKick (both described in Sect. 2), show how they can be employed to solve location recovery problems arising in SfM with known camera rotations, and extensively validate our methods on benchmark datasets (Sect. 3), showing that our approach achieves significant computational speedups at comparable accuracy.

1.2 Problem Formulation

Let T be a collection of n distinct vectors \(t^{(0)}_{1},t^{(0)}_{2},\ldots ,t^{(0)}_{n} \in \mathbb {R}^d\), and let \(G = ([n],E)\) be a graph, where \([n] = \{1,2\ldots ,n\}\), and \(E = E_g \sqcup E_b\), with \(E_b\) and \(E_g\) corresponding to pairwise direction observations that are respectively corrupted and uncorrupted. The uncorrupted observations are assumed to be noiseless. That is, for each \(ij \in E\), we are given a vector \(v_{ij}\), where

$$\begin{aligned} \begin{array}{ll} v_{ij} = \frac{t^{(0)}_{i}- t^{(0)}_{j}}{\bigl \Vert t^{(0)}_{i}- t^{(0)}_{j}\bigr \Vert _2} &{} \text {for } (i,j) \in E_g,\\ v_{ij} \in S^{d-1}\ \text {arbitrary} &{} \text {for } (i,j) \in E_b. \end{array} \end{aligned}$$
(1)

Consider the task of recovering the locations T up to a global translation and scale, from only the observations \(\{ v_{ij} \}_{ij \in E}\), and without any knowledge about the decomposition \(E = E_g \sqcup E_b \), nor the nature of the pairwise direction corruptions. For \(d=3\), this problem corresponds to (iii) once an estimate of directions is provided.

The location recovery problem is to recover a set of points in \(\mathbb {R}^d\) from observations of pairwise directions between those points. Since relative direction observations are invariant under a global translation and scaling, one can at best hope to recover the locations \(T^{(0)}= \{ t^{(0)}_{1},\ldots , t^{(0)}_{n}\}\) up to such a gauge transformation. That is, successful recovery from \(\{v_{ij}\}_{(i,j) \in E}\) is finding a set of vectors \({\{\alpha (t^{(0)}_{i}+ w)\}_{i \in [n]}}\) for some \(w \in \mathbb {R}^d\) and \(\alpha >0\). We will say that two sets of n vectors \(T = \{t_1,\ldots ,t_n\}\) and \(T^{(0)}\) are equal up to global translation and scale if there exists a vector w and a scalar \(\alpha >0 \) such that \(t_i = \alpha (t^{(0)}_{i}+ w)\) for all \(i \in [n]\). In this case, we will say that \(T \sim T^{(0)}\). The location recovery problem is then stated as:

$$\begin{aligned} \text {Given } ([n], E_g \sqcup E_b) \text { and } \{v_{ij}\}_{ij \in E}, \quad \text {find } \{t_i\}_{i \in [n]} \text { such that } \{t_i\}_{i \in [n]} \sim T^{(0)}. \end{aligned}$$
(2)

Formally, let \(\deg _b(i)\) be the degree of location i in the graph \(([n], E_b)\) and note that we do not assume anything about the nature of corruptions. That is, we work with adversarially chosen corrupted edges \(E_b\) and arbitrary corruptions of observations associated to those edges. To solve the location recovery problem in this challenging setting, we introduce a simple convex program called ShapeFit:

$$\begin{aligned} \min _{\{t_i\}_{i \in [n]}} \; \sum _{ij \in E} \bigl \Vert P_{v_{ij}^\perp }(t_i - t_j) \bigr \Vert _2 \quad \text {subject to} \quad \sum _{ij \in E} \langle t_i - t_j, v_{ij} \rangle = 1, \quad \sum _{i \in [n]} t_i = 0, \end{aligned}$$
(3)

where \(P_{v_{ij}^\perp }\) is the projector onto the orthogonal complement of the span of \(v_{ij}\). The objective in (3) is robust to outliers because it has the structure of an \(\ell _1\) norm of a set of unsquared distances. The constraints act to remove the scale and translational ambiguities.

This convex program is a second order cone problem with dn variables and two constraints. Hence, the search space has dimension \(dn-2\), which is minimal due to the dn degrees of freedom in the locations \(\{t_i\}\) and the two inherent degeneracies of translation and scale.
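For concreteness, the following is a minimal sketch of program (3) written against a generic convex solver (CVXPY). The function name and variable names are ours for illustration only, and this is not the paper's released implementation; the ADMM scheme of Sect. 2 is the scalable route.

```python
# A minimal sketch of the ShapeFit program (3) using CVXPY. The function name
# shapefit_cvx and all variable names are illustrative (not the paper's code).
import numpy as np
import cvxpy as cp

def shapefit_cvx(n, edges, directions):
    """edges: list of index pairs (i, j); directions[k]: observed unit vector
    v_ij (numpy array of shape (3,)) for the k-th edge."""
    T = cp.Variable((n, 3))                       # one row per location t_i
    terms, gauge = [], 0
    for (i, j), v in zip(edges, directions):
        P = np.eye(3) - np.outer(v, v)            # projector onto span(v_ij)^perp
        terms.append(cp.norm(P @ (T[i] - T[j]), 2))
        gauge = gauge + (T[i] - T[j]) @ v
    constraints = [gauge == 1,                    # removes the scale ambiguity
                   cp.sum(T, axis=0) == 0]        # removes the translation ambiguity
    cp.Problem(cp.Minimize(sum(terms)), constraints).solve()
    return T.value                                # n x 3 array of recovered locations
```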

1.3 Theoretical Guarantees and Practical Implications

Although we have established a much broader class of results with respect to the assumptions on locations (see Appendix), we consider here the physically relevant and simple model where pairwise direction observations about n i.i.d. Gaussian camera locations in \(\mathbb {R}^3\) are given according to an Erdős-Rényi random graph G(n, p), which is a graph on n vertices with each pair of vertices (i, j) having an edge with probability p, independently of all other edges. In this setting, ShapeFit (3) achieves exact recovery for any sufficiently large number of locations, provided that a poly-logarithmically small fraction of observations are adversarially corrupted at each node.

Theorem 1

Let G([n], E) be a random graph in G(n, p) with \(p = \varOmega ( n^{-1/5} \log ^{3/5} n)\). Choose the locations of the vertices \( t^{(0)}_1, \ldots , t^{(0)}_n \in \mathbb {R}^3\) to be i.i.d. vectors from the normal distribution \(\mathcal {N}(0, I_{3 \times 3})\), and measure the pairwise directions \(v_{ij}\in \mathbb {S}^2\) between adjacent vertices. Choose an arbitrary subgraph \(E_b\) satisfying \(\max _i \deg _b(i) \le \gamma n\) for some positive \(\gamma \). Corrupt these pairwise directions by applying an arbitrary modification to \(v_{ij}\in \mathbb {S}^2\) for \(ij \in E_b\).

For \(\gamma = \varOmega (p^5 / \log ^3 n)\) and sufficiently large n,  ShapeFit achieves exact recovery with high probability. More precisely, with probability at least \(1- \frac{1}{n^{4}},\) the convex program (3) has a unique minimizer equal to \(\left\{ \alpha \Bigl (t^{(0)}_{i}- \bar{t}^{(0)}\Bigr )\right\} _{i \in [n]}\) for some positive \(\alpha \) and for \(\bar{t}^{(0)}= \frac{1}{n}\sum _{i\in [n]} t^{(0)}_{i}\).

That is, provided the locations are i.i.d. Gaussian and the underlying graph of observations is Erdős-Rényi, ShapeFit is exact with high probability simultaneously for all corruption subgraphs of bounded degree with adversarially corrupted directions. To the best of our knowledge, our algorithms are the first to rest on theoretical results guaranteeing location recovery in the challenging case of corrupted pairwise direction observations.

The above result gives us confidence in the robustness of the method we propose, which is validated empirically in Sect. 3 and supported theoretically in the Supplementary Material. Our main contribution in this paper is the design of efficient implementations based on the theory. Indeed, our empirical assessment shows that we can improve computational efficiency by one order of magnitude at accuracy roughly equal to that of existing state-of-the-art methods.

The efficiency and robustness of our method suggests its use as an alternative to the standard SfM pipeline for real-time applications, by replacing camera-to-camera direction estimation and triangulation with a single corruption-robust simultaneous recovery of camera locations and 3D structure; that is, by compressing two steps of the usual pipeline (ii)–(iii) into a single robust inference step. This alternative applies to the case where rotations are known, for example through inertial measurements. It transforms the location recovery problem, where both camera locations and 3D points are represented as nodes in the graph, into a variant where the graph is bipartite, with edges only between camera positions and the 3D structure points. In the Supplementary Material, we present experimental results for the bipartite case.

1.4 Proof Outline of Theorem 1

A complete proof of Theorem 1 is involved and is included in the Appendix. A rough proof outline is as follows. Consider the true locations \(t_1^{(0)}, \ldots , t_n^{(0)}\) and a feasible perturbation \(t_i^{(0)} + h_i\). For any \((i,j) \in E_g\), the objective increases from zero to \(\Vert P_{(t_i^{(0)}-t_j^{(0)})^\perp }(h_i - h_j)\Vert _2\), while for \((k,l) \in E_b\) the objective may decrease by as much as \(\Vert h_k-h_l \Vert _2\). Optimality thus requires

$$ \sum _{ (i,j) \in E_g } \Vert P_{(t_i^{(0)}-t_j^{(0)})^\perp }(h_i - h_j)\Vert _2 > \sum _{(k,l) \in E_b} \Vert h_k- h_l\Vert _2. $$

We call, for any \((i,j) \in E\), \( \Vert P_{ t_{ij}^{(0)}}( h_i - h_j )\Vert _2\) and \( \Vert P_{ (t_{ij}^{(0)})^\perp }( h_i - h_j )\Vert _2\), where \(t_{ij}^{(0)} = t_i^{(0)} - t_j^{(0)}\), the parallel and orthogonal deviation of \(h_i-h_j\), respectively, and show separately that (i) orthogonal and (ii) parallel deviations on the bad edges induce sufficient orthogonal deviation on the good edges.

The proof strategy in both cases is combinatorial propagation of a local geometric property. For case (i) we establish that if a collection of triangles in \(\mathbb {R}^3\) share the same base and the locations opposite the base are sufficiently “well-distributed,” then an infinitesimal rotation of the base induces infinitesimal rotations in edges of many of the triangles. Then, for each corrupted edge (kl) we ensure that we can find sufficiently many triangles in the observation graph with two good edges and base (kl), with locations at the opposing vertices being “well-distributed.” Case (ii) is more nuanced and requires strongly using the constraints of the ShapeFit program. Here the local property is that for a tetrahedron in \(\mathbb {R}^3\) with well distributed vertices, any discordant parallel deviations on two disjoint edges induce enough infinitesimal rotational motion on some other edge of the tetrahedron. Combinatorial propagation is then handled in two regimes of the relative balance of parallel deviations on the good and corrupted subgraphs.

2 Numerical Approach

We now study the efficient numerical minimization of problem (3) via the alternating direction method of multipliers (ADMM). The ADMM approach is advantageous because each sub-step of the algorithm is efficiently solvable in closed form. Also, unlike simple gradient methods, the ADMM method does not require smoothing/regularization of the \(\ell _2\) norm penalty, which would result in poor conditioning. Problem (3) can be reformulated as

$$\begin{aligned} \min _{T \in \mathcal G,\, Y \in \mathbb R^{|E| \times 3}} \; \sum _{ij \in E} \Vert P_{v_{ij}^\perp } (y_{ij}) \Vert _2 \quad \text {subject to} \quad y_{ij} = t_i - t_j \ \text { for all } ij \in E. \end{aligned}$$
(4)

For notational simplicity, we have removed the constraints on \(\{t_i\}\) and written the problem as a minimization over the set of all gauge-normalized point clouds in \(\mathbb {R}^3,\) denoted

$$\mathcal G=\{T \in \mathbb {R}^{n\times 3} |\sum _{ij \in E} \langle t_i - t_j, v_{ij} \rangle = 1, \quad \sum _{i=1}^n t_i = 0 \}$$

where \(T=\{t_i\}\) is a collection containing n vectors in \(\mathbb R^3\).

To derive an ADMM method, we now write the (scaled) augmented Lagrangian for (4), which is [2, 9]

$$\begin{aligned} \mathcal L_\rho (T,Y, \lambda )= \sum _{ij \in E} \Vert P_{v_{ij}^\perp } (y_{ij}) \Vert _2 +\frac{\rho }{2} \sum _{ij \in E} \Vert t_i - t_j - y_{ij} + \lambda _{ij}\Vert ^2, \end{aligned}$$
(5)

where \(\rho \) is a constant stepsize parameter, and \(\lambda = \{ \lambda _{ij}\}\) contains Lagrange multipliers. The solution to the constrained problem (4) corresponds to a saddle point of the augmented Lagrangian that is minimal for T and Y while being maximal with respect to \(\lambda .\) ADMM finds this saddle point by iteratively minimizing \( \mathcal L_\rho (T,Y, \lambda )\) for T and Y,  and then using a gradient ascent step to maximize for \(\lambda .\) The corresponding updates are

$$ {\left\{ \begin{array}{ll} T \leftarrow \mathop {{{\mathrm{arg\,min}}}}\limits _{T \in \mathcal G} \,\, \mathcal L_\rho (T,Y, \lambda ) \\ Y \leftarrow \mathop {{{\mathrm{arg\,min}}}}\limits _{ Y \in \mathbb R^{|E|\times 3}} \,\, \mathcal L_\rho (T,Y, \lambda )\\ \lambda _{ij} \leftarrow \lambda _{ij}+ t_i - t_j - y_{ij} . \end{array}\right. } $$

The minimization for T is simply a least-squares problem. Let \(R: \mathbb R^{n\times 3} \rightarrow \mathbb R^{|E|\times 3}\) be a linear operator such that the kth row of RT is \(t_i-t_j,\) where (ij) is the kth edge in E. The T update now has the form

$$T \leftarrow \mathop {{{\mathrm{arg\,min}}}}\limits _{T\in \mathcal G} \Vert RT -Y+\lambda \Vert ^2.$$

The solution to this minimization is found simply by computing the (possibly sparse) factorization of R,  and applying a rank-1 update (i.e., using the Sherman-Morrison formula) to account for the linear constraints in \(\mathcal G.\)
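As an illustration, here is one possible dense realization of this update (a sketch with our own variable names, not the paper's implementation). The minimum-norm least-squares solution is automatically centered, since constant shifts lie in the null space of R, and the single scale constraint is then enforced through a Lagrange-multiplier correction, in the spirit of the rank-1 update mentioned above.

```python
# Sketch of the T-update: solve min ||RT - (Y - lambda)||_F^2 over T in G.
# Dense linear algebra for clarity; a sparse factorization of R is used in practice.
import numpy as np

def update_T(R, V, Y, lam):
    """R: |E| x n edge-difference matrix (row e has +1 at i, -1 at j);
    V: |E| x 3 matrix whose rows are the observed directions v_ij;
    Y, lam: |E| x 3 splitting variables and scaled multipliers."""
    B = Y - lam
    T0 = np.linalg.lstsq(R, B, rcond=None)[0]        # min-norm LS solution (centered)
    A = R.T @ V                                      # gradient of sum <t_i - t_j, v_ij>
    X = np.linalg.lstsq(R.T @ R, A, rcond=None)[0]   # (R^T R)^+ A
    mu = (1.0 - np.sum(A * T0)) / np.sum(A * X)      # multiplier enforcing the gauge constraint
    return T0 + mu * X
```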

We now examine the update for y. Let \(z_{ij}=t_i - t_j + \lambda _{ij}.\) The updated value of \(y_{ij}\) is then the minimizer of

$$\begin{aligned} \Vert P_{v_{ij}^\perp }(y_{ij})\Vert _2 + \frac{\rho }{2}\Vert z_{ij}-y_{ij}\Vert ^2 = \Vert P_{v_{ij}^\perp }(y_{ij})\Vert _2 + \frac{\rho }{2}\Vert P_{v_{ij}^\perp }(z_{ij}-y_{ij})\Vert ^2 +\frac{\rho }{2}\Vert P_{v_{ij}}(z_{ij}-y_{ij})\Vert ^2. \end{aligned}$$

The minimum of this objective has the closed form

$$\begin{aligned} y_{ij} \leftarrow P_{v_{ij}} (z_{ij})+ {{\mathrm{shrink}}}(P_{v_{ij}^\perp } (z_{ij}), 1/\rho ). \end{aligned}$$
(6)
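A sketch of this update in code (our names, assuming numpy) is below; shrink is the standard block soft-thresholding operator.

```python
# Sketch of the y-update (6). shrink(x, t) = argmin_u ||u||_2 + (1/(2t)) ||x - u||_2^2.
import numpy as np

def shrink(x, t):
    nrm = np.linalg.norm(x)
    return np.zeros_like(x) if nrm <= t else (1.0 - t / nrm) * x

def update_y(z_ij, v_ij, rho):
    """z_ij = t_i - t_j + lambda_ij; v_ij is the observed unit direction."""
    par = np.dot(z_ij, v_ij) * v_ij      # P_{v_ij} z_ij, the component along v_ij
    perp = z_ij - par                    # P_{v_ij^perp} z_ij
    return par + shrink(perp, 1.0 / rho)
```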

Finally, it is known that the convergence of ADMM for 1-homogeneous problems is (empirically) very fast for the first few iterations, and then convergence slows down. For real-time applications, one may prefer a more aggressive algorithm that does not suffer from this slowdown. Slowdown is often combated using kicking [24], and we adopt a variant of this trick to accelerate ShapeFit. The kicking procedure starts with a small value of \(\rho ,\) and iterates until convergence stagnates (the values of y become nearly constant). We then increase \(\rho \) by a factor of 10, and run the algorithm until it slows down again. The ShapeKick approach drastically reduces runtime when moderate accuracy is needed; however, it generally produces higher numerical errors than simply applying the un-kicked ADMM for a very long period of time.

3 Numerical Experiments

For empirical validation we adopt here the data and protocol of the most common benchmarks for SfM and location recovery. We first verify that when the data is generated according to a model that satisfies the assumptions of the analysis, we indeed witness exact recovery despite a large fraction of corruptions. We then report representative results on benchmark datasets in a variety of experimental settings, where in some cases the assumptions may be violated. Although there is considerable performance variability among different methods on different datasets and no uniform winner, our scheme is competitive with the state-of-the-art in terms of accuracy, but at a fraction of the computational cost. The results are summarized in Sect. 3.3.

Fig. 1. RFE (8) results for ShapeFit and LUD on synthetic noiseless + corrupted and noisy + corrupted data. The grayscale intensity of each pixel corresponds to the average RFE over 10 random trials, as a function of the edge probability p and corruption probability q. Direction observations are generated by (7) with \(\sigma = 0\) for the top two tables and with \(\sigma = 0.05\) for the bottom two tables.

3.1 Experiments on Synthetic Data

In this section we validate ShapeFit on synthetic data and compare its performance with that of the LUD algorithm of [19], with both algorithms implemented in our ADMM framework. In particular we report on ShapeFit’s exact location recovery from partially corrupted pair-wise directions and stable recovery from noisy and partially corrupted directions. The LUD method also exhibits both of these phenomena, and we compare the (empirical) phase transition diagrams of both methods in identical regimes.

The locations \(\{t_i \}_{i=1}^n\) to be recovered are i.i.d. \(\mathcal {N}(0,I_{3\times 3})\). The graph of pair-wise observations G([n], E) is drawn independently from the Erdős-Rényi model \(\mathcal {G}(n,p)\); that is, each edge (i, j) is in E with probability p, independently from all other edges. Having drawn locations \(\{t_i \}_{i=1}^n\) and G([n], E), consider i.i.d. random variables \(\eta _{ij} \sim \mathcal {N}(0,I_{3 \times 3})\) for \((i,j) \in E\), independent from all other random variables, and let

$$\begin{aligned} \tilde{v}_{ij} = {\left\{ \begin{array}{ll} \eta _{ij} &{} \text {with probability} \, q \\ \frac{t_i -t_j}{\Vert t_i - t_j\Vert _2} + \sigma \eta _{ij} &{} \text { otherwise,} \end{array}\right. } \end{aligned}$$
(7)

where \(\sigma \ge 0\) controls the noise level and the assignments are made independently on each edge in E. We then obtain pair-wise direction observations as \(v_{ij} = \frac{\tilde{v}_{ij}}{\Vert \tilde{v}_{ij}\Vert _2}\) for each \((i,j) \in E\), and thus \(v_{ij}\) is a random direction on the unit sphere with corruption probability q and is a noisy version of the true pair-wise direction with probability \(1-q\).
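A sketch of this generative model in code (variable names ours) is given below; it produces the locations, the Erdős-Rényi edge set, and the possibly corrupted and noisy unit direction observations.

```python
# Sketch of the synthetic observation model (7); names are ours for illustration.
import numpy as np

def synthetic_directions(n, p, q, sigma, seed=0):
    rng = np.random.default_rng(seed)
    t = rng.standard_normal((n, 3))               # i.i.d. N(0, I_3) locations
    edges, obs = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() >= p:                 # Erdos-Renyi edge sampling
                continue
            eta = rng.standard_normal(3)
            if rng.random() < q:                  # corrupted edge: random direction
                v = eta
            else:                                 # noisy version of the true direction
                d = t[i] - t[j]
                v = d / np.linalg.norm(d) + sigma * eta
            edges.append((i, j))
            obs.append(v / np.linalg.norm(v))     # project onto the unit sphere
    return t, edges, np.array(obs)
```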

Fig. 2. Mean RFE for ShapeFit and LUD on synthetic data, as a function of the noise parameter \(\sigma \).

We evaluate recovery performance in terms of a relative Frobenius error (RFE). For any set of locations \(\{x_i\}_{i=1}^n\) in \(\mathbb {R}^d\), let \(T(x_1,\ldots ,x_n)\) be a \(d \times n\) matrix with \(i^{\text {th}}\) column given by \(x_i - \frac{1}{n}\sum _{j=1}^n x_j\). Define \(T_0 = T(t_1,\ldots ,t_n)\) as the matrix of original locations and let \(\{\hat{t}_i\}_{i=1}^n\) be the set of recovered locations. Define \(\hat{T} = T(\hat{t}_1,\ldots ,\hat{t}_n)\). Then the RFE is given by

$$\begin{aligned} \text {RFE}(T_0,\hat{T}) = \left\| T_0/\Vert T_0\Vert _F -\hat{T}/\Vert \hat{T}\Vert _F \right\| _F, \end{aligned}$$
(8)

which accounts for the global translation and scale ambiguity, where \(\Vert .\Vert _F\) is the Frobenius norm on \(\mathbb {R}^{d \times n}\). Note that an RFE of zero corresponds to exact recovery.
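In code, the RFE can be computed as in the following sketch (names ours):

```python
# Sketch of the RFE metric (8): center, normalize to unit Frobenius norm, compare.
import numpy as np

def rfe(T_true, T_hat):
    """T_true, T_hat: n x 3 arrays whose rows are locations."""
    def normalized(T):
        C = (T - T.mean(axis=0)).T                # d x n matrix of centered columns
        return C / np.linalg.norm(C, 'fro')
    return np.linalg.norm(normalized(T_true) - normalized(T_hat), 'fro')
```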

For each pixel in the phase diagrams of Fig. 1, we generate 10 independent random recovery problems as described above, recover locations using ShapeFit (left column) or LUD (right column), and record the average RFE as a grayscale intensity. The first set of experiments considers recovery from partially corrupted and otherwise noiseless (\(\sigma =0\)) directions. We note that in the top row of phase diagrams for both methods, we see exact recovery from partially corrupted direction observations (we define exact recovery as RFE \( < 10^{-9})\). ShapeFit has a wider region of exact recovery in the (p, q) parameter space, exhibiting exact recovery at corruption levels of up to between \(10\,\%\) and \(50\,\%\) (depending on p and n), while LUD stops being exact at around \(20\,\%\) corruption.

The second set of experiments considers recovery from partially corrupted and otherwise noisy (\(\sigma > 0)\) directions. We take \(\sigma = 0.05\) to generate the bottom row of tables and consider phase transitions on a coarser scale of RFE. We see that recovery from noisy and partially corrupted direction observations is stable for both ShapeFit and LUD. ShapeFit has a more favorable recovery profile at the lower range of corruptions, in that its recovery is more accurate than LUD's up to its rapid phase transition, while LUD's performance starts to degrade at a lower level of corruption yet continues to provide meaningful recovery slightly above the corruption level of ShapeFit's phase transition.

In Fig. 2 we provide further numerical experiments illustrating that the recovery performance of ShapeFit and LUD degrades gracefully with noise.

3.2 Setup of Experiments on Real Data

We validate our method ShapeFit (3) on 13 benchmark datasets containing irregular collections of images of real scenes from [26]. We compare its performance to that of LUD [19] and 1dSfM [26]. We implement two fast versions of ShapeFit and a fast version of LUD based on the alternating direction method of multipliers (ADMM) and an aggressive step-size selection method. We refer to the faster of the two implementations of ShapeFit as ShapeKick. To solve 1dSfM we use code provided by [26]. We perform our experiments on an Intel(R) Core(TM) i5 CPU with 2 cores, running at 2.6 GHz. A unique aspect of our numerical comparisons is that we run ShapeFit, ShapeKick, LUD, and 1dSfM on the same problem instances generated from several different regimes. Thus, we measure head-to-head performance on the location recovery task objectively. We emphasize that we do no dataset-specific tuning of the recovery algorithms. For each problem instance, we report the median and mean Euclidean distance error between estimated camera locations and the ground truth for each algorithm, as in [19, 26].

To generate problem instances for each dataset, we first solve for global camera rotations using the method of Govindu [4], then solve for relative directions between cameras using epipolar geometry; for both of these steps we use code provided by Snavely and Wilson [26]. After this step, we have obtained directions between cameras and directions between cameras and 3D structure points, all in the same reference frame. Let \(G_s([n] \times [m], E_c)\) be the obtained bipartite graph of directions between the n camera locations and m structure points (where we associate the appropriate direction to each edge), and similarly let \(G_c([n],E_l)\) be the obtained graph of directions between the n camera locations. To generate problem instances, we consider directions computed as functions of \(G_s \sqcup G_c\).

The first problem instance regime is that of using robust PCA to re-compute pairwise direction estimates between cameras, as used by Özyeşil and Singer in [19]. We use code provided in [19] and refer to this regime as Robust PCA. We also generate problem instances using the greedy pruning technique used by [26], which proceeds by greedily selecting a subset \(G_{s}^{(k)}\) of \(G_s\) to ensure that each pair of selected cameras has at least k co-visible structure points via edges in \(G_{s}^{(k)}\), where k is an integer parameter. We use code provided by Snavely and Wilson to generate these subgraphs \(G_{s}^{(k)}\) of camera-to-structure directions for \(k=6\) and \(k=50\) [26]. We consider \(G_{s}^{(k)} \sqcup G_c\) as the resulting two problem instances, referred to as Monopartite \(k=6\) and \(k=50\). These Monopartite problem instances are exactly the same as those generated in [26]. Finally, we consider purely bipartite versions of these problems, by keeping just the camera-to-structure directions \(G_{s}^{(k)}\) for \(k=6\) and \(k=50\) and ignoring translation estimates from epipolar geometry. We refer to these instances as Bipartite \(k=6\) and \(k=50\). Thus, the bipartite problem instances are strict subsets of the monopartite instances and do not require any epipolar geometry to set up, aside from global rotation estimation. In sum, this gives five problem instances per dataset.

As in [19, 26], we take as ground truth the camera location estimates from a sequential SfM solver, provided by Snavely and Wilson. To compute the global translation and scale between recovered solutions and the ground truth, we use a RANSAC-based method as in [26], using their code.

Table 1 shows the median and mean reconstruction errors (without bundle adjustment) for seven recovery algorithms on thirteen datasets under two monopartite problem formulations. Table 2 reports runtimes needed by different methods to set up and solve translation problems. Table 3 in the supplementary material shows the reconstruction errors under three additional problem formulations, including both bipartite formulations. Table 4 in the supplementary material shows the runtimes for these additional problem formulations. In Table 1, the best median error (among all algorithms) for each dataset and formulation is marked in bold. The best median error (among all algorithms and all five formulations) is marked in red and with an asterisk. The seven algorithms considered are: ShapeKick, ShapeFit, LUD, 1dSfM outlier removal followed by nonlinear least squares solver, 1dSfM followed by a Huber loss solver, 1dSfM followed by ShapeKick, and 1dSfM followed by LUD. The recovery errors are relative to the estimates from [26], which were computed by a sequential SfM solver. Ties are resolved by less significant digits not displayed.

Table 1. Median (\(\tilde{e}\)) and mean (\(\hat{e}\)) reconstruction errors (in meters) across multiple datasets and problem formulations. \(N_c\) and \(N_l\) denote the number of camera locations and the number of directions (camera-to-camera and camera-to-structure), respectively. The best performing algorithm in each row is bolded. For each dataset, the best performing combination of algorithm and problem instance is starred with an asterisk.

3.3 Summary and Analysis of Experiments on Real Data

We observe that ShapeKick with 1dSfM outlier removal is a competitive method for location recovery from directions, as measured by median reconstruction error. Table 1 shows that the combination of 1dSfM with ShapeKick has the smallest median reconstruction error for eight of the thirteen datasets. The combination of 1dSfM and Huber has the smallest median error for three datasets. ShapeKick without 1dSfM has the smallest median error for one dataset. Finally, ShapeFit without 1dSfM has the smallest median error for one dataset (see Table 3 in the supplementary material).

We observe that ShapeKick is faster than previously published location recovery algorithms by a factor of 10–50. Table 2 shows that in all cases, ShapeKick with or without 1dSfM is the fastest translation algorithm by a wide margin. ShapeKick with 1dSfM is typically slower than ShapeKick alone by up to a factor of two. In a few cases, ShapeKick with 1dSfM is faster than ShapeKick alone because the outlier removal permits faster numerical convergence.

Table 2. Running times for the algorithms in seconds. \(T_\text {rot}\) and \(T_\text {trans}\) provide the time to solve the rotations problem and to set up the translation problem, respectively. Columns 4–10 present the times for solving the translations problem by our implementations of the respective algorithms.

We observe that ShapeKick can sometimes result in lower reconstruction errors than ShapeFit. This effect is possible because the output of ShapeFit is not equal to ground truth. Hence, the output of ShapeKick, which is an approximation of the output of ShapeFit, may return higher or lower reconstruction errors, especially after the outlier-tolerant RANSAC-based error estimation.

We observe that camera-to-camera measurements, though noisy and not directly measured, act to stabilize the location recovery problem in these phototourism datasets. Table 1 shows that the smallest median reconstruction error is achieved by the monopartite \(k=6\) formulation for seven datasets. The monopartite \(k=50\) formulation is best for three datasets. The Robust PCA formulation is best for two datasets. Finally, the bipartite \(k=6\) formulation is best for one dataset (see Table 3 in the supplementary material).

We observe that outlier filtering by 1dSfM enhances the outlier tolerance of convex methods like ShapeKick, ShapeFit, and LUD. As an example, consider the Roman Forum dataset under a monopartite \(k=6\) formulation. The recovery errors of SK, SF, and LUD decrease by a factor of 2–10 with 1dSfM filtering.

We observe that outlier-robust location recovery methods are helpful even if 1dSfM is used to initially filter outliers. Notice that for all reported simulations, the outlier-intolerant NLS algorithm never has the smallest median error.

Finally, we comment on the choice of mean versus median recovery error as a metric. Mean errors are susceptible to recovered locations that are outliers. Median errors more accurately measure the overall shape of the set of locations. Table 1 reveals that the LUD method has significantly lower mean reconstruction errors than any other method. The mean reconstruction errors of ShapeKick, while higher than those of LUD, are still much smaller than those of 1dSfM with a Huber loss minimization. Thus, the output of LUD typically does not contain significant outliers, and the output of ShapeFit contains outliers that are less significant than those from 1dSfM with a Huber loss.

4 Conclusion

We propose a simple convex program called ShapeFit for location recovery, which comes with theoretical guarantees of exact location recovery from partially corrupted pairwise observations. We propose a highly efficient numerical framework and use it to implement ShapeFit and LUD, producing runtime speedups of 10\(\times \) or more over other implementations. Our fastest version of ShapeFit, called ShapeKick, is consistently at least 10\(\times \) faster than previously published location recovery methods. We provide experiments on synthetic data illustrating exact recovery and stability of ShapeFit and LUD, and a thorough empirical comparison between ShapeFit, LUD, and 1dSfM on real data shows comparable reconstruction performance between the methods.

We stress that our algorithm is the first to rely on provable performance guarantees despite adversarial corruptions. Such corruptions include photometric ambiguities due to repeated structures in man-made environments, a common occurrence in SfM. We have validated the results on synthetic datasets, as well as on public benchmarks, and demonstrated that ShapeFit achieves comparable reconstruction error to state of the art methods with a 10\(\times \) speedup.