1 Introduction

Recent work in Structure from Motion (SfM) has demonstrated the possibility of reconstructing geometry from large photo collections [1, 16]. Efficient non-linear refinement [3, 34] of camera and point parameters has been developed to produce optimal reconstructions. However, the important information about reconstruction quality [12] is missing in state-of-the-art pipelines [28, 33] and optimization solvers [2] because of its computational and memory requirements. A quality evaluation that is fast enough and has reasonable memory requirements could be used to filter cameras and 3D points before Bundle Adjustment, which usually runs in each step of SfM, and thus remove a bottleneck of reconstruction pipelines [28, 29]. A second use is sophisticated smoothing along the most unconstrained directions in dense reconstructions [19, 27].

The mathematics of uncertainty propagation is well understood [15], and specific equations have been derived for many camera setups, e.g. for stereo rigs [22, 25], laser scans [17, 26], lines [4] and edges [6]. Except for a few works, e.g. [20], general error propagation has not received significant attention in the computer vision community. We believe that this is due to the high computational complexity of uncertainty propagation, which, as a function of the number of camera parameters, is cubic in time and quadratic in memory. To evaluate the quality, classical photogrammetric methods propagate the uncertainty of image measurements to the uncertainty of the estimated parameters.

General propagation requires transporting the covariances of the measurements to the parameters. The non-linear projection function is usually replaced by its first order approximation, and the covariance of the estimated parameters is then computed by inverting their information matrix. The common objective function [15] of reconstruction from images is the sum of squared reprojection errors.

Camera rotation angles and radial distortion parameters are usually much smaller than the coordinates of 3D points and, at the same time, have a large impact on the objective function. Hence, the Jacobian of the objective function often contains a large range of values, which is further enlarged by squaring when forming the information matrix. Thus, the information matrix becomes numerically rank deficient for larger collections of images. The information matrix can be decomposed into blocks, which simplifies its inversion. The block related to 3D point parameters is, up to special artificial configurations, a full-rank block-diagonal matrix which can be easily inverted. The block related to camera parameters is over-parametrized. To invert the sub-matrix of camera parameters, we have to find the closest one-to-one mapping from measurements to parameters with respect to the Mahalanobis norm [15]. The inversion should be computed as the Moore-Penrose (MP) inverse [2]. The state-of-the-art algorithm uses the Singular Value Decomposition (SVD) [20]. It works well for problems with a few hundred photos. As the number of images in reconstructions grows up to several hundred thousand, the efficiency and scalability of the quality evaluation become a critical issue.

We present an iterative algorithm based on the Taylor expansion of the Schur complement matrix Z at a regularization point \(\lambda \), which computes an approximation of the MP inverse. This algorithm is approximately three times faster than SVD and approximately halves the memory requirements. The second improvement is that we use the structure of the matrices when propagating uncertainty from camera parameters to point parameters, which speeds up this particular step by about two orders of magnitude on real datasets. The output of our work is publicly available source code which can be used as an external library in nonlinear optimization pipelines like Ceres Solver [2]. The code, test problems, and detailed experiments are available online at http://cmp.felk.cvut.cz/~policmic.

The rest of the paper is organized as follows. Section 2 overviews previous work. Then, Sect. 3 formulates the problem of uncertainty propagation and Sect. 4 introduces our Taylor expansion algorithm. Section 5 describes how to use the structure of the information matrix to compute covariances of point parameters. Finally, Sect. 6 reports the results of the experiments.

2 Previous Work

Many methods for 3D reconstruction of camera motion and scene geometry from a large number of images have been developed [1, 13, 16]. To evaluate the quality of the reconstruction, classical photogrammetric methods propagate the uncertainty of image measurements to the uncertainty of the estimated parameters. The propagation is often based on first order approximations [12, 20, 21]. This classical approach becomes computationally challenging for a large number of cameras and points [30, 32] and hence is often not fully implemented in computer vision reconstruction pipelines [2].

Let us next review previous work related to computing uncertainties in 3D reconstruction from images. Blostein and Huang [8] provided closed form expressions for evaluating the parameters of the error distribution of the disparity for a stereo rig. Confidence intervals for a stereo setup were presented in [18]. The work [25] minimizes the mean square error of the measurements to compute optimal camera poses together with covariance matrices representing their uncertainties. The influence of calibration errors on the errors of 3D reconstruction in a stereo setup was investigated in [22]. In the recent paper [5], the uncertainties of estimated parameters are studied for each step of a stereo-vision based 3D reconstruction (e.g., calibration, contour segmentation, matching, and reconstruction), and used to estimate the error distribution of the output geometry. Inverse perspective equations describing 3D line errors were investigated in [4] and later generalized to reconstruction from edges in images [6]. In [23], the uncertainties are expressed as a likelihood map of 3D point positions created by aggregating local linear extrapolations computed from weighted least squares fits of 2D curves between 3D points.

The most relevant previous work. The most relevant work is by Lhuillier and Perriollat [20], who investigated forming the information matrix of the estimated parameters and its inverse. Lhuillier computed the MP inverse of the Schur complement [35] of the sub-matrix of point parameters using the SVD algorithm. We extend Lhuillier's work with a faster MP inverse, and we improve the speed of the 3D point covariance computation using the structure of the Jacobian. A general algorithm for forward/backward covariance propagation for linear/nonlinear systems, with or without over-parametrization, was also described by Hartley in the book [15].

3 Problem Formulation

We next review the standard approach for computing covariances of reconstructed parameters. Our contributions are described in Sects. 4 and 5.

We consider a setup with n cameras \(C = \{C_1, C_2, \dots ,C_n\}\), where \(C_i \in \mathbb {R}^p\) is a p-dimensional vector of camera parameters, m points \(X = \{X_1, X_2, \dots , X_m\}\) in 3D, and k image observations represented by a vector \(\varvec{u} \in \mathbb {R}^{2k}\). Each observation \(\varvec{u}_{i,j} \in \varvec{u}\), i.e. an image point, is a projection of the 3D point \(X_j\) by the camera \(C_i\), using the projection function \(\varvec{p}(C_i,X_j)\). The noise \(\epsilon _{i,j}\) has zero mean, \(E(\epsilon _{i,j}) = 0\), and variance \(V(\epsilon _{i,j})\). All pairs of indices (i, j) are in an index set S that determines which point \(X_j\) is visible in which camera view \(C_i\).

$$\begin{aligned} \varvec{u}_{i,j} = \varvec{p}(C_i, X_j) + \epsilon _{i,j} \quad \quad \forall (i,j) \in S \end{aligned}$$
(1)

Further, we define the vector \(\theta \) which represents all 3D reconstruction parameters \([C_1, \dots , C_n, X_1, \dots , X_m]\), the vector \(\epsilon \) which gathers all \(\epsilon _{i,j}\) with \((i,j) \in S\), and the function f which is composed of the projection functions \(\varvec{p}\) and maps the vector \(\theta \) to the image observations \(\varvec{u}\).

$$\begin{aligned} \varvec{u} = f(\theta ) + \epsilon \end{aligned}$$
(2)

The function (2) leads to a non-linear least squares optimization

$$\begin{aligned} \theta ^* = \mathop {\text {arg min}}_{\theta } \Vert { \varvec{u} - f(\theta ) }\Vert _2^2 \end{aligned}$$
(3)

minimizing the sum of squares of the differences \(r(\theta ) = \varvec{u} - f(\theta )\) between observations \(\varvec{u}\) and reprojections \(f(\theta )\). The standard way to solve the nonlinear differentiable system is to use its first order approximation

$$\begin{aligned} r(\theta + \delta \theta ) \approx r(\theta ) + J(\theta )\,\delta \theta \end{aligned}$$
(4)

One step \(\delta \theta \) towards the solution \(\theta ^*\) is obtained by solving

$$\begin{aligned} J^\top J\, \delta \theta = -J^\top r \end{aligned}$$
(5)

with \(J(\theta )\) and \(r(\theta )\) replaced by the simpler J and r. Using the Gauss-Markov theorem, the covariance \(\hat{V}(\theta )\) can be computed as

$$\begin{aligned} \hat{V}(\theta ) = \sigma ^2 (J^\top J)^{-1} \end{aligned}$$
(6)

The parameter \(\sigma \) can be estimated from the residuals as \(\sigma ^2 \approx \Vert {r}\Vert ^2\)/\((p\,n)\), as described by Hartley [15]. If the columns of the Jacobian are permuted so that cameras are followed by points, i.e. \(J = [ J_C \, J_X ]\), the information matrix

$$\begin{aligned} J^\top J = \left[ \!\!\begin{array}{cc} U &{} W \\ W^\top &{} V\end{array}\!\!\right] \end{aligned}$$
(7)

will be sparse and block diagonal [11]. See the information matrix for cube dataset in Fig. 1(a).

Fig. 1.
figure 1

(a) Structure of the information matrix \(J^\top J\) for the cube dataset. (b) The structure of the matrices Y, \(Z^+\) and \(Y^\top \) for the cube dataset, where sparse multiplication uses the red and green values, our approach uses the green values only, and the black boxes highlight the final covariances of camera parameters (Color figure online)
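To make the block structure of Eq. (7) concrete, the following sketch builds the information matrix of a toy problem in which every camera observes every point. All names and sizes here are our own illustrative choices (random Jacobian blocks, not a real projection function); the sketch only demonstrates that the point-point block V is \(3\times 3\) block diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, m = 2, 2, 3  # toy sizes: params per camera, #cameras, #points
obs = [(i, j) for i in range(n) for j in range(m)]  # every camera sees every point

# Jacobian J = [J_C | J_X]: each 2-row observation touches one camera and one point
J = np.zeros((2 * len(obs), p * n + 3 * m))
for r, (i, j) in enumerate(obs):
    J[2*r:2*r+2, p*i:p*(i+1)] = rng.standard_normal((2, p))          # dr/dC_i
    J[2*r:2*r+2, p*n+3*j:p*n+3*(j+1)] = rng.standard_normal((2, 3))  # dr/dX_j

H = J.T @ J                  # information matrix of Eq. (7)
U = H[:p*n, :p*n]            # camera-camera block
W = H[:p*n, p*n:]            # camera-point block
V = H[p*n:, p*n:]            # point-point block

# V is nonzero only in its 3x3 diagonal blocks, so it can be inverted block-wise
mask = np.kron(np.eye(m), np.ones((3, 3)))
assert np.allclose(V * (1 - mask), 0.0)
```

The same block pattern, at much larger scale, is what Fig. 1(a) shows for the cube dataset.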

To compute the inverse of \(J^\top J\), we introduce \(Y = -V^{-1}W^\top \). We note, first, that V is composed of \(3\times 3 \) blocks on the diagonal and its inverse can be computed separately for each block, and, second, that forming Y is fast thanks to the sparsity of V and W. The Upper triangular–Diagonal–Lower triangular (UDL) decomposition of the block matrix \(\hat{V}(\theta ) = \sigma ^2 (J^\top J)^{-1}\) leads to

$$\begin{aligned} \hat{V}(\theta ) = \sigma ^2 \left( \left[ \!\!\begin{array}{rc} I &{} -Y^\top \\ 0 &{} I\end{array}\!\!\right] \left[ \!\!\begin{array}{cc} Z &{} 0 \\ 0 &{} V\end{array}\!\!\right] \left[ \!\!\begin{array}{rc} I &{} 0 \\ -Y &{} I\end{array}\!\!\right] \right) ^{-1} \end{aligned}$$
(8)

where the matrix Z is the Schur complement of the block V of the matrix \(J^\top J\)

$$\begin{aligned} Z = U + W Y \end{aligned}$$
(9)

Notice that we are not interested in the off-diagonal blocks. All covariances of the parameters are in the blocks on the diagonal of the dense matrix \((J^\top J)^{-1}\). Hence, Lhuillier and Perriollat [20] computed the sub-matrices of interest as

$$\begin{aligned} (J^TJ)^{-1} \simeq \left[ \!\!\begin{array}{cc} Z^{-1} &{} - \\ - &{} Y Z^{-1} Y^\top + V^{-1}\end{array}\!\!\right] \end{aligned}$$
(10)

The blocks of size \(\mathbb {R}^{p \times p}\) on the diagonal of \(Z^{-1}\) are the covariances of camera parameters. The matrix Z is rank deficient, so its inverse has to be computed as the MP inverse. The blocks of size \(\mathbb {R}^{3 \times 3}\) on the diagonal of the sub-matrix \(Y Z^{-1} Y^\top + V^{-1}\) are the covariances of point parameters.

4 The Taylor Expansion Approach

We next denote the MP inverse of matrix Z as \(Z^+\). Assume general vectors x and b such that

$$\begin{aligned} Zx = b \end{aligned}$$
(11)

holds true. Our approach is to redefine x as a function where

$$\begin{aligned} Z x(0) = b \end{aligned}$$
(12)

holds true. Unfortunately, the pseudoinverse cannot be computed as \((Z^\top Z)^{-1}Z^\top \) since \(Z^\top Z\) is also rank deficient. Hence, we compute the inverse for \(x(\lambda )\) as

$$\begin{aligned} x(\lambda ) = (Z^\top Z + \lambda I)^{-1} Z^\top b \end{aligned}$$
(13)

where \(\lambda \) is close to zero. Finally, we estimate x(0) and \(Z^+\) from Taylor series of function \(x(\lambda )\) at the point 0.

We need a general formula for the k-th derivative with respect to \(\lambda \)

$$\begin{aligned} x^{k}(\lambda ) = -k (ZZ + \lambda I)^{-1}x^{k-1}(\lambda ) \end{aligned}$$
(14)

for building the Taylor series. The matrix Z is symmetric, so \(Z^\top Z = ZZ\). Using the derivatives, we can recover x(0) from \(x(\lambda )\) using the t-th order Taylor series

$$\begin{aligned} x(0) = x(\lambda ) + \sum _{k=1}^t \frac{(-1)^{k} \lambda ^k}{k!} x^{k}(\lambda ). \end{aligned}$$
(15)

Next we substitute \(B = (ZZ + \lambda I)^{-1}\), which is computed using Cholesky decomposition [10]. From Eq. 14, \(x^{k}(\lambda ) = (-1)^k \, k! \, B^k x(\lambda )\), and with \(x(\lambda ) = BZb\) the vector b can be factored out, which gives the formula for \(Z^+\) as

$$\begin{aligned} x(0) = Z^+ b \quad \quad Z^+ = \left( I + \sum _{k=1}^t \lambda ^k B^k \right) B Z \end{aligned}$$
(16)

In our experiments, the algorithm converges for \(t \in [2,5]\). Each iteration performs one symmetric matrix multiplication and one addition. If we scale the matrix ZZ by the scalar \(c = 1/mean(ZZ)\), we get \(\hat{\lambda } = c \, \lambda \) and \(\hat{B} = (c\,ZZ + \hat{\lambda } I)^{-1}\), i.e. \(B = c\hat{B}\), and

$$\begin{aligned} Z^+ = \left( I + \sum _{k=1}^t \hat{\lambda }^k \hat{B}^k \right) c \hat{B} Z \end{aligned}$$
(17)

This scaling decreases the range of values and allows computation in double precision. The numerical stability of the algorithm depends on the value of \(\hat{\lambda }\). We use \(\hat{\lambda } = trace(Z Z) \, c/10^{16}\), which is usually in \([10^{-10},10^{-8}]\). The parameter \(\hat{\lambda }\) depends on the size of the Schur complement matrix, which in turn depends on the number of camera parameters. Reconstructions which contain more than \(10^6\) cameras may require a larger parameter \(\hat{\lambda }\), i.e. above \(10^{-4}\).
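The whole procedure of this section can be sketched compactly as follows. The naming is ours, and the regularization \(\lambda \) is exposed as a parameter since the \(10^{16}\)-based heuristic above may need tuning per problem; the sketch uses plain numpy Cholesky and triangular solves rather than the optimized implementation.

```python
import numpy as np

def taylor_pinv(Z, t=5, lam=None, tol=1e-5):
    """Approximate MP inverse of a symmetric positive semi-definite Z via
    Eqs. (13)-(17): Cholesky of (c ZZ + lam I), then a t-term Taylor series."""
    ZZ = Z @ Z
    c = 1.0 / np.mean(np.abs(ZZ))        # scaling c = 1/mean(ZZ) (abs is our choice)
    if lam is None:
        lam = np.trace(ZZ) * c / 1e16    # hat-lambda heuristic from the text
    L = np.linalg.cholesky(c * ZZ + lam * np.eye(Z.shape[0]))
    solve = lambda X: np.linalg.solve(L.T, np.linalg.solve(L, X))  # X -> B_hat X
    T = c * solve(Z)                     # zeroth term: c * B_hat * Z  (= B Z)
    Zp = T.copy()
    for _ in range(t):
        T = lam * solve(T)               # next Taylor term: lam * B_hat * T_prev
        Zp += T
        if np.max(np.abs(T)) < tol:      # stopping rule used in Sect. 6
            break
    return Zp
```

On a small rank-deficient Z the result matches an SVD-based pseudo-inverse; the point of the series is that it needs only one Cholesky factorization and a few solves, never an SVD.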

5 Point Covariances

We compute the covariance matrices of point parameters from Eq. 10 using the structure of the matrix Y. The matrix Y contains non-zero values only in the blocks (triples of rows, p columns) where a 3D point is seen by a camera. We compute only the blocks \(\bar{H} = Z^+ Y^\top \) which are required for the multiplication \(Y \bar{H}\), Fig. 1(b). The speed-up in comparison with sparse multiplication is \(n\,m\)/k times.
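Extracting only the needed \(3\times 3\) blocks of \(Y Z^+ Y^\top + V^{-1}\) can be sketched as follows. The naming is ours: visibility[j] lists the cameras observing point j, and only the corresponding p-column blocks of the rows of Y are touched.

```python
import numpy as np

def point_covariances(Y, Zp, Vinv, visibility, p):
    """3x3 diagonal blocks of Y Z^+ Y^T + V^{-1} from Eq. (10), using the
    block sparsity of Y instead of forming the full products."""
    m = len(visibility)
    covs = np.empty((m, 3, 3))
    for j, cams in enumerate(visibility):
        cols = np.concatenate([np.arange(p * i, p * (i + 1)) for i in cams])
        Yj = Y[3*j:3*j+3, cols]          # nonzero part of the rows of point j
        Zj = Zp[np.ix_(cols, cols)]      # matching block of Z^+
        covs[j] = Yj @ Zj @ Yj.T + Vinv[3*j:3*j+3, 3*j:3*j+3]
    return covs
```

Because the rows of Y belonging to point j are zero outside its observing cameras, these small products reproduce the diagonal blocks of the full multiplication exactly.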

Table 1. The table summarizes the number of cameras \(N_{Cams}\), the number of points \(N_{Pts}\), the number of observations \(N_{Obs}\) and the side length \(N_{Schur}\) of the square Schur complement matrix Z for the compared datasets

6 Experimental Evaluation

6.1 Datasets

We experimented with realistic synthetic scenes as well as a number of medium to large-scale Internet datasets: Flat, Tower of London, Notre Dame, Trench, Seychelles. The Cube dataset was used for visualization of the matrices. Flat is the only synthetic dataset. Tower of London and Seychelles were reconstructed with COLMAP [28], Notre Dame was reconstructed by Bundler [29], and the remaining datasets by our internal reconstruction pipeline YaSfM. The images for Tower of London and Notre Dame were collected in the projects [29, 32] from Flickr using geotags. The remaining datasets are available on the web page http://cmp.felk.cvut.cz/~policmic. The parameters of the datasets are summarized in Table 1.

6.2 Algorithms

We compared the performance, the memory requirements and the sensitivity to input parameters of three algorithms: qr-iteration, divide-and-conquer and taylor-expansion. The state-of-the-art method uses SVD [20]. The algorithms qr-iteration and divide-and-conquer are described in [7]. We do not compare our algorithm with the recent paper [24] because it performs a Cholesky decomposition of the Schur complement matrix. However, the Schur complement matrix is numerically rank deficient for medium and large reconstructions (i.e., Cholesky decomposition cannot be used). This subsection has the following structure: we describe how the Schur complement matrix is formed and how its MP inverse is computed, and lastly we focus on the precision of the Taylor expansion approach. Let us pinpoint the main figures and tables: Fig. 2 shows the structure of the C++ code and the labels used for the comparison of speeds. The comparisons of the speeds are in Tables 2 and 3. Figure 3 visualizes the uncertainty ellipsoids for different values of the parameters \(\gamma , \hat{\lambda }\).

Fig. 2.
figure 2

The diagram of algorithm implementation

Forming Schur Complement. Given the Jacobian \(J(\theta ) = [J_C(\theta ) \, J_X(\theta )]\) from Ceres [2], we form the information matrix of Eq. 7 and the Schur complement of Eq. 9 using sparse matrix multiplication for all algorithms. For each experiment, \(\theta = [\theta _C \, \theta _X]\), where \(\theta _C = [\theta _{C_1}, \dots , \theta _{C_n}]\) and \(\theta _X = [\theta _{X_1}, \dots , \theta _{X_m}]\). One camera \(\theta _{C_i} \in \mathbb {R}^{9}\) is represented by 3 angle-axis rotation, 3 camera center, 1 focal length and 2 radial distortion parameters, and a point \(\theta _{X_i} \in \mathbb {R}^{3}\) by its 3D coordinates.

Solving Schur Complement Inverse. The first two methods, qr-iteration and divide-and-conquer, are SVD algorithms described in [7]. Both algorithms decompose the Schur complement into \(Z = \bar{U} \bar{S} \bar{V}^{\top }\) and invert the diagonal values \(\bar{S}^{\prime }_{i,i} = 1\)/\(\bar{S}_{i,i}\) for which \(\bar{S}_{i,i}>\gamma \) holds, where \(i \in \{1,2,\dots ,N_{Schur}\}\). The remaining values \(\bar{S}_{i,i} \le \gamma \) on the diagonal are set to zero. Note that no exact estimate of \(\gamma \) exists. The parameter \(\gamma \), i.e. the number of inverted values, is crucial for the output values. Figure 3 visualizes the influence of \(\gamma \) (SVD algorithms) and \(\hat{\lambda }\) (Taylor expansion algorithm) on the uncertainty ellipsoids, where each ellipsoid shows the position uncertainty of a 3D point or a camera center. The Taylor expansion algorithm computes the inverse \((c\,Z Z + \hat{\lambda } I)^{-1}\) using Cholesky factorization. The parameter \(\hat{\lambda } = trace(Z Z) \, c{/}10^{16}\) appears to be a reasonable trade-off between the regularization of \(c\,Z Z\) and the distance from zero which we try to approximate. The algorithm performs several iterations. The i-th iteration computes the i-th term \(T_i\) of the Taylor series. We keep computing the next \(T_i\) as long as \(l_{\infty }(T_i) > 10^{-5}\). The usual number of iterations is 2 to 5.
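The truncated-SVD baseline can be stated in a few lines. This is a plain dense sketch with our naming, not the MAGMA-accelerated implementation used in the experiments:

```python
import numpy as np

def svd_pinv(Z, gamma):
    """MP inverse via SVD, as in the qr-iteration / divide-and-conquer baselines:
    invert singular values above gamma, zero out the rest."""
    Ubar, S, Vbar_t = np.linalg.svd(Z)
    S_inv = np.array([1.0 / s if s > gamma else 0.0 for s in S])
    return (Vbar_t.T * S_inv) @ Ubar.T   # Vbar diag(S_inv) Ubar^T
```

The choice of \(\gamma \) directly decides how many singular values are inverted, which is exactly the sensitivity that Fig. 3 illustrates.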

Table 2. Additional processing times, especially the time \(T_{V^{-1}}\) required for inversion of the matrix V, the time \(T_{Z}\) required for forming the Schur complement matrix, values of parameter \(\hat{\lambda }\), the time \(T_{X\_sparse}\) required for sparse multiplication and the time \(T_{X\_structure}\) for dense multiplication with using the structure of matrix Y
Fig. 3.
figure 3

Parameter sensitivity of the SVD based algorithms (parameter \(\gamma \)) and Taylor expansion algorithm (parameter \(\hat{\lambda }\)) for flat dataset

Assuming that all algorithms store Z and \(Z^+\) in the same format, the difference between their memory usages depends on how they work with the matrix Z. The Taylor expansion algorithm requires one additional symmetric matrix of size \(\mathbb {R}^{p\,n\times p\,n}\). The two remaining algorithms require the matrices \(\bar{U},\bar{V} \in \mathbb {R}^{p\,n\times p\,n}\) and the vector \(diag(\bar{S}) \in \mathbb {R}^{p\,n}\). Further, qr-iteration requires an additional \((p\,n)^2 + 67(p\,n)\) values and divide-and-conquer an additional \(4(p\,n)^2 + 7(p\,n)\) values for the SVD.

Precision of the MP Inverse Approximation. We computed the ground truth using SVD with 100 significant digits and compared it with the Taylor expansion and SVD algorithms computed in double precision (\(\approx \)15 significant digits). On the Cube dataset, the mean difference of camera parameters (angle axis and camera position) between the ground truth covariance matrix and the covariance matrix computed by the Taylor expansion approach with \(\lambda = 10^{-9}\) was \(3.9 \times 10^{-3}\), while the difference for SVD with \(\gamma = 10^{-7}\) was \(1.4 \times 10^{-4}\). We are not able to test the precision of our approximation on larger datasets because of the computational complexity. The covariance matrices computed by the Taylor expansion approach are nevertheless useful because we get large ellipsoids for the most unconstrained cameras and points and vice versa, Fig. 3. So, we are able to determine which cameras are more uncertain and which are less.

All algorithms were implemented as part of a single C++ code base. They share the inversion of the matrix V and the forming of the Schur complement matrix. We use the Intel MKL 11.3.1 [9] implementation of BLAS and LAPACK for dense linear algebra, and Eigen 3.3 [14] for sparse linear algebra and matrix representation. BLAS and LAPACK functions are called via MAGMA 2.2.0 [31]. All experiments were performed on a single computer with one 2.6 GHz Intel Core i7-6700HQ with 32 GB RAM running a 64-bit Windows 10 operating system.

Table 3. Memory and time requirements. The columns \(M_{Z^+}\) and \(T_{Z^+}\) represent the memory and time requirements for computing the MP inverse of the Schur complement. \(\sum T\) expresses the complete run-time of the algorithm. The parameter \(\gamma \) was equal to \(10^{-10}\) in all evaluations.

7 Conclusion

The state-of-the-art method for computing covariances is based on SVD [20], which is computationally challenging for a large number of cameras and points. We propose a new method which works for scenes with \(10^3\) cameras and \(10^6\) points in a reasonable time of 60 s on a single computer. The output of the algorithm changes smoothly with the input parameter, which is beneficial for large reconstructions. We scale the Schur complement to sufficiently reduce numerical errors when using double precision. We hope that the more sophisticated pre-conditioners used in Bundle Adjustment solvers [3, 34] can scale the values to a range where single precision will be sufficient. Such a representation would allow GPU usage and an order of magnitude speedup.