Coarse-to-fine Planar Regularization for Dense Monocular Depth Estimation

Liwicki, Stephan; Zach, Christopher; Miksik, Ondrej; Torr, Philip H. S.

doi:10.1007/978-3-319-46475-6_29

Stephan Liwicki¹⁷,
Christopher Zach¹⁷,
Ondrej Miksik¹⁸ &
…
Philip H. S. Torr¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 9906))

Included in the following conference series:

European Conference on Computer Vision

20k Accesses

Abstract

Simultaneous localization and mapping (SLAM) using the whole image data is an appealing framework to address shortcoming of sparse feature-based methods – in particular frequent failures in textureless environments. Hence, direct methods bypassing the need of feature extraction and matching became recently popular. Many of these methods operate by alternating between pose estimation and computing (semi-)dense depth maps, and are therefore not fully exploiting the advantages of joint optimization with respect to depth and pose. In this work, we propose a framework for monocular SLAM, and its local model in particular, which optimizes simultaneously over depth and pose. In addition to a planarity enforcing smoothness regularizer for the depth we also constrain the complexity of depth map updates, which provides a natural way to avoid poor local minima and reduces unknowns in the optimization. Starting from a holistic objective we develop a method suitable for online and real-time monocular SLAM. We evaluate our method quantitatively in pose and depth on the TUM dataset, and qualitatively on our own video sequences.

You have full access to this open access chapter, Download conference paper PDF

Incorporating scene priors to dense monocular mapping

Article 29 July 2015

Handling pure camera rotation in semi-dense monocular SLAM

Article 01 September 2017

Semi-Direct SLAM with Manhattan for Indoor Low-Texture Environment

Keywords

1 Introduction

Simultaneous localization and mapping (SLAM), also known as online structure from motion, aims to produce trajectory estimations and a 3D reconstruction of the environment in real-time. In modern technology, its application ranges from autonomous driving, navigation and robotics to interactive learning, gaming and enhanced reality [1–7]. Typically, SLAM comprises two key components: (1) a local model, which generates fast initial odometry measurements (which often includes a local 3D reconstruction – e.g. a depth map – as byproduct), and (2) a global model, which performs loop closures and pose refinement via large scale sub-real-time bundle adjustment. In our work, we focus on the former, and propose a new strategy for local monocular odometry and depth map estimation.

Estimating the 3D position of tracked landmarks is a key ingredient in any SLAM system, since it directly allows for the poses to be computed w.r.t. a common coordinate frame. Historically, visual landmarks are induced by sparse keypoints, but there is a recent trend to utilize a dense (or semi-dense) set of points (leading to a dense or semi-dense depth map representation) [8, 9].

Another trend is the inclusion of different sensing modalities for depth estimation. Often, methods exploit (a combination of) alternative sensors, such as infrared, lidar and stereo camera setups, which natively provide fairly accurate depth data [10–13]. Such algorithms are quite advanced and are often employed even in consumer technology where hardware is controllable. Visual SLAM with only monocular camera streams is less common and still challenging in literature [8, 9, 14–21]. Nonetheless, the monocular setup is very suitable for (1) long range estimations, where stereo baselines are negligible, (2) light weight mobile and wearable devices aiming for a minimal amount of sensors to reduce weight and power consumption, and (3) legacy video footage recoded by a single camera.

Classical approaches for monocular visual SLAM are based on keypoint tracking and mapping [15–17], which produces a feature-based sparse depth hypothesis. A number of methods have since been proposed which essentially alternate between tracking (and pose computation) and dense depth map estimation: Most prominently, [8] presents dense tracking and mapping (DTAM) which generates a dense depth map on GPU. Similarly, [18–20] provide dense depth maps, but like [8] also rely heavily on GPU acceleration for real-time performance. In contrast to these methods large-scale direct SLAM (LSD-SLAM) [9] focusses the computation budget on a semi-dense subset of pixels and has therefore attractive running-times, even when run on CPU or mobile devices. As a direct method it computes the odometry measurements directly from image data without an intermediate representation such as feature tracks. Depth is then computed in a separate thread with small time delay. Note that all these methods employ an alternation strategy: odometry is computed with the depth map held fixed, and the depth map is updated with fixed pose estimates. In contrast, we propose joined estimation of depth and pose within a single optimization framework, which runs twice as fast as LSD-SLAM to find structure and motion. In particular, we introduce minimal additional computational cost compared to that of only the tracking thread of LSD-SLAM.

1.1 Contributions

In this work, we present a local SLAM front-end which estimates pose and depth truly simultaneously and in real-time (Fig. 1). We revisit traditional setups, and propose inverse depth estimation with a coarse-to-fine planar regularizer that gradually increases the complexity of the algorithm’s depth perception. Note, many systems for stereo vision or depth sensors incorporate local or global planar regularization [12, 13, 22–24]. Similarly, we employ global planar constraints into our monocular setup, and enforce local smoothness by representing each pixel as lying on a plane that is similar to its neighbours’. Furthermore, similarly to many algorithms in stereo (e.g. [10, 22]), we reduce depth complexity via discretization, in our case through planar splitting techniques which (in the spirit of graphical methods) create labels “on demand”. In summary,

1.
we formulate a global energy for planar regularized inverse depth that is optimized iteratively at each frame,
2.
we revisit depth and pose optimization normally considered separately, and introduce a coarse-to-fine strategy that refines both truly simultaneously,
3.
we establish our method as semi-dense, and find pose and depth twice as fast as LSD-SLAM, by adding minimal cost to LSD-SLAM’s tracking thread,
4.
we evaluate pose and depth quantitatively on the TUM dataset.

Closely related to our work is [25], where depth and pose is optimized simultaneously given the optical flow of two consecutive images. This approach is based on image pairs. Our method considers video input and incrementally improves its belief. In [26, 27] planarity is proposed in conjunction with scene priors, previously learned from data, and [20] presents a hole-filling strategy for semi-dense monocular SLAM. While these methods are real-time, they rely on keypoints at image corners or gradients, which are later enriched with a planar refinement. Importantly however, such methods fail in featureless environments. Finally, we emphasis DTAM [8] performs batch operations on a set of images taken from a narrow field of view, and henceforth introduces a fixed lag before depth is perceived by the system. As this is often unacceptable for robotics setups, our method updates depth incrementally after each frame.

2 Proposed Energy for Monocular Depth Estimation

We formulate our energy function for poses and depth w.r.t. the photometric error over time. Similar to LSD-SLAM, we employ a keyframe-to-frame comparison to estimate camera displacement and each pixels’ depth in the reference image. Let us denote the keyframe as $I$ and its immediately succeeding images as $(I_t)_{t=1}^T$. The tuple of valid pixel locations on the keyframe’s plane is represented by $\mathcal {X}= (\mathbf {x}_i)_{i=1}^{|\mathcal {X}|}$ in normalized homogeneous coordinates (i.e. $z_i = 1$), and their corresponding inverse depth values are expressed by $\mathcal {D}= (d_i)_{i=1}^{|\mathcal {X}|}$. Since we aim to model planar surfaces, we use an over-parametrization given by $\mathcal {S}= (\mathbf {s}_i^\mathtt {T})_{i=1}^{|\mathcal {X}|} \cong \mathbb {R}^{3|\mathcal {X}|}$, where $\mathbf {s}_i = (u_i, v_i, w_i)^\mathtt {T}$ are planes with disparity gradients $u_i$, $v_i$, and inverse depth at 0, $w_i$. Hence, the relation $d_i = \mathbf {s}_i^\mathtt {T}\mathbf {x}_i$ holds.

Tuple $\varXi = (\xi _t)_{t=1}^T$ denotes the changes in camera pose, where $\xi _t \in SE(3)$ is composed of rotation $\mathbf {R}_t \in SO(3)\subset \mathbb {R}^{3\times 3}$ and translation $\mathbf {t}_t \in \mathbb {R}^{3}$ between the keyframe $I$ and frame $I_t$. In principle, the complete cost function should incorporate all available images associated with the current keyframe and optimize over the depth and all poses jointly,

$$\begin{aligned} \hat{E}_{Total}\left( \mathcal {S}, \varXi \right) = \sum _{t = 1}^T E_{Match}^{(t)}(\mathcal {S},\xi _t) + E_{Smooth}(\mathcal {S}). \end{aligned}$$

(1)

Here $E_{Match}^{(t)}$ and $E_{Smooth}$ are energy terms related to image-based matching costs and spatial smoothing assumptions, respectively. Before we describe these terms in more detail in subsequent sections, we modify $\hat{E}_{Total}$ to be more suitable for an incremental online approach. This is advisable since, the objective $\hat{E}_{Total}$ involves the complete history of all frames $I_t$ mapped to the current keyframe $I$. Intuitively the optimization of the poses $(\xi _t)_{t=1}^{T-1}$ is no longer relevant at time T, as only the current pose $\xi _T$ and $\mathcal {S}$ is required. Analytically, we introduce

$$\begin{aligned} E_{History}^{(T)}\left( \mathcal {S}\right) {:=} \min _{( \xi _t)_{t = 1}^{T-1} } \sum _{t = 1}^{T-1} E_{Match}^{(t)}(\mathcal {S},\xi _t) \end{aligned}$$

(2)

where $(\xi _t)_{t=1}^{T-1}$ is the tuple of poses, minimized in previous frames. By splitting the first term in (1), the energy becomes

$$\begin{aligned} \hat{E}_{Total}\left( \mathcal {S},\varXi \right)&= E_{History}^{(T)}\left( \mathcal {S}\right) + E_{Match}^{(T)}(\mathcal {S},\xi _T) + E_{Smooth}(\mathcal {S}). \end{aligned}$$

(3)

Now we replace $E_{History}^{(T)}$ with its second order expansion around

$$\begin{aligned} \left( \mathcal {S}^*, \xi _1^*, \cdots , \xi _{T-1}^* \right) = {\mathop {\hbox {argmin}}\limits _{\mathcal {S},( \xi _t )_{t = 1}^{T-1}}} \sum _{t = 1}^{T-1} E_{Match}^{(t)}(\mathcal {S},\xi _t), \end{aligned}$$

(4)

and thus we obtain an approximation of $E_{History}^{(T)}(\mathcal {S})$, denoted $E_{Temporal}^{(T)}(\mathcal {S})$:

$$\begin{aligned} E_{Temporal}^{(T)}\left( \mathcal {S}\right)&{:=} E_{History}^{(T)}\left( \mathcal {S}^* \right) + \left( \nabla _{\mathcal {S}} E_{History}^{(T)}\left( \mathcal {S}^* \right) \right) ^\mathtt {T}\left( \mathcal {S}- \mathcal {S}^* \right) \nonumber \\&+ \frac{1}{2} (\mathcal {S}- \mathcal {S}^*)^\mathtt {T}\left( \nabla ^2_{\mathcal {S}} E_{History}^{(T)}\left( \mathcal {S}^* \right) \right) (\mathcal {S}- \mathcal {S}^*) \nonumber \\&= E_{History}^{(T)}\left( \mathcal {S}^* \right) + \frac{1}{2} (\mathcal {S}- \mathcal {S}^*)^\mathtt {T}\left( \nabla ^2_{\mathcal {S}} E_{History}^{(T)}\left( \mathcal {S}^* \right) \right) (\mathcal {S}- \mathcal {S}^*) \end{aligned}$$

(5)

As $\mathcal {S}^*$ is a local minimizer of $E_{History}^{(T)}$, $\nabla _{\mathcal {S}} E_{History}^{(T)}(\mathcal {S}^*) = 0$. Furthermore, as our choice of terms leads to a nonlinear least-squares formulation, $\nabla ^2_{\mathcal {S}} E_{History}^{(T)}(\mathcal {S}^*)$ is computed using the Gauss-Newton approximation. Finally, since $E_{History}^{(T)}$ jointly optimizes the inverse depths (in terms of its over-parametrization $\mathcal {S}$) and (internally) the poses, but $E_{Temporal}^{(T)}$ is solely a function of $\mathcal {S}$, we employ the Schur complement to factor out the poses $(\xi _t)_{t=1}^{T-1}$. However, as the poses link the entire depth map, the Schur complement matrix will be dense. We obtain a tractable approximation by using its block-diagonal consisting of $3 \times 3$ blocks (corresponding to $\mathbf {s}_i = (u_i, v_i, w_i)^\mathtt {T}$).^{Footnote 1} The resulting objective at time T is therefore

$$\begin{aligned} E_{Total}^{(T)}\left( \mathcal {S}, \xi _T \right)&= E_{Temporal}^{(T)}\left( \mathcal {S}\right) + E_{Match}^{(T)}(\mathcal {S},\xi _T) + E_{Smooth}(\mathcal {S}). \end{aligned}$$

(6)

There is a clear connection between $E_{Total}^{(T)}$, extended Kalman filtering and maximum likelihood estimation. If $E_{History}^{(T)}$ is interpreted as log-likelihood, then $\left( \mathcal {S}^*, (\xi _t^*)_{t=1}^{T-1} \right) $ is an asymptotically normal maximum likelihood estimate with the Hessian as (approximate) inverse covariance (i.e. precision) matrix. The Schur complement to factor out the poses (in the energy-minimization perspective) corresponds to marginalizing over the poses according to their uncertainty. $E_{Total}^{(T)}$ can be read as probabilistic fusion of past and current observation, but this correspondence is limited, since we are searching for MAP estimates and not posteriors. In the following section we discuss the remaining terms in $E_{Total}^{(T)}$.

2.1 Photometric Energy

The matching cost $E_{Match}^{(T)}(\mathcal {S},\xi _T)$ is derived from an appearance (e.g. brightness) consistency assumption commonly employed in literature, e.g. [28]. Let us define the monocular warping function $W(\mathbf {x}_i,d_i,\xi _t)$ which maps point $\mathbf {x}_i$ in the keyframe to its representation $\mathbf {x}'_i$ in frame t by

$$\begin{aligned} \mathbf {x}'_i = W(\mathbf {x}_i,d_i,\xi _t) = \hom \left( \mathbf{R}_t^\mathtt {T}\left( \mathbf {x}_i - \mathbf{t}_td_i\right) \right) , \end{aligned}$$

(7)

under camera rotation $\mathbf{R}_t$ and translation $\mathbf{t}_t$, where $\hom (\cdot )$ normalizes the homogeneous coordinate. Now we express the matching energy as

$$\begin{aligned} E_{Match}^{(T)}(\mathcal {S},\xi _T) = \sum _{\mathbf {x}_i\in \mathcal {X}} \Vert I(\mathbf {x}_i) - I_T\left( W(\mathbf {x}_i,d_i,\xi _T)\right) \Vert _{\tau _{Match}}, \end{aligned}$$

(8)

where $I(\mathbf {x})$ and $I_T(\mathbf {x})$ are descriptors extracted around pixel $\mathbf {x}$ from keyframe and current frame respectively. We use image intensity values (i.e. a descriptor at pixel only), so that the disparity gradients do not need to be taken into account during warping. Robustness is achieved by employing a smooth truncated quadratic error [29] (visualized in Fig. 2) in the implementation of $\Vert \cdot \Vert _{\tau _{Match}}$.

2.2 Local Spatial Plane Regularizer

The smoothness constraint $E_{Smooth}(\mathcal {S})$ is based on a planar assumption often found in stereo setups [13, 23, 24], which we adapt in this work to support monocular video data. Surface $\mathbf {s}_i$ induces a linear extrapolation of inverse depth via $\hat{d}_i(\mathbf {x}) = \mathbf {s}_i^\mathtt {T}\mathbf {x}$. Plugging this into the homographic transformation yields

$$\begin{aligned} W(\mathbf {x},\hat{d}_i(\mathbf {x}),\xi _t) = \hom \left( \mathbf{R}_t^\mathtt {T}\left( \mathbf {x}- \mathbf{t}_t \mathbf {s}_i^\mathtt {T}\mathbf {x}\right) \right) = \hom \left( \mathbf{R}_t^\mathtt {T}\left( \mathbf {x}_i - \mathbf{t}_t\frac{\mathbf {n}_i^\mathtt {T}}{r_i}\mathbf {x}_i\right) \right) , \end{aligned}$$

(9)

where $\mathbf {n}_i$ is the plane normal and $r_i$ is the point-plane distance to the camera center. Hence we can identify $\mathbf {s}_i \propto \mathbf {n}_i$ and therefore smoothing planes in inverse depth parametrization also smoothes the alignment in 3D space (Fig. 3).

With $\lambda _{Smooth}$ as balancing term, we define the spatial smoothness energy as

$$\begin{aligned} E_{Smooth}(\mathcal {S})&= \lambda _{Smooth}\sum _{\mathbf {x}_i \in \mathcal {X}} \sum _{\mathbf {x}_j \in \mathcal {N}_i} \Vert \mathbf {s}_i^\mathtt {T}\mathbf {x}_i - \mathbf {s}_j^\mathtt {T}\mathbf {x}_i \Vert _{\tau _{Smooth}} \nonumber \\&= \lambda _{Smooth}\sum _{\mathbf {x}_i \in \mathcal {X}} \sum _{\mathbf {x}_j \in \mathcal {N}_i} \Vert d_i - (d_j + \mathbf {s}_j^\mathtt {T}(\mathbf {x}_i - \mathbf {x}_j)) \Vert _{\tau _{Smooth}}, \end{aligned}$$

(10)

where $\mathcal {N}_i$ denotes the 4-neighborhood of $\mathbf {x}_i$. Thus, $E_{Smooth}$ penalizes deviations between linearly extrapolated depth at $\mathbf {x}_j$ and its actual depth. Although some methods try to introduce robustness by appearance-based edge detection, e.g. [30], we again simply employ the smooth version of the truncated quadratic for $\Vert \cdot \Vert _{\tau _{Smooth}}$. Hence, our method is inherently robust without arbitrary color constraints. Unfortunately, (10) is not scale invariant, and scaling the baseline $\mathbf{t}_t$ scales the contribution of $E_{Smooth}$. This is a potential issue only for the first pair of frames $(I, I_1)$, since subsequent frames have their scale determined by preceding frames. It is common usage to fix the initial scale by setting $\Vert \mathbf{t}_1\Vert = 1$, but this is a suboptimal choice, since the same 3D scene geometry is regularized differently depending on the initial baseline. A more sensible choice is to fix e.g. the average depth (or inverse depth) to make $E_{Smooth}$ invariant w.r.t. baselines. For our reconstruction we constrain the average inverse depth to one.

3 Optimization Strategy

In this section we detail our optimization strategy for the energy in (6). We assume small changes between consecutive frames, as video data is used. Therefore we use a similar approach as in standard differential tracking and optical flow by locally linearizing the image intensities $I_T$ in the matching term $E_{Match}^{(T)}$. The pseudocode of the proposed method is given in Algorithm 1. The underlying idea is to optimize the energy incrementally with increased complexity using the scale-space pyramid representation and our restricted depth map update which we detail below. The aim of doing this is two-fold: Firstly it substantially reduces the number of unknowns in the main objective and therefore makes the optimization much more efficient, and secondly it provides an additional level of regularization within the algorithm and combines naturally with a scale-space framework to avoid poor local minima. We discuss this constrained depth map update in the following, and then introduce our optimization which exploits this update to allow for truly simultaneous pose and depth estimation. Finally we present a strategy for realtime performance on CPU.

3.1 Constrained Depth Map Updates

If we consider the current frame at time T and optimize $E_{Total}$ (recall (6)) w.r.t. $\xi _T$ and $\mathcal {S}$, then our algorithmic design choice is to restrict the update $\mathcal {S}- \mathcal {S}^*$ to have low complexity in the following sense:

$$\begin{aligned} \mathbf {s}_i = \mathbf {s}_i^* + \sum _{c= 1}^{C} \mathbb {I}_c(\mathbf {x}_i) \varDelta _c, \end{aligned}$$

(11)

where $\mathbb {I}_c: \mathcal {X}\rightarrow \{+1, -1\}$ is an indicator function, splitting the set of pixels into positive or negative parts. This means that a depth update at each pixel $\mathbf {x}_i$ is constrained to take one of $2^C$ values. With increasing cardinality C, the complexity of the depth map increases.

The optimization is performed greedily by adding a single component $\varDelta _c$ at a time. Notice, if $\xi _T$ and $\mathcal {S}$ were to be optimized simultaneously, an equation with $6 + 3|\mathcal {X}|$ unknowns had to be solved inside a nonlinear least squares solver (i.e. 6 parameters for an element in the lie algebra $\mathfrak {se}(3)$ and 3 for the over-parameterized depth values at each pixel). By using the constrained shape for the updates and by using a greedy framework, we reduce the optimization to $6 + 3$ variables at a time (i.e. $\mathfrak {se}(3)$ and the 3 vector $\varDelta _c$), improving the execution cost and robustness significantly.

Our methodology can be seen in analogy to multi-resolution pyramids which spatially increase the quantization of the image plane, but in addition to spatial resolution we also incrementally increase the quantization level of inverse depths. Specifically, we exploit the representation of a pixel’s plane $\mathbf {s}_i$ as summed components $\varDelta _c$, given in (11). These values correspond to the inverse depth resolution which increases when new components are introduced.

This coarse-to-fine depth estimation is inspired by the human vision [31], which perceives depth in relation to other areas in the scene, rather than absolute values. Specifically, we perform the introduction of new distance values in a relational setting, splitting the data points based on their desired depth value direction. The advantages of this approach are three-fold: (1) we introduce depth by enforcing a regularization across all pixels, (2) our splitting function separates the image data into multiple planes, which naturally encode the image hierarchically from coarse to fine, and (3) the incremental introduction of depth enables fast computation whilst optimizing transformation and depth simultaneously. Moreover, we emphasize while our approach is greedy, it is not final since corrections can be made through further splitting.

Our design choice to regularize the updates of $\mathcal {S}$ requires to determine the binary function $\mathbb {I}_c: \mathcal {X}\rightarrow \{+1, -1\}$. Essentially, if $\varDelta _c$ is given, $\mathbb {I}_c(\mathbf {x}_i)$ corresponds to the sign of the correlation $\varDelta _c^\mathtt {T}\nabla _{\mathbf {s}_i} E_{Total}$ between the depth update direction $\varDelta _c$ and the gradient of the objective with respect to $\mathbf {s}_i$. Since $\varDelta _c$ is subject to subsequent optimization, we determine an initial estimate $\tilde{\varDelta }_c$ as follows: given the current gradients $\nabla _{\mathbf {s}_i} E_{Total}$ (which we abbreviate to $\nabla _{\mathbf {s}_i}$), it is sensible to obtain $\tilde{\varDelta }_c$ as principal direction of the set $\{\nabla _{\mathbf {s}_i} \}_{i = 1}^{|\mathcal {X}|}$, due to the symmetric range in $\mathbb {I}_c$:

$$\begin{aligned} \tilde{\varDelta }_c\leftarrow \mathop {\hbox {argmax}}\limits _{u : \Vert u \Vert = 1} \left\{ u^\mathtt {T}\sum \nolimits _{\mathbf {x}_i \in \mathcal {X}} \nabla _{\mathbf {s}_i} \nabla _{\mathbf {s}_i}^\mathtt {T}\, u \right\} . \end{aligned}$$

(12)

This can be obtained by eigenvalue or singular value decomposition of the $3 \times 3$ scatter matrix $\sum _{\mathbf {x}_i \in \mathcal {X}} \nabla _{\mathbf {s}_i} \nabla _{\mathbf {s}_i}^\mathtt {T}$. Finally, the indicator function is given by

$$\begin{aligned} \mathbb {I}_c(\mathbf {x}_i) = {\left\{ \begin{array}{ll} 1 &{} \text {if}\; \tilde{\varDelta }_c^\mathtt {T}\nabla _{\mathbf {s}_i} \ge 0 \\ -1 &{} \text {otherwise} \end{array}\right. } = \mathrm{sgn}\left( \tilde{\varDelta }_c^\mathtt {T}\nabla _{\mathbf {s}_i} \right) . \end{aligned}$$

(13)

3.2 Simultaneous Pose and Depth Estimation

Let us assume we have an initial estimate for $\xi _T$ and $\mathcal {S}$ available (e.g. $\xi _T \leftarrow \xi _{T-1}$ and $\mathcal {S}\leftarrow \mathcal {S}^*$, which is equivalent to $C = 0$ in (11)). Since our objective is an instance of nonlinear least-squares problems we utilize the Levenberg-Marquardt (LM) algorithm for robust and fast second order minimization. The robust kernels $\Vert \cdot \Vert _{\tau _{Match}}$ and $\Vert \cdot \Vert _{\tau _{Smooth}}$ are handled by an iteratively reweighed least square (IRLS) strategy. Potentially enlarging the convergence basin via a lifted representation of the robust kernel [32] is a topic for future work.

As outlined in Sect. 3.1 the complexity of depth map updates is increased greedily, which means that new components $\varDelta _c$ are successively introduced. We start with $C=0$ and iteratively increase C by adding new components. After introduction of a new component $\varDelta _c$ (and having an estimate for $\mathbb {I}_c$), minimizing $E_{Total}$ with respect to $\varDelta _c$ and $\xi _T$ amounts to solving

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{\xi _T,\varDelta _c}}&\Big \{ \sum _{\mathbf {x}_i\in \mathcal {X}} \Vert I(\mathbf {x}_i) - I_T\left( W(\mathbf {x}_i,\left( \mathbf {s}_i+\mathbb {I}_c(\mathbf {x}_i)\varDelta _c\right) ^\mathtt {T}\mathbf {x}_i,\xi _T)\right) \Vert _{\tau _{Match}} \nonumber \\ +&\lambda _{Smooth}\sum _{\mathbf {x}_i \in \mathcal {X}} \sum _{\mathbf {x}_j \in \mathcal {N}_i} \Vert \left( \mathbf {s}_i+\mathbb {I}_c(\mathbf {x}_i)\varDelta _c\right) ^\mathtt {T}\mathbf {x}_i - \left( \mathbf {s}_j+\mathbb {I}_c(\mathbf {x}_j)\varDelta _c\right) ^\mathtt {T}\mathbf {x}_i \Vert _{\tau _{Smooth}} \nonumber \\ +&\sum _{\mathbf {x}_i \in \mathcal {X}} \Vert \mathbf {s}_i^* - \left( \mathbf {s}_i+\mathbb {I}_c(\mathbf {x}_i)\varDelta _c\right) \Vert _{\varLambda _i} \Big \} \end{aligned}$$

(14)

(via LM), followed by the update $\mathbf {s}_i \leftarrow \mathbf {s}_i + \mathbb {I}_c(\mathbf {x}_i) \varDelta _c$. We emphasize, as $\varDelta _c$ is shared between all pixels, this problem is unlikely to be rank deficient. Further components $\varDelta _c$ are introduced as long as $E_{Total}$ is reduced sufficiently (i.e. an improvement larger than $\epsilon _{Complex}$). Notice, while our algorithm iteratively introduces new components $\varDelta _c$, it optimizes pose and depth simultaneously. Analogous to the resolution-based scale-space pyramid, the indicator function acts as surrogate for increased resolution in depth.

For the first frame $I_1$ matched with the keyframe $I$ we need to enforce that the average inverse depth is 1 (recall Sect. 2.2), which implies that

$$\begin{aligned} \sum _{\mathbf {x}_i} \left( \mathbf {s}_i + \mathbb {I}_c(\mathbf {x}_i) \varDelta _c\right) ^\mathtt {T}\mathbf {x}_i = \sum _{\mathbf {x}_i} \left( d_i + \mathbb {I}_c(\mathbf {x}_i) \varDelta _c^\mathtt {T}\mathbf {x}_i \right) = 1 \end{aligned}$$

(15)

must hold. If $d_i$ already satisfies $\sum _{\mathbf {x}_i} d_i = 1$, then the above reduces to

$$\begin{aligned} \sum _{\mathbf {x}_i} \mathbb {I}_c(\mathbf {x}_i) \mathbf {x}_i^\mathtt {T}\varDelta _c= 0. \end{aligned}$$

(16)

We chose a projected gradient approach by projecting the gradient w.r.t. $\varDelta _c$ to the feasible subspace defined by (16) inside the LM optimizer. Note that the planes are initialized to $\mathbf {s}_i=(0, 0,1)^\mathtt {T}$ in the beginning of the algorithm, and by induction $\sum _{\mathbf {x}_i} \mathbf {s}_i^\mathtt {T}\mathbf {s}_i = \sum _{\mathbf {x}_i} d_i = 1$ is always satisfied for the first frame. In subsequent frames the constraint in (16) is not active.

Finally, to determine the precision matrices $\varLambda _i \in {\mathbb {R}}^{3 \times 3}$ needed for $E_{Temporal}^{(T+1)}$, we employ the approximate Hessian via the Jacobian $\mathbf {J}_{Match}$ of $E_{Match}^{(T)}$:

$$\begin{aligned} \begin{pmatrix} \tilde{H}_{\mathcal {S},\mathcal {S}} &{} \tilde{H}_{\mathcal {S},\xi _T}^\mathtt {T}\\ \tilde{H}_{\mathcal {S},\xi _T} &{} \tilde{H}_{\xi _T,\xi _T} \end{pmatrix} {:=} \mathbf {J}_{Match}^\mathtt {T}\mathbf {J}_{Match}, \end{aligned}$$

(17)

and the $3 \times 3$-diagonal block of the Schur complement $\tilde{H}_{\mathcal {S},\mathcal {S}} - \tilde{H}_{\mathcal {S},\xi _T}^\mathtt {T}\tilde{H}_{\xi _T,\xi _T}^{-1} \tilde{H}_{\mathcal {S},\xi _T}$ (denoted $\varLambda _{Match}$). We employ a forgetting factor $\lambda _{Temporal}$ to reduce the overconfident precision matrix, and update $\varLambda _i \leftarrow \lambda _{Temporal}\varLambda _i + \varLambda _{Match}$. Recall that $\tilde{H}_{\xi _T,\xi _T} \in {\mathbb {R}}^{6 \times 6}$ and $\tilde{H}_{\mathcal {S},\xi _T}$ are very sparse.

3.3 CPU Computation in Realtime

Thus far, we present our energy for each pixel in the input video stream. While this is generally useful for dense depth estimation, we may adopt our approach to semi-dense computation to reduce running time. Similar to LSD-SLAM, we can represent the image by its significant gradient values. By only computing on these gradients, execution is significantly reduced. In fact, in comparison to LSD-SLAM, we only need one additional LM iteration per split to introduce depth on top of pose estimation. Finally, we can limit the number of introduced depth components per resolution level to achieve constant running time.

4 Results

We perform our experiments on 13 video sequences in total, using 6 TUM [33] image streams and 7 sequences recoded ourselves. The TUM dataset comprises a number of video sequences with groundtruth pose, as recorded by a Vicon system, and approximate depth through depth sensors [33]. We select a subset of the handheld SLAM videos to measure system performance (i.e. fr1-desk, fr1-desk2, fr1-floor, fr1-room, fr2-xyz and fr3-office). As we are interested in the local aspect of SLAM (operating with single keyframe), we further divide these into smaller sequences. Notice, as we perform keyframe-to-frame comparison, the videos need to contain enough overlap with the reference image. Additionally, we record 7 videos, using a GoPro Hero 3 with a wide angle lens at 30 fps.

As a monocular approach, our method does not fix the scale. Hence, we employ a scale corrected error (SCE) for translation:

$$\begin{aligned} e(\mathbf {t}_t,\mathbf {\hat{t}}_t) = \left\| \mathbf {t}_t\frac{\Vert \mathbf {\hat{t}}_t\Vert }{\Vert \mathbf {t}_t\Vert } - \mathbf {\hat{t}} \right\| , \end{aligned}$$

(18)

where $\mathbf {t}_t$ is the translational displacement of the pose $\xi _t$, and $\mathbf {\hat{t}}_t$ is the groundtruth with respect to the keyframe (or initial frame). An error in rotation is indirectly captured, as it effects the translation of future frames. We now introduce a scale invariant measure to evaluate the depth’s completeness. Given true inverse depth at the keyframe $\hat{\mathcal {D}} = (\hat{d}_i)_{i=1}^{|\mathcal {X}|}$ we define the completeness as the proportion of depth values, satisfying a given accuracy $\epsilon $:

$$\begin{aligned} c\left( \hat{\mathcal {D}},\mathcal {D}\right) = \max _\alpha \sum _i^{|\mathcal {X}|} \frac{n_\alpha (\hat{d}_i,d_i)}{|\mathcal {X}|}, \text{ where }\; n_\alpha (\hat{d}_i,d_i) = \left\{ \begin{array}{l l} 1 &{} \text{ if }\; \Vert \frac{1}{\hat{d}_i} - \frac{\alpha }{d_i}\Vert < \epsilon \\ 0 &{} \text{ otherwise } \end{array}\right. . \end{aligned}$$

(19)

Parameter $\alpha $ represents scale and is found via grid search and refined through gradient decent. In our work, $\epsilon = 0.05$ which corresponds to ±5 cm.

4.1 Quantitative Evaluation on the TUM Dataset

We compare the proposed dense and semi-dense incremental planar system (DIP and SIP respectively) to two versions of LSD-SLAM: (1) we carefully implement a LSD-SLAM version that only uses a single keyframe (LSD-Key), and (2) the original LSD-SLAM as provided by authors of [9], without loop closures or other constraints (LSD-SLAM). We further ensure that mapping is guaranteed to run after every tracking step in both LSD-SLAM systems. Finally, we include our method as disjoint optimization for pose and depth separately and sequentially. Table 1 shows the median SCE for different numbers of frames. The median is calculated over all snippets taken from the individual TUM sequences.

Table 1. Median Scale Corrected Error (in mm) for the compared methods after the listed frame number for different TUM-Dataset sequences. (Note, different characteristics of camera motion in each video lead to different length of keyframe overlaps.)

Full size table

The sequences fr1-desk and fr1-desk2 show an office environment with high camera motion and little overlap towards keyframes. Here, the trajectories are quickly lost when a single keyframe is used. SIP performs best at early stages, while DIP is more suitable for longer tracking. The sequences fr1-floor and fr1-room also have little keyframe overlap, but with slower motion. Here LSD-SLAM performs competitively, as it benefits from keyframe generation.

Long-term tracks are achieved in fr2-xyz and fr3-office. We take a more detailed look at the results of fr3-office. Figure 4 plots the median SCE for each duration. We see that LSD-SLAM and DIP have similar performance early on, but DIP performs better at later stages. Notice, as LSD-SLAM generates new reference images, the baseline is typically small. In contrast DIP benefits from larger baselines. LSD-Key loses track quickly, while SIP performs well in early stages. The trajectory and inverse depth maps for the very first 300 frames are shown in Fig. 5. Figure 6 plots the depth completeness. Here, DIP and SIP reach a peak correctness with increasing baseline, after which they slightly degrades as points are outside the current view, and smoothing takes over their energies.

We remark, similar to many approaches based on gradient decent, our method converges to local minima. However our method relies on graduated optimization which aims to avoid getting trapped in bad minima by optimizing a smoother energy with gradually increased complexity [34]. In contrast to LSD-SLAM, we employ graduated optimization in depth perception as well as traditional scale-space image pyramids leading to superior results. The indicator function is a surrogate for the scale-space pyramid in depth. Finally, we note that the disjoint version is consistently worse in virtually all experiments. The difference is the impact of graduated optimization. For Disjoint, changes in perceived depth are not utilized for pose at the current frame. In contrast, joint optimization finds pose and depth at the same time, yielding improved performance.

In terms of running time, LSD-SLAM and LSD-Key perform tracking and mapping at 14 fps, while SIP performs twice as fast at 30 fps on CPU. DIP is slower on CPU (2 fps), but its GPU implementation runs in realtime (30 fps).

4.2 Qualitative Results

We conclude the experimental with example frames of our 7 additional video sequences (Fig. 7). Generally, LSD-SLAM smoothes well in the local neighborhood, while SIP and DIP perform more consistent on the global inverse depth hypothesis. We note, even with non-planar scenes our methods performs well. We argue, that the local planar surface assumption is reasonable in most environments, as was also witnessed by recent stereo systems, e.g. [13, 23, 24]. Nonetheless, in non-urban scenes, and in situations where the initial frontal plane assumption is significantly wrong (recall initialization of $\mathbf {s}_i = (0,0,1)^\mathtt {T}$), the results are less favorable as seen in the last row of Fig. 7.

5 Conclusion

We introduced a carefully derived coarse-to-fine planar regularization strategy that optimizes for both, pose and depth simultaneously from monocular streams. Our framework is keyframe-based, and incrementally improves its depth hypothesis at each frame as new data arrives. As semi-dense approach, the proposed method runs in realtime on CPU, while realtime for the dense version can be achieved on GPU. In our evaluation, we improved upon the front-end of LSD-SLAM whilst increasing execution time by a factor of two.

Notes

1.
The block-diagonal is an overconfident approximation of the precision. As compensation, we employ a forgetting factor $\lambda _{Temporal}$ in our implementation (see Sect. 3.2).

References

Barfield, W.: Fundamentals of Wearable Computers and Augmented Reality, 2nd edn. CRC Press, Boca Raton (2016)
Google Scholar
Engel, J., Sturm, J., Cremers, D.: Scale-aware navigation of a low-cost quadrocopter with a monocular camera. Robot. Auton. Syst. 62(11), 1646–1656 (2014)
Article Google Scholar
Forster, C., Pizzoli, M., Scaramuzza, D.: SVO: fast semi-direct monocular visual odometry. In: ICRA 2014, pp. 15–22 (2014)
Google Scholar
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR 2012, pp. 3354–3361 (2012)
Google Scholar
Miksik, O., Vineet, V., Lidegaard, M., Prasaath, R., Nießner, M., Golodetz, S., Hicks, S., Pérez, P., Izadi, S., Torr, P.: The semantic paintbrush: interactive 3D mapping and recognition in large outdoor spaces. In: ACM Conference Human Factors in Computing, CHI 2015, pp. 3317–3326 (2015)
Google Scholar
Vineet, V., Miksik, O., Lidegaard, M., Nießner, M., Golodetz, S., Prisacariu, V., Kähler, O., Murray, D., Izadi, S., Pérez, P., Torr, P.: Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In: ICRA 2015 (2015)
Google Scholar
Schöps, T., Engel, J., Cremers, D.: Semi-dense visual odometry for AR on a smartphone. In: ISMAR 2014, pp. 145–150 (2014)
Google Scholar
Newcombe, R., Lovegrove, S., Davison, A.: DTAM: dense tracking and mapping in real-time. In: IEEE International Conference on Computer Vision, ICCV 2011, pp. 2320–2327 (2011)
Google Scholar
Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part II. LNCS, vol. 8690, pp. 834–849. Springer, Heidelberg (2014)
Google Scholar
Miksik, O., Amar, Y., Vineet, V., Pérez, P., Torr, P.: Incremental dense multi-modal 3D scene reconstruction. In: IROS 2015 (2015)
Google Scholar
Newcombe, R., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A.: KinectFusion: real-time dense surface mapping and tracking. In: ISMAR 2011, pp. 127–136 (2011)
Google Scholar
Salas-Moreno, R., Glocker, B., Kelly, P., Davison, A.: Dense planar SLAM. In: ISMAR 2014, pp. 157–164 (2014)
Google Scholar
Yamaguchi, K., McAllester, D., Urtasun, R.: Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part V. LNCS, vol. 8693, pp. 756–771. Springer, Heidelberg (2014)
Google Scholar
Nister, D., Naroditsky, O., Bergen, J.: Indoor positioning using multi-frequency RSS with foot-mounted INS. In: CVPR 2004, pp. 652–659 (2004)
Google Scholar
Davison, A.: Real-time simultaneous localisation and mapping with a single camera. In: CVPR 2003, pp. 1403–1410 (2003)
Google Scholar
Davison, A., Reid, I., Molton, N., Stasse, O.: MonoSLAM: real-time single camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1052–1067 (2007)
Article Google Scholar
Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: ISMAR 2007 (2007)
Google Scholar
Wendel, A., Maurer, M., Graber, G., Pock, T., Bischof, H.: Dense reconstruction on-the-fly. In: CVPR 2012, pp. 1450–1457 (2012)
Google Scholar
Pradeep, V., Rhemann, C., Izadi, S., Zach, C., Bleyer, M., Bathiche, S.: MonoFusion: real-time 3D reconstruction of small scenes with a single web camera. In: IEEE on ISMAR, pp. 83–88 (2013)
Google Scholar
Concha, A., Civera, J.: DPPTAM: dense piecewise planar tracking and mapping from a monocular sequence. In: IROS 2015 (2015)
Google Scholar
Tarrio, J., Pedre, S.: Realtime edge-based visual odometry for a monocular camera. In: IEEE International Conference on Computer Vision, ICCV 2015, pp. 702–710 (2015)
Google Scholar
Geiger, A., Roser, M., Urtasun, R.: Efficient large-scale stereo matching. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part I. LNCS, vol. 6492, pp. 25–38. Springer, Heidelberg (2011)
Chapter Google Scholar
Sinha, S., Scharstein, D., Szeliski, S.: Efficient high-resolution stereo matching using local plane sweeps. In: CVPR 2014, pp. 1582–1589 (2014)
Google Scholar
Zhang, C., Li, Z., Cheng, Y., Cai, R., Chao, H., Rui, Y.: MeshStereo: a global stereo model with mesh alignment regularization for view interpolation. In: IEEE International Conference on Computer Vision, ICCV 2015, pp. 2057–2065 (2015)
Google Scholar
Becker, F., Lenzen, F., Kappes, J., Schnörr, C.: Variational recursive joint estimation of dense scene structure and camera motion from monocular high speed traffic sequences. In: IEEE International Conference on Computer Vision, ICCV 2011, pp. 1692–1699 (2011)
Google Scholar
Concha, A., Hussain, W., Montano, L., Civera, J.: Incorporating scene priors to dense monocular mapping. Auton. Robots 39(3), 279–292 (2015)
Article Google Scholar
Salas, M., Hussain, W., Concha, A., Montano, L., Civera, J., Montiel, J.: Layout aware visual tracking and mapping. In: IROS 2015 (2015)
Google Scholar
Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artifical Intelligence, IJCAI 1981, pp. 674–679 (1981)
Google Scholar
Li, H., Summer, R., Pauly, M.: Global correspondence optimization for non-rigid registration of depth scans. Comput. Graph. Forum 27(5), 1421–1430 (2008)
Article Google Scholar
Yang, J., Li, H.: Dense, accurate optical flow estimation with piecewise parametric model. In: ECCV 2015, pp. 1019–1027 (2015)
Google Scholar
Westheimer, G.: Cooperative neural processes involved in stereoscopic acuity. Exp. Brain Res. 36, 585–597 (1979)
Article Google Scholar
Zach, C.: Robust bundle adjustment revisited. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part V. LNCS, vol. 8693, pp. 772–787. Springer, Heidelberg (2014)
Google Scholar
Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of RGB-D SLAM systems. In: IROS 2012 (2012)
Google Scholar
Mobahi, H., Fisher III, J.W.: On the link between Gaussian homotopy continuation and convex envelopes. In: Tai, X.-C., Bae, E., Chan, T.F., Lysaker, M. (eds.) EMMCVPR 2015. LNCS, vol. 8932, pp. 43–56. Springer, Heidelberg (2015)
Google Scholar

Download references

Acknowledgment

O. Miksik is supported by Technicolor. P. Torr wishes to acknowledges the support of ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC/MURI grant ref EP/N019474/1, EPSRC grant EP/M013774/1, EPSRC Programme Grant Seebibyte EP/M013774/1.

Author information

Authors and Affiliations

Toshiba Research Europe, Cambridge, UK
Stephan Liwicki & Christopher Zach
University of Oxford, Oxford, UK
Ondrej Miksik & Philip H. S. Torr

Authors

Stephan Liwicki
View author publications
You can also search for this author in PubMed Google Scholar
Christopher Zach
View author publications
You can also search for this author in PubMed Google Scholar
Ondrej Miksik
View author publications
You can also search for this author in PubMed Google Scholar
Philip H. S. Torr
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Stephan Liwicki .

Editor information

Editors and Affiliations

RWTH Aachen, Aachen, Germany
Bastian Leibe
Czech Technical University, Prague 2, Czech Republic
Jiri Matas
University of Trento, Povo - Trento, Italy
Nicu Sebe
University of Amsterdam, Amsterdam, The Netherlands
Max Welling

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 26828 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Liwicki, S., Zach, C., Miksik, O., Torr, P.H.S. (2016). Coarse-to-fine Planar Regularization for Dense Monocular Depth Estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science(), vol 9906. Springer, Cham. https://doi.org/10.1007/978-3-319-46475-6_29

Download citation

DOI: https://doi.org/10.1007/978-3-319-46475-6_29
Published: 17 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46474-9
Online ISBN: 978-3-319-46475-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics