
1 Introduction

Depth estimation from a single monocular image [24] is a fundamental problem in computer vision, with various applications in stereo vision, robotics, and scene understanding [17, 26]. In a typical setting, most approaches [1, 9, 17, 25] use a standard regression or classification pipeline to predict depth, orientation, and plane fitting. Such a pipeline consists of the computation of dense or sparse features, followed by an appearance feature representation and regressor training. The responses of a classifier or a regressor are combined in a probabilistic framework, and under very strong geometric priors the most probable scene layout is estimated. Despite promising progress, these methods are still far from practical application: the conflict between model capacity and data scalability results in over-fitting or under-fitting for such learning-based paradigms.

Fig. 1. The framework of our method.

With the proliferation of 3D sensing devices (e.g., Kinect) and mature 3D modeling techniques such as Structure-from-Motion [5, 7] and visual SLAM [21, 30], massive-scale 3D data such as point clouds and depth maps are available nowadays, which can provide rich correspondences between 2D visual appearance and 3D depth structure. Is it therefore possible to take advantage of such rich 2D-3D correspondences and move towards a search-based paradigm for depth estimation? In this work, we tackle depth estimation from a perspective different from that of traditional methods [11-13, 16, 17, 25, 26]. In general, we adopt a search-based paradigm that leverages dictionary-based cross-modality retrieval to robustly and efficiently find the best-matching 3D depth for a given 2D query patch as the local depth estimator. This is followed by a Taylor-formula-based contextual refinement to achieve a consistent yet accurate global depth estimate at the image level.

In particular, unlike traditional approaches [25, 26] that learn a regressor from image to depth indirectly, we first perform joint dictionary learning to bridge the similarity gap between 2D image patches and 3D depth maps and thus facilitate cross-modal retrieval. Then, given an image patch, we search a large reference set for the corresponding 3D patches and transfer the depth information between the matched 2D and 3D local patches. This approach provides key advantages in both online efficiency and generalization ability.

The above patch-wise local depth estimation is further integrated with spatial contextual constraints using a Conditional Random Field (CRF), as commonly adopted in existing works [6, 11, 17, 26]. To evaluate the performance of the proposed method, we conduct experiments on the widely used Make3D and NYUv2 datasets. We compare the proposed method with several existing state-of-the-art methods, including Make3D [26], Semantic Labelling [17] and Depth Transfer [11], and report significant performance gains that demonstrate the advantages of the proposed model. The main contributions of our work are three-fold:

  • We propose a novel cross-modality retrieval paradigm that does not rely on training depth regressors, thus tackling the over- and under-fitting issues of previous methods;

  • A novel coupled dictionary learning scheme is introduced to bridge the similarity gap between 2D queries and 3D references, with detailed analytical solutions for fast yet accurate parameter learning;

  • We adopt a contextual refinement based on Taylor expansion and CRF inference, which also improves the generalization capability. Compared with traditional methods [11, 26], the proposed inference does not require parameter fitting on the training set.

2 Related Work

Previous works [4, 26] on depth estimation from a single monocular image typically follow a regression setting. In this setting, the image is first over-segmented into superpixels, and then a pre-trained local depth regressor is applied to each individual superpixel to estimate the corresponding local depth. Subsequently, a Markov Random Field (MRF) or Conditional Random Field (CRF) [4] is frequently employed to impose spatial constraints on the estimated local depth. Such contextual cues usually include the 3D location and orientation of the patch [26], as well as the global context [25] among patches. For instance, in [17], Liu et al. partitioned depth estimation into two phases, i.e., semantic segmentation [27] and 3D reconstruction, with the semantic labels guiding the 3D reconstruction. In [20], Liu et al. modeled depth estimation as a discrete-continuous optimization problem, where the continuous variables encode the depth of superpixels in the input image and the discrete ones represent relationships between neighboring superpixels. Karsch et al. [10, 11] inferred the depth map in three stages: candidate image discovery, point-wise alignment, and an optimization procedure. More recently, deep learning [6] was introduced for single image depth estimation. For instance, Liu et al. [19] combined a Convolutional Neural Network (CNN) with a CRF model for depth estimation, where the CNN learns the geometric priors and the CRF further refines the depth among adjacent superpixels. Similar to [19], the method in [15] also extracted deep CNN features for depth regression, combined with CRF-based post-processing. The limitation of these state-of-the-art methods for single image depth estimation is closely tied to the properties of perspective geometry, which becomes a bottleneck for current RGB-D based methods. In contrast, this limitation barely affects methods based on 3D models, since a 3D model can offer all stereo perspectives, which provides a new way to overcome this bottleneck.

Cross-modality retrieval has also attracted considerable research attention in recent years. In [34], Wang et al. built a cross-modality probabilistic graphical model to discover mutually consistent semantic information among different modalities. In [23], cross-modal correlations and semantic abstraction were employed to jointly model the text and image components. Zhuang et al. [37] proposed the SliM\(^2\) model, which formulates multimodal mapping as a constrained dictionary learning problem, where label information [3] is employed to discover the shared intra-modality structure. More recently, deep learning methods were further employed in cross-modality retrieval [31, 34] for text-to-image and image-to-text search. The main disadvantage of existing cross-modality retrieval methods is that they cannot learn structure information from images and 3D models. In contrast, we can recover the structure information of image patches by sharing the reconstruction coefficients through coupled dictionary learning.

Recent works have shown the effectiveness of coupled dictionary learning in exploring the inherent correlations between two data channels. Here, we introduce the works most relevant to ours. Wang et al. [32] proposed a semi-coupled dictionary learning (SCDL) method to conduct cross-style image synthesis. Yang et al. [36] employed a neural network to jointly learn dictionaries of different resolutions. To tackle the deblurring problem, Xiang et al. [35] trained dictionaries on clean and blurred images jointly, and Wang et al. [33] learned a dictionary on deblurred intermediate results and blurred images jointly. Shekhar et al. [28] established identity from multi-source information by joint sparse representation, and He et al. [8] jointly learned overcomplete dictionaries for single-image super-resolution. Note that all of the above methods assume that both dictionaries are learned on data of the same modality, so it is very challenging to capture cross-modality similarity with existing coupled dictionary learning methods.

3 Cross-Modality Retrieval for Local Depth Estimation

The first step is to infer local depth from a single image. To this end, we first train dictionaries for the two modalities (i.e., 2D image patches and 3D models) jointly and then conduct cross-modality retrieval for each target patch. Based on the retrieval results, we estimate the depth directly from the most correlated 3D model. The rest of this section presents the details of the above process.Footnote 1

3.1 Coupled Dictionary Learning

Our basic assumption is that if an object can be decomposed into a set of 3D objects, its 2D projection should be decomposable in the same way, and vice versa. Therefore, given a set of 2D patchesFootnote 2 \(\varvec{x}_{\text {im}}^{j} \left( j= 1,\cdots ,n\right) \) and the corresponding 3D models \(\varvec{x}_{\text {dep}}^{j}\left( j= 1,\cdots ,n\right) \), from a dictionary learning perspective we aim to obtain a pair of codes, \(\varvec{y}_{\text {im}}^j\) for \(\varvec{x}_{\text {im}}^{j}\) and \(\varvec{y}_{\text {dep}}^j\) for \(\varvec{x}_{\text {dep}}^{j}\), based on two dictionaries \(\varvec{D}_{\text {im}}\) and \(\varvec{D}_{\text {dep}}\). These two codes are supposed to be similar after a proper projection. This intuition leads to the following formulation:

$$\begin{aligned} \begin{aligned}&\min _{\varvec{D}_{\text {im}},\varvec{D}_{\text {dep}}}\sum _{j}\left\| \varvec{x}_{\text {im}}^{j}-\varvec{D}_{\text {im}}\cdot \varvec{y}_{\text {im}}^{j}\right\| _{2}^{2}+\alpha \left\| \varvec{x}_{\text {dep}}^{j}-\varvec{D}_{\text {dep}}\cdot \varvec{y}_{\text {dep}}^{j}\right\| _{2}^{2} \\&\quad \quad +\beta \left\| \varvec{y}_{\text {dep}}^{j}-\varvec{R}\cdot \varvec{y}_{\text {im}}^{j}\right\| _{2}^{2}\\&s.t.\quad \varvec{R}^{T}\cdot \varvec{R} = \varvec{I}, \end{aligned} \end{aligned}$$
(1)

where \(\varvec{y}_{\text {dep}}^{j}\) and \(\varvec{y}_{\text {im}}^{j}\) are the reconstruction coefficients. \(\varvec{D}_{\text {im}}= [ \varvec{d}_{\text {im}}^{1},\varvec{d}_{\text {im}}^{2},\cdots ,\varvec{d}_{\text {im}}^{c} ]\in \mathfrak {R}^{p\times c}\) is the dictionary of 2D image patches, while \(\varvec{D}_{\text {dep}}= \left[ \varvec{d}_{\text {dep}}^{1},\varvec{d}_{\text {dep}}^{2},\cdots ,\varvec{d}_{\text {dep}}^{c} \right] \in \mathfrak {R}^{q\times c}\) is the dictionary of 3D models. The first term in Eq. 1 is the reconstruction error between the 2D image patches \(\varvec{x}_{\text {im}}^{j}\) and their representations; the second term is the corresponding 3D reconstruction error; and the third term is the projection error between the coefficients \(\varvec{y}_{\text {dep}}^{j}\) and \(\varvec{y}_{\text {im}}^{j}\). Through the projection matrix \(\varvec{R}\), the 3D and 2D coefficients are connected to enable cross-modality similarity matching.

However, it is not entirely proper to force the projection matrix \(\varvec{R}\) to be orthogonal. Although orthogonality guarantees that \(\varvec{R}\) is full rank and makes the two coefficient spaces equivalent, such a strict constraint may lead to a suboptimal result. We therefore relax the constraint and merge the second and third terms, yielding Eq. 2, which is equivalent to Eq. 1 but less restrictive.

$$\begin{aligned} \begin{aligned} \min _{\varvec{D}_{\text {im}},\varvec{D}_{\text {dep}}} \left\| \varvec{X}_{\text {im}}-\varvec{D}_{\text {im}} \varvec{Y}\right\| _{F}^{2}+\alpha \left\| \varvec{X}_{\text {dep}}-\varvec{D}_{\text {dep}}\varvec{Y}\right\| _{F}^{2}.\\ \end{aligned} \end{aligned}$$
(2)

where \(\varvec{Y} = \left[ \varvec{y}^{1},\varvec{y}^{2},\cdots ,\varvec{y}^{n}\right] \) is the coefficient matrix and \(\varvec{X}_{\text {im}} =\{\varvec{x}_{\text {im}}^{1},\varvec{x}_{\text {im}}^{2},\cdots ,\varvec{x}_{\text {im}}^{n}\}\), \(\varvec{x}_{\text {im}}^{i}\in \mathfrak {R}^{p\times 1}\) is a set of n RGB image patches, whose corresponding depth image patchesFootnote 3 are \(\varvec{X}_{\text {dep}} = \{ \varvec{x}_{\text {dep}}^{1},\varvec{x}_{\text {dep}}^{2},\cdots ,\varvec{x}_{\text {dep}}^{n}\}\), \(\varvec{x}_{\text {dep}}^{i}\in \mathfrak {R}^{q\times 1}\).
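To make the formulation concrete, the following minimal NumPy sketch evaluates the relaxed objective of Eq. 2 under the shapes defined above; the function name and the default \(\alpha\) are illustrative choices, not part of the paper.

```python
import numpy as np

def coupled_objective(X_im, X_dep, D_im, D_dep, Y, alpha=1.0):
    """Value of the relaxed coupled objective in Eq. 2.

    X_im:  p x n matrix of RGB image patches (one patch per column).
    X_dep: q x n matrix of corresponding depth patches.
    D_im:  p x c RGB dictionary; D_dep: q x c depth dictionary.
    Y:     c x n shared coefficient matrix.
    """
    r_im = np.linalg.norm(X_im - D_im @ Y, "fro") ** 2     # 2D reconstruction error
    r_dep = np.linalg.norm(X_dep - D_dep @ Y, "fro") ** 2  # 3D reconstruction error
    return r_im + alpha * r_dep
```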

To solve Eq. 2, an alternating minimization approach is designed.

  1. Fix \(\varvec{D}\) to optimize \(\varvec{Y}\):

    $$\begin{aligned} \begin{aligned} \min _{\varvec{Y}}\left\| \varvec{X}_{\text {im}}-\varvec{D}_{\text {im}} \varvec{Y}\right\| _{F}^{2}+\alpha \left\| \varvec{X}_{\text {dep}}-\varvec{D}_{\text {dep}}\varvec{Y}\right\| _{F}^{2}, \end{aligned} \end{aligned}$$
    (3)

    This is an unconstrained optimization problem, whose analytic solution is

    $$\begin{aligned} \begin{aligned} \varvec{Y} =\left( \varvec{D}^{T}_{\text {im}}\varvec{D}_{\text {im}}+\alpha \varvec{D}^{T}_{\text {dep}}\varvec{D}_{\text {dep}}\right) ^{-1}\left( \varvec{D}_{\text {im}}^{T}\varvec{X}_{\text {im}} + \alpha \cdot \varvec{D}_{\text {dep}}^{T}\varvec{X}_{\text {dep}}\right) , \end{aligned} \end{aligned}$$
    (4)
  2. Fix \(\varvec{Y}\) to update \(\varvec{D}\); then Eq. 2 can be reformulated as:

    $$\begin{aligned} \begin{aligned} \min _{\varvec{D}_{t}}\left\| \varvec{X}_{t}-\varvec{D}_{t} \varvec{Y}\right\| _{F}^{2},t\in \{\text {im,dep}\}, \end{aligned} \end{aligned}$$
    (5)

    which can be solved by post-multiplying with the Moore-Penrose generalized inverse matrix [2] of \(\varvec{Y}\)Footnote 4 as

    $$\begin{aligned} \begin{aligned} \varvec{D}_{t} =&\varvec{X}_{t} \text {Inverse} \left( \varvec{Y}\right) \\ \text {Inverse} \left( \varvec{Y}\right) =&\varvec{Y}^{T} \left( \varvec{Y}\varvec{Y}^{T}+\varvec{I} \epsilon \right) ^{-1}, \end{aligned} \end{aligned}$$
    (6)
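The following NumPy sketch puts the two alternating steps together, assuming column-stacked patch matrices as above; the random initialization, the iteration count, and the small ridge added to the \(\varvec{Y}\)-step for numerical stability are our own illustrative choices rather than part of the paper.

```python
import numpy as np

def learn_coupled_dictionaries(X_im, X_dep, c=1024, alpha=1.0,
                               eps=1e-6, n_iters=50, seed=0):
    """Alternating minimization of Eq. 2 (sketch).

    The Y-step follows the closed form of Eq. 4; the D-step follows the
    regularized pseudo-inverse of Eqs. 5-6.
    """
    rng = np.random.default_rng(seed)
    p, q = X_im.shape[0], X_dep.shape[0]
    D_im = rng.standard_normal((p, c))
    D_dep = rng.standard_normal((q, c))
    Y = None
    for _ in range(n_iters):
        # Eq. 4: closed-form update of the shared coefficient matrix Y
        # (a small ridge keeps the system well conditioned when c > p + q).
        A = D_im.T @ D_im + alpha * (D_dep.T @ D_dep) + eps * np.eye(c)
        B = D_im.T @ X_im + alpha * (D_dep.T @ X_dep)
        Y = np.linalg.solve(A, B)
        # Eq. 6: update both dictionaries via the regularized pseudo-inverse of Y.
        Y_pinv = Y.T @ np.linalg.inv(Y @ Y.T + eps * np.eye(c))
        D_im = X_im @ Y_pinv
        D_dep = X_dep @ Y_pinv
    return D_im, D_dep, Y
```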
Fig. 2. Visualization of the learned dictionaries: the left four columns are from the Make3D dataset and the right four columns from the NYUv2 dataset. The first row shows test images; the second and third rows show the RGB feature dictionaries and the depth dictionaries, respectively, trained on the candidate [22] images.

3.2 Cross-Modality Retrieval

So far, we have trained the dictionaries \(\varvec{D}_{\text {dep}}\) and \(\varvec{D}_{\text {im}}\). Given a set of queries \(\varvec{X}_{\text {im}}^{\theta }\) of 2D patches, our goal is to obtain the corresponding 3D model \(\varvec{X}_{\text {dep}}^{\theta }\). The optimal result can be obtained from Eq. 2 as

$$\begin{aligned} \begin{aligned} \min _{\varvec{Y}^{\theta },\varvec{X}^{\theta }_{\text {dep}}}\left\| \varvec{X}^{\theta }_{\text {im}}-\varvec{D}_{\text {im}} \varvec{Y}^{\theta }\right\| _{F}^{2}+\alpha \left\| \varvec{X}^{\theta }_{\text {dep}}-\varvec{D}_{\text {dep}}\varvec{Y}^{\theta }\right\| _{F}^{2}.\\ \end{aligned} \end{aligned}$$
(7)

To accelerate the convergence in Eq. 7, we can initialize parameters using

$$\begin{aligned} \begin{aligned} \hat{\varvec{Y}}^{\theta } =&\min _{\varvec{Y}} \left\| \varvec{X}_{\text {im}}^{\theta }-\varvec{D}_{\text {im}} \varvec{Y}^{\theta }\right\| _{2}^{2} + \alpha \left\| \varvec{X}^{\theta }_{\text {dep}}-\varvec{D}_{\text {dep}}\varvec{Y}^{\theta }\right\| _{F}^{2},\\ \hat{\varvec{X}}_{\text {dep}}^{\theta } =&\varvec{D}_{\text {dep}} \hat{\varvec{Y}}^{\theta }. \end{aligned} \end{aligned}$$
(8)

After obtaining the reconstruction coefficients \(\hat{\varvec{Y}}^{\theta }\) and the related 3D model \(\hat{\varvec{X}}_{\text {dep}}^{\theta }\) of the image patches, we can optimize the entire image by using \(\hat{\varvec{X}}_{\text {dep}}^{\theta }\) as the initial depth value.Footnote 5
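A minimal sketch of this local estimation step is given below; since the query depth is unknown at test time, it encodes the query patches with the RGB dictionary only and then decodes depth with \(\varvec{D}_{\text {dep}}\) (i.e., the initialization of Eq. 8), leaving the joint refinement of Eq. 7 to further alternating updates. The ridge term is our own numerical safeguard.

```python
import numpy as np

def infer_local_depth(X_im_query, D_im, D_dep, eps=1e-6):
    """Sketch of the initialization in Eq. 8.

    X_im_query: p x m matrix of query RGB patches (one per column).
    Returns the coefficients Y_hat (c x m) and the estimated depth
    patches X_dep_hat (q x m).
    """
    c = D_im.shape[1]
    A = D_im.T @ D_im + eps * np.eye(c)
    Y_hat = np.linalg.solve(A, D_im.T @ X_im_query)  # least-squares coding on D_im
    X_dep_hat = D_dep @ Y_hat                        # cross-modal transfer via D_dep
    return Y_hat, X_dep_hat
```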

4 Large Margin Structure Inference

We gather the initial depth patches (Sect. 3.2) to form the initial depth of the entire image, \(\varvec{I}_{\text {dep}}^{0}\). There are N images \(\varvec{I}_{\text {im}}^{i}\left( i=1,\cdots ,N\right) \) in the dataset that are similar [22] to the query image \(\varvec{I}_{\text {im}}^{0}\) in RGB space, whose depth images are \(\varvec{I}_{\text {dep}}^{i}\left( i=1,\cdots ,N\right) \). The depth image we want to infer is \(\widetilde{\varvec{I}}_{\text {dep}}\).

Algorithm 1. The proposed algorithm.

The Taylor expansions of \(\widetilde{I}_{\text {dep}}\) and \(I_{\text {dep}}^{i}\) at a point \(\left( a,b\right) \) are

$$\begin{aligned} \begin{aligned} \widetilde{I}_{\text {dep}}\left( x,y\right)&= \widetilde{I}_{\text {dep}}\left( a,b\right) + \varvec{\nabla }_{x}\widetilde{I}_{\text {dep}}\left( a,b\right) \cdot \left( x-a\right) + \varvec{\nabla }_{y}\widetilde{I}_{\text {dep}}\left( a,b\right) \cdot \left( y-b\right) \\&+ \frac{1}{2}\varvec{\nabla }_{x}^{2}\widetilde{I}_{\text {dep}}\left( a,b\right) \cdot \left( x-a\right) ^{2} + \frac{1}{2}\varvec{\nabla }_{y}^{2}\widetilde{I}_{\text {dep}}\left( a,b\right) \cdot \left( y-b\right) ^{2}\\&+ \frac{1}{2}\varvec{\nabla }_{x,y}\widetilde{I}_{\text {dep}}\left( a,b\right) \cdot \left( x-a\right) \left( y-b\right) \\&+ \frac{1}{2}\varvec{\nabla }_{y,x}\widetilde{I}_{\text {dep}}\left( a,b\right) \cdot \left( x-a\right) \left( y-b\right) + R_{n}\left( x,y\right) \\ \end{aligned} \end{aligned}$$
(12)

and

$$\begin{aligned} \begin{aligned} I_{\text {dep}}^{i}\left( x,y\right)&= I_{\text {dep}}^{i}\left( a,b\right) + \varvec{\nabla }_{x}I_{\text {dep}}^{i}\left( a,b\right) \cdot \left( x-a\right) + \varvec{\nabla }_{y}I_{\text {dep}}^{i}\left( a,b\right) \cdot \left( y-b\right) \\&+ \frac{1}{2}\varvec{\nabla }_{x}^{2}I_{\text {dep}}^{i}\left( a,b\right) \cdot \left( x-a\right) ^{2} + \frac{1}{2}\varvec{\nabla }_{y}^{2}I_{\text {dep}}^{i}\left( a,b\right) \cdot \left( y-b\right) ^{2}\\&+ \frac{1}{2}\varvec{\nabla }_{x,y}I_{\text {dep}}^{i}\left( a,b\right) \cdot \left( x-a\right) \left( y-b\right) \\&+ \frac{1}{2}\varvec{\nabla }_{y,x}I_{\text {dep}}^{i}\left( a,b\right) \cdot \left( x-a\right) \left( y-b\right) + L_{n}\left( x,y\right) , \end{aligned} \end{aligned}$$
(13)

where \(R_{n}\left( x,y\right) \) and \(L_{n}\left( x,y\right) \) are higher-order remainder terms. To make \(\widetilde{I}_{D}\) and \(I_{D}^{i}\) similar, the expansions in Eqs. 12 and 13 should also be similar. We can then write \(G_{sim}\) and \(G_{sel}\) as

$$\begin{aligned} \begin{aligned} G_{sim}&= \sum _{i=1}^{N} \left\| \varvec{W}_{i}\cdot \left( \widetilde{I}_{D}-I_{D}^{i}\right) \right\| + \alpha \left\| \varvec{W}_{i}\cdot \left( \varvec{\nabla }_{x}\widetilde{I}_{D} - \varvec{\nabla }_{x}I_{D}^{i}\right) \right\| \\ +&\alpha \left\| \varvec{W}_{i}\cdot \left( \varvec{\nabla }_{y}\widetilde{I}_{D} - \varvec{\nabla }_{y}I_{D}^{i}\right) \right\| + \beta \left\| \varvec{W}_{i} \left( \varvec{\nabla }_{x}^{2}\widetilde{I} -\varvec{\nabla }_{x}^{2}I_{D}^{i} \right) \right\| \\ +&\beta \left\| \varvec{W}_{i} \left( \varvec{\nabla }_{y}^{2}\widetilde{I} -\varvec{\nabla }_{y}^{2}I_{D}^{i} \right) \right\| + \beta \left\| \varvec{W}_{i} \left( \varvec{\nabla }_{x,y}\widetilde{I} -\varvec{\nabla }_{x,y}I_{D}^{i}\right) \right\| \\ +&\beta \left\| \varvec{W}_{i} \left( \varvec{\nabla }_{y,x}\widetilde{I} -\varvec{\nabla }_{y,x}I_{D}^{i}\right) \right\| , \end{aligned} \end{aligned}$$
(14)

and

$$\begin{aligned} \begin{aligned} G_{sel}&= \gamma \left\| \widetilde{I}_{D}-I_{D}^{0} \right\| + \alpha \left( \left\| \varvec{W}_{0} \cdot \varvec{\nabla }_{x}\widetilde{I}_{D} \right\| +\left\| \varvec{W}_{0}\cdot \varvec{\nabla }_{y}\widetilde{I}_{D} \right\| \right) \\ + \beta&\left( \left\| \varvec{W}_{0}\cdot \varvec{\nabla }_{x}^{2}\widetilde{I}\right\| + \left\| \varvec{W}_{0}\cdot \varvec{\nabla }_{y}^{2}\widetilde{I} \right\| + \left\| \varvec{W}_{0}\cdot \varvec{\nabla }_{x,y}\widetilde{I} \right\| + \left\| \varvec{W}_{0}\cdot \varvec{\nabla }_{y,x}\widetilde{I} \right\| \right) . \end{aligned} \end{aligned}$$
(15)

where \(G_{sim}\) measures the similarity between the input RGB image and the candidate images, and \(G_{sel}\) is a self-control term that encourages adjacent points in the image to have similar depth values.
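The gradient operators that appear in \(G_{sim}\) and \(G_{sel}\) (and in Eqs. 17-18 below) can be represented as sparse matrices acting on the flattened depth image. The paper does not specify the discretization, so the sketch below assumes simple forward differences and row-major flattening.

```python
import numpy as np
import scipy.sparse as sp

def diff1d(n):
    # n x n forward-difference matrix with a zero last row.
    main = np.concatenate([-np.ones(n - 1), [0.0]])
    return sp.diags([main, np.ones(n - 1)], [0, 1], format="csr")

def gradient_operators(h, w):
    """Sparse first- and second-order gradient operators for an h x w
    depth image flattened row-major into a column vector (sketch)."""
    Gx = sp.kron(sp.identity(h), diff1d(w), format="csr")  # horizontal gradient
    Gy = sp.kron(diff1d(h), sp.identity(w), format="csr")  # vertical gradient
    return {"x": Gx, "y": Gy,
            "xx": Gx @ Gx, "yy": Gy @ Gy,   # second-order gradients
            "xy": Gx @ Gy, "yx": Gy @ Gx}   # mixed second-order gradients
```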

Similar to a regular CRF, \(G_{sim}\) and \(G_{sel}\) can be reformulated as a traditional MRF, i.e., a smoothing term \(\varPsi _{ds}\left( \widetilde{\varvec{I}}_{\text {dep}}\right) \), a data term \(\varPsi _{dd}\left( \widetilde{\varvec{I}}_{\text {dep}},\varvec{I}_{\text {dep}}^{i},\widetilde{\varvec{I}}_{\text {im}},\varvec{I}_{\text {im}}^{i}\right) \) and a prior depth term \(\varPsi _{dp}\left( \varvec{I}_{\text {dep}}^{0},\widetilde{\varvec{I}}_{\text {dep}}\right) \), combined as

$$\begin{aligned} \begin{aligned} \varPsi _{d}\left( \widetilde{\varvec{I}}_{\text {dep}},\varvec{I}_{\text {dep}}^{i},\widetilde{\varvec{I}}_{\text {im}},\varvec{I}_{\text {im}}^{i}\right) = \varPsi _{ds}\left( \widetilde{\varvec{I}}_{\text {dep}}\right) \varPsi _{dd}\left( \widetilde{\varvec{I}}_{\text {dep}},\varvec{I}_{\text {dep}}^{i},\widetilde{\varvec{I}}_{\text {im}},\varvec{I}_{\text {im}}^{i}\right) \varPsi _{dp}\left( \varvec{I}_{\text {dep}}^{0},\widetilde{\varvec{I}}_{\text {dep}}\right) . \end{aligned} \end{aligned}$$
(16)

Data Term. Based on our assumption that similar images should have similar depth maps, we use similar [22] candidate images to infer the depth map \(\widetilde{\varvec{I}}_{\text {dep}}\). We argue that this "similarity" should hold not only for the original RGB images but also for their gradients. When comparing pixels in \(\varvec{I}_{\text {im}}^{0}\) and \(\varvec{I}_{\text {im}}^{i}\), the more similar they are, the smaller the weight they receive. We then formulate \(\varPsi _{dd}\left( \widetilde{\varvec{I}}_{\text {dep}},\varvec{I}_{\text {dep}}^{i},\widetilde{\varvec{I}}_{\text {im}},\varvec{I}_{\text {im}}^{i}\right) \) as

$$\begin{aligned} \begin{aligned} \varPsi _{dd}\left( \widetilde{\varvec{I}}_{\text {dep}},\varvec{I}_{\text {dep}}^{i},\widetilde{\varvec{I}}_{\text {im}},\varvec{I}_{\text {im}}^{i}\right)&= \prod _{i=1}^{N} exp ( \left\| \varvec{W}_{i}\left( \widetilde{\varvec{I}}_{\text {dep}}-\varvec{I}_{\text {dep}}^{i} \right) \right\| + \alpha \left\| \varvec{W}_{i}\left( \varvec{\nabla }_{x} \widetilde{\varvec{I}}_{\text {dep}}-\varvec{\nabla }_{x} \varvec{I}_{\text {dep}}^{i}\right) \right\| \\ +&\alpha \left\| \varvec{W}_{i}\left( \varvec{\nabla }_{y} \widetilde{\varvec{I}}_{\text {dep}}-\varvec{\nabla }_{y} \varvec{I}_{\text {dep}}^{i}\right) \right\| + \beta \left\| \varvec{W}_{i} \left( \varvec{\nabla }_{x}^{2} \widetilde{\varvec{I}}_{\text {dep}} - \varvec{\nabla }_{x}^{2} \varvec{I}_{\text {dep}}^{i}\right) \right\| \\ +&\beta \left\| \varvec{W}_{i} \left( \varvec{\nabla }_{y}^{2} \widetilde{\varvec{I}}_{\text {dep}} - \varvec{\nabla }_{y}^{2} \varvec{I}_{\text {dep}}^{i}\right) \right\| + \beta \left\| \varvec{W}_{i}\left( \varvec{\nabla }_{x,y} \widetilde{\varvec{I}}_{\text {dep}}-\varvec{\nabla }_{x,y} \varvec{I}_{\text {dep}}^{i}\right) \right\| \\ +&\beta \left\| \varvec{W}_{i}\left( \varvec{\nabla }_{y,x} \widetilde{\varvec{I}}_{\text {dep}}-\varvec{\nabla }_{y,x} \varvec{I}_{\text {dep}}^{i}\right) \right\| ) \end{aligned} \end{aligned}$$
(17)

where \(\varvec{W}_{i}\)Footnote 6 is the point-wise similarity diagonal matrix.Footnote 7

Smoothing Term. We encourage neighboring pixels to have smooth depth estimates. This is achieved in \(\varPsi _{ds}\left( \widetilde{\varvec{I}}_{\text {dep}}\right) \) by a self-adapting smoothing coefficient for each pair of adjacent pixels. When the features of adjacent pixels are similar, the smoothing coefficient of that pixel pair takes a low value, so that the pair is smoothed strongly; when the features of adjacent pixels are dramatically different, the smoothing coefficient becomes very high, so that the smoothing term loses its effect. We design the following term to capture this intuition.

$$\begin{aligned} \begin{aligned} \varPsi _{ds}\left( \widetilde{\varvec{I}}_{\text {dep}}\right) =&exp ( \alpha \left\| \varvec{W}_{0}\varvec{\nabla }_{x} \widetilde{\varvec{I}}_{\text {dep}}\right\| + \alpha \left\| \varvec{W}_{0}\varvec{\nabla }_{y} \widetilde{\varvec{I}}_{\text {dep}}\right\| + \beta \left\| \varvec{W}_{0}\varvec{\nabla }_{x}^{2} \widetilde{\varvec{I}}_{\text {dep}}\right\| \\ {}&+ \beta \left\| \varvec{W}_{0}\varvec{\nabla }_{y}^{2} \widetilde{\varvec{I}}_{\text {dep}}\right\| + \beta \left\| \varvec{W}_{0}\varvec{\nabla }_{x,y} \widetilde{\varvec{I}}_{\text {dep}}\right\| + \beta \left\| \varvec{W}_{0}\varvec{\nabla }_{y,x} \widetilde{\varvec{I}}_{\text {dep}}\right\| ) \end{aligned} \end{aligned}$$
(18)

where the first two terms in Eq. 18 are first-order gradient smoothing terms, which cover the four nearest pixel neighbours, while the remaining terms are second-order gradient smoothing terms, which cover a wider area. \(\varvec{\nabla }_{x},\varvec{\nabla }_{y},\varvec{\nabla }_{x}^{2},\varvec{\nabla }_{y}^{2},\varvec{\nabla }_{x,y},\varvec{\nabla }_{y,x}\) are the gradient operator matrices, \(\widetilde{\varvec{I}}_{\text {dep}}\) is a column vector, and \(\varvec{W}_{0}\) is the self-adapting smoothing control (diagonal) matrix.

Prior Term. We also require the estimated prior depth to take part in the depth consistency potential:

$$\begin{aligned} \begin{aligned} \varPsi _{dp}\left( \widetilde{\varvec{I}}_{\text {dep}},\varvec{I}_{\text {dep}}^{0}\right) =&exp \left( \gamma \left\| \widetilde{\varvec{I}}_{\text {dep}} - \varvec{I}_{\text {dep}}^{0} \right\| \right) \end{aligned} \end{aligned}$$
(19)

Compared with traditional methods [11, 26], our model contains no pre-trained parameters, which yields high generalization ability. Meanwhile, a larger neighbourhood is considered,Footnote 8 without increasing the time complexity. We show the proposed algorithm and the entire framework in Algorithm 1 and Fig. 1, respectively.
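The sketch below assembles the data, smoothing and prior terms and solves for the refined depth. It assumes squared L2 norms, under which minimizing the combined energy reduces to a sparse linear least-squares problem; the paper leaves the norm unspecified, and the use of scipy's lsqr solver, together with the gradient operators from the earlier sketch, is our own choice.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def refine_depth(I0_dep, cand_deps, W_list, W0, ops,
                 alpha=10.0, beta=0.1, gamma=0.5):
    """Global depth refinement (sketch, squared-L2 version of Eqs. 17-19).

    I0_dep:    flattened prior depth (length h*w) from Sect. 3.2.
    cand_deps: list of flattened candidate depth images I_dep^i.
    W_list:    per-candidate diagonal similarity matrices W_i (sparse).
    W0:        diagonal self-adapting smoothing matrix (sparse).
    ops:       gradient operators, e.g. from gradient_operators(h, w).
    """
    n = I0_dep.size
    weights = [("x", alpha), ("y", alpha),
               ("xx", beta), ("yy", beta), ("xy", beta), ("yx", beta)]
    rows, rhs = [], []
    # Data term (Eq. 17): match each candidate's depth and its gradients.
    for Wi, Ii in zip(W_list, cand_deps):
        rows.append(Wi)
        rhs.append(Wi @ Ii)
        for key, lam in weights:
            G = ops[key]
            rows.append(np.sqrt(lam) * (Wi @ G))
            rhs.append(np.sqrt(lam) * (Wi @ (G @ Ii)))
    # Smoothing term (Eq. 18): penalize the gradients of the estimate itself.
    for key, lam in weights:
        rows.append(np.sqrt(lam) * (W0 @ ops[key]))
        rhs.append(np.zeros(n))
    # Prior term (Eq. 19): stay close to the retrieved prior depth.
    rows.append(np.sqrt(gamma) * sp.identity(n, format="csr"))
    rhs.append(np.sqrt(gamma) * I0_dep)

    A = sp.vstack(rows, format="csr")
    b = np.concatenate(rhs)
    return lsqr(A, b)[0]   # refined depth, flattened
```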

5 Experiments

In this section, we report our experimental results on single image depth estimation for both outdoor and indoor scenes. We use the Make3D [26] range image dataset and the NYUv2 [29] Kinect dataset, as they are the largest openly available datasets at present.

5.1 Evaluation Protocols

For quantitative evaluation, we report errors obtained with the following error metrics, which have been extensively used in [11, 14, 17, 26].

  • Mean relative error \(\left( rel\right) \): \(\frac{1}{L}\sum _{i} \frac{|\hat{d}_{i}-d_{i}|}{d_{i}}\);

  • Mean log10 error \(\left( lg10\right) \): \(\frac{1}{L}\sum _{i} |\log _{10}\hat{d}_{i}-\log _{10}d_{i}|\);

  • Root mean squared error \(\left( rms\right) \): \(\sqrt{\frac{1}{L}\sum _{i} \left\| \hat{d}_{i}-d_{i}\right\| _{2}^{2}}\);

where \(d_{i}\) is the ground truth depth, \(\hat{d}_{i}\) is the estimated depth, and L denotes the total number of pixels in all the evaluated images.
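A minimal implementation of these metrics, directly following the definitions above, is given below; the optional mask argument reflects the C1/C2 evaluation criteria used in Sect. 5.2.

```python
import numpy as np

def depth_errors(d_hat, d, mask=None):
    """rel, lg10 and rms errors over the valid pixels.

    d_hat, d: arrays of predicted and ground-truth depth (same shape).
    mask: optional boolean array, e.g. d < 70 for the C1 criterion.
    """
    if mask is None:
        mask = np.ones_like(d, dtype=bool)
    d_hat, d = d_hat[mask], d[mask]
    rel = np.mean(np.abs(d_hat - d) / d)
    lg10 = np.mean(np.abs(np.log10(d_hat) - np.log10(d)))
    rms = np.sqrt(np.mean((d_hat - d) ** 2))
    return {"rel": rel, "lg10": lg10, "rms": rms}
```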

In the training stage, we select 10 images from the dataset that are similar [22] to the query image, and use the patches (\(7\times 7\) pixels, with a 3-pixel overlap) extracted from these similar images to train the RGB feature [26] dictionary and the depth dictionary simultaneously, both of size 1024. The balance parameter in Eq. 2 is set to 1. In the testing stage, we extract non-overlapping patches of the query image to infer the prior depth image. To optimize this prior depth image with Eq. 11, we fix the parameters of Eqs. 17, 18 and 19 to \(\gamma = 0.5\), \(\alpha = 10\) and \(\beta = 0.1\).
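For reference, the patch extraction used above can be sketched as follows; we read "\(7\times 7\) pixels, 3-pixel overlap" as a stride of \(7-3=4\) pixels, and non-overlapping test patches correspond to a stride equal to the patch size. Both the helper function and this reading are assumptions for illustration.

```python
import numpy as np

def extract_patches(img, patch=7, stride=4):
    """Extract flattened patches from an h x w (x channels) image.

    Returns a (patch*patch*channels) x n matrix, one patch per column;
    use stride=patch for the non-overlapping test-time setting.
    """
    img = np.atleast_3d(img)          # h x w x channels
    h, w, _ = img.shape
    cols = [img[r:r + patch, s:s + patch, :].reshape(-1)
            for r in range(0, h - patch + 1, stride)
            for s in range(0, w - patch + 1, stride)]
    return np.stack(cols, axis=1)
```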

5.2 Performance on Make3D Dataset

The Make3D dataset consists of 534 images with corresponding depth maps, split into 400 training images and 134 testing images. All images are resized to \(460 \times 345\) pixels. It is worth noting that this dataset was published a decade ago, so the resolution and distance range of the depth maps are rather limited (only \(55 \times 305\) pixels), and they contain noise at locations such as glass windows. These limitations have some influence on the training stage and the resulting error metrics. We therefore report errors under two different criteria in Table 1: (C1) errors are computed in the regions with ground-truth depth less than 70 meters; (C2) errors are computed over the entire image. We compare our method with state-of-the-art methods such as Make3D [26], Depth Transfer [11] and Semantic Labelling [17].

In Table 1, we present a quantitative comparison of depth estimation between our method and these competitors on representative images from the Make3D dataset. Table 1 demonstrates that, in most cases, our method outperforms the competing methods under both evaluation criteria. To make the results visible, we also show the depth predictions achieved by our method in Fig. 3. We can observe that our predictions are very close to the ground truth and much better than those obtained by the Make3D approach. To verify the validity of our method, we also compare it with the state of the art in the "Prior Depth Inference" and "Depth Optimization" stages, respectively (Tables 2 and 3). Finally, we show the influence of the parameters in Table 4.

Table 1. Result comparisons on the Make3D dataset. (C1) Errors are computed in the regions with ground-truth depth less than 70 meters; (C2) errors are computed over the entire image.

From Table 1, we can see that our method performs best on most of the metrics. Furthermore, comparing the "Error (C2)" criterion with "Error (C1)", our model achieves larger gains on distant objects than on near ones. Moreover, unlike the other methods, our model works well without pre-trained parameters [25, 26] or supplementary information [17].

We also compare our model with the state of the art at each stage. In Table 2 we assess the effectiveness of the "Prior Depth Inference" stage, evaluating our model with both a learned dictionary and a random dictionary. The results show that, even with a random dictionary, our model still outperforms state-of-the-art methods. Compared with Table 4, the random-dictionary performance is similar to that of a learned dictionary with a 5-pixel patch size. In Table 3, we assess the effectiveness of the "Entire image depth inference" stage using the same prior depth as [11]. From this table we can see that our model has a lower rms but a higher rel, which means our method is effective but slightly less stable.

Table 2. Result comparisons on the Make3D dataset without MRF fine-tuning.
Table 3. Result comparisons on the Make3D dataset with the same prior depth estimation and different MRF fine-tuning.

From Table 4, we can see that the patch size has a greater influence than the other two parameters. Generally speaking, mapping RGB to depth is an ill-posed problem: there may be many plausible depth patches for a given RGB patch, and the larger the patch, the more detail can be learned. However, due to the lack of adequate training images, the usable range of patch sizes is limited. For the same reason, the dictionary size has little effect, which can also be seen in Fig. 2, where the trained dictionaries contain many duplicated features.

Fig. 3. Examples of depth predictions on the Make3D dataset.

From Fig. 3 we can see that our method reproduces the depth maps well, especially in terms of preserving object shapes.

5.3 Performance on NYUv2 Dataset

The NYUv2 dataset contains 1449 images, of which 795 are used for training and 654 for testingFootnote 9. All images are resized to \(460 \times 345\) pixels in order to preserve the aspect ratio of the original images. In Table 5, we compare our method with state-of-the-art methods, including Make3D [26], Depth Transfer [11] and others.

As illustrated in Table 5, we present a quantitative comparison of depth estimation with these methods on representative images from the NYUv2 dataset, which demonstrates the superior performance of our method. To make the results visible, we also show qualitative results of our method in Fig. 4. The parameter settings of our method are the same as in Sect. 5.2. Since the experimental findings are similar to those in Sect. 5.2, and due to the page limit, we omit a detailed analysis here.

Table 4. Result comparisons on the Make3D dataset with different parameters. PatchSize is the size of the patches extracted for "Coupled Dictionary Learning", and DictionarySize is the capacity of the dictionaries \(\varvec{D}_{\text {im}}\) and \(\varvec{D}_{\text {dep}}\) in Eq. 2.
Fig. 4. Examples of depth predictions on the NYUv2 dataset.

Table 5. Result comparisons on the NYUv2 dataset.

5.4 Comparison with Deep Learning Methods

It is well known that deep learning methods have achieved remarkable results in many research areas, owing to a learning capacity greater than that of most traditional methods. Compared to deep learning, the proposed method has no advantage in model capability or complexity.

However, deep learning has a critical drawback: the training process usually takes a long time (weeks or even months), despite considerable efforts to alleviate this problem. Most deep neural networks also rely heavily on parameter tuning, with significant sensitivity to certain parameters such as the learning rate. This prevents deep learning approaches from being applied in scenarios that require frequent and agile updating.

In contrast, the proposed approach requires no traditional training stage and has few parameters. This greatly reduces the effort of adapting to a new dataset, making the approach more flexible and reliable. Since the two kinds of methods are designed for different scenarios, we do not conduct an experimental comparison.

6 Conclusion

In this paper, we propose a novel cross-modality retrieval method to estimate depth from a given 2D image. To the best of our knowledge, this is the first method to estimate depth by cross-modality retrieval. To solve the cross-modality problem, we propose a novel and effective coupled dictionary learning method. Based on the local depth estimates obtained from cross-modal retrieval with the learned dictionaries, we further refine the depth of the entire image by solving a convex optimization problem. From the depth estimation results (Figs. 3 and 4), we can see that fine details are not well preserved by our method, because it depends heavily on the candidate images: when the candidates do not depict the same kind of scene as the query image, or when a "bad" image wins a high similarity score at the pixel level, our method does not work well. In the future, we plan to combine our model with deep learning or other methods to improve its robustness to real-world image transformations. Furthermore, we plan to improve performance by integrating semantic information from recent developments in CNN frameworks.