
1 Introduction

Depth super-resolution is one of the important research topics in image processing and computer vision. In practical applications, since depth information is always captured at a low resolution, depth images have to be interpolated to the full size of the corresponding texture images. For example, the resolution of the depth image captured by the SwissRanger SR4000 is only QCIF (\(176\times 144\) pixels). Even for Kinect, the resolution of the captured depth image is only \(640\times 480\) (\(512\times 424\) for Kinect v2), which is much lower than that of its corresponding color image (\(1920\times 1080\) for Kinect v2). Hence, interpolation and other image enhancement techniques are essential to improve the resolution and quality of depth images. For applications such as 3D viewpoint reconstruction, action recognition and object detection [28], high-resolution and high-accuracy depth information can help to improve the system performance.

Reconstructing High Resolution (HR) images from Low Resolution (LR) images is an ill-posed inverse problem [14, 36], and it is difficult to produce high-quality results. Nevertheless, depth super-resolution could be slightly easier because depth images have more homogeneous regions and more similar structures than natural images. Generally speaking, research on depth super-resolution can be divided into two categories: single depth super-resolution and depth super-resolution with multiple images. For single depth super-resolution, depth maps are directly interpolated to the full size of the corresponding color images, without other side information. Consequently, depth super-resolution is equivalent to general image super-resolution, and classical interpolation filters, including the bi-linear and bi-cubic filters, can be used. However, since filter-based methods rarely consider the properties of depth maps, i.e. the importance of edges, the performance of depth super-resolution is largely limited. Therefore, to preserve depth edges in the interpolation process, optimization-based methods which regard depth super-resolution as a Markov Random Field (MRF) or least squares optimization problem have been proposed. Kim et al. proposed a novel MRF-based depth super-resolution method taking the noise characteristics of the depth map into account [13]. Zhu et al. further extended the traditional spatial MRF by considering temporal coherence [39]. In [6], depth super-resolution was formulated as a convex optimization problem which utilizes anisotropic total generalized variation. Then, patch-based features in a depth map were employed to optimize depth super-resolution. [9] proposed to exploit self-similar patches in the rigid body to reconstruct high-resolution depth maps. In [16], depth edges were preserved in the interpolation process by adding geometric constraints from self-similar structures. In [4, 7, 11, 33, 35], sparse representations of depth image patches were introduced by imposing a locality constraint. Unfortunately, the performance of the above methods could be limited by the failure to establish patch correspondences, either from an external dataset or within the same depth map, which leads to edge artifacts between patches or incorrect depth pattern estimation.

To eliminate the edge artifacts after the depth super-resolution operation, the corresponding color images are utilized as geometric constraints [20]. A classical method is to apply a bilateral filter to enhance depth quality [24, 30], in which color information is jointly utilized as the weights of the bilateral filter. In [3] and [32], color images were directly used to guide the depth image super-resolution. In [17] and [18], the edge and structure similarity between depth images and color images is considered for depth image up-sampling. Park et al. extended nonlocal mean filtering with an edge weighting term in [23]. Xie et al. proposed an edge-guided depth super-resolution method which produces sharp edges [34]. Then, Yang et al. proposed to use multiple views to assist depth super-resolution in [37]. Choi et al. proposed a region segmentation based method to tackle texture-transfer and depth-bleeding artifacts in [2]. Recently, some convolutional neural network based depth super-resolution methods have also been proposed to learn the texture-depth mapping [19, 31]; however, these methods still need to use classical interpolation methods to obtain the HR depth in the first stage, ignoring the relations between the LR depth and the HR texture. Most importantly, only the depth distortion is considered, similar to learning-based super-resolution methods.

Fig. 1. Flowchart of learning-based image super-resolution.

In fact, for most depth based applications, depth images are generally not provided for viewing but for enhancing the applicability. For instance, in the 3D video framework, depth images are used to assist virtual view synthesis instead of being watched by users. Hence, integrating the view synthesis quality into the depth super-resolution problem is necessary. Jin et al. designed a natural image super-resolution framework in which depth images are utilized to synthesize the image, and the synthesis artifacts are used as a criterion to guide the image super-resolution [12]. [10] introduces the difference between the color image and the synthesized image as a regularization term for depth super-resolution. In [17], the fractal dimension and texture-depth boundary consistencies are jointly considered in depth super-resolution.

In this paper, we present a depth super-resolution method based on the relations between HR and LR depth patches. Considering the sharpness of depth edges, the LR depth patches are first clustered into different edge-orientation classes based on their edge orientations. Here, the edge-orientation feature is extracted using our designed gradient operators, in which the edge strength and direction are employed as the basis for LR patch clustering. Then, for each edge-orientation class, a class-dependent linear mapping function is learned using LR-HR patch pairs. Moreover, the view synthesis distortion is integrated into the linear mapping learning process. Therefore, the depth super-resolution problem is formulated as a view synthesis distortion driven linear mapping learning optimization. Experimental results show that our proposed depth super-resolution method achieves superior performance for the synthesized virtual view compared with other depth super-resolution approaches.

The rest of the paper is organized as follows: Sect. 2 describes the proposed depth super-resolution framework in detail. Section 3 shows the settings of our experiments and the performance of our proposed approach. Finally, we conclude this paper in Sect. 4.

2 Methodology

Typically, learning-based image super-resolution aims to learn a linear mapping relation between LR-HR patch pairs. For example, as shown in Fig. 1,

$$\begin{aligned} \mathbf {x = My}, \end{aligned}$$
(1)

where \(\mathbf {y} \in \mathbb {R}^m\), \(\mathbf {x} \in \mathbb {R}^n\), \(m \le n\), and \(\mathbf {M}\) is a linear mapping operator.

For LR-to-HR conversions, the linear mapping between LR-HR pairs should be learned. Specifically, the LR image \(\mathbf {I_l}\) is denoted as

$$\begin{aligned} \mathbf {I_l} = \{l_i\}_{i=1}^{N}, \end{aligned}$$
(2)

where \(l_i\) is the \(i\)-th LR image patch and N is the total number of depth image patches in \(\mathbf {I_l}\). Similarly, the HR image is represented as

$$\begin{aligned} \mathbf {I_h} = \{ h_i \}_{i=1}^{N}. \end{aligned}$$
(3)

Then, the LR-HR patch pairs are classified into different classes based on a specified rule. Let U be the number of classes and \(N_{j}\) be the number of patches belonging to the \(j\)-th class, with \(\sum _{j=1}^U N_j = N\). For each class j, the linear mapping can be learned from an error minimization problem as follows [14]

$$\begin{aligned} M_j = \arg \min _{M_j} \sum _{i=1}^{N_{j}} \Vert h_i^j - M_j\cdot l_i^j\Vert _2^2 + \lambda \Vert M_j\Vert _F^2, \end{aligned}$$
(4)

where \(h_i^j\) and \(l_i^j\) are the \(i\)-th vectorized HR and LR patches belonging to the \(j\)-th class, and \(M_j\) is the mapping kernel of the \(j\)-th class. \(\Vert M_j\Vert _F^2\) is a regularization term with the Frobenius norm which prevents overfitting, and \(\lambda \) is a penalty factor which is empirically set to 1 in general. Therefore, learning-based image super-resolution is expressed as a multivariate regression problem. The goal of the regression is to minimize the Mean Squared Error (MSE) between the ground-truth HR patches and the patches interpolated from the corresponding LR patches,

$$\begin{aligned} J = \min \frac{1}{N_{j}} \sum _{i=1}^{N_j} \Vert h_i^j - M_j\cdot l_i^j\Vert _2^2. \end{aligned}$$
(5)
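For concreteness, the minimizer of (4) has the standard ridge-regression closed form. Below is a minimal sketch, assuming the vectorized HR and LR patches of one class are stacked as the columns of matrices `H` and `L`; the function name and matrix layout are illustrative and not taken from the paper.

```python
import numpy as np

def learn_mapping(H, L, lam=1.0):
    """Ridge-regression solution of (4) for a single class.

    H : (n, N_j) array whose columns are vectorized HR patches of class j.
    L : (m, N_j) array whose columns are the corresponding LR patches.
    Returns M_j of shape (n, m) minimizing ||H - M_j L||_F^2 + lam * ||M_j||_F^2.
    """
    m = L.shape[0]
    return H @ L.T @ np.linalg.inv(L @ L.T + lam * np.eye(m))
```

For example, with \(2\times 2\) LR patches (m = 4) and \(4\times 4\) HR patches (n = 16), the learned \(M_j\) is a \(16\times 4\) matrix that interpolates any vectorized LR patch of that class.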

Nevertheless, depth images are different from natural images: they are used to assist various applications, e.g. view synthesis, object recognition and action recognition. Hence, the goal of depth image super-resolution should be different from that of traditional image super-resolution. In this work, we assume that depth images are used for view synthesis. In the remainder of this paper, we address the depth image super-resolution problem in the framework of view synthesis.

2.1 Depth Patches Classification Based on Edge Orientation

Published super-resolution methods [1, 27, 29] use edge-orientation information to implement the LR-to-HR interpolation for texture images. However, since color images usually possess very complicated textures, the edge-orientation information is difficult to extract for patch clustering. Compared to texture images, depth images represent the distance between the camera and the objects in a scene and generally have more homogeneous regions and sharp edges, without much texture. Consequently, edge information in depth images is distinct and the corresponding features are easily extracted. Motivated by this observation, we design a new edge-orientation feature based on the above conventional learning-based super-resolution scheme to learn the mapping between LR patches and HR patches, which aims to preserve the depth edges in the LR-to-HR conversion process.

To find the edge orientation of LR depth image patches, we employ two simple gradient operators as

$$\begin{aligned} K_h = [1~ -1] \text { and } K_v = \left[ \begin{array}{c} 1\\ -1 \end{array}\right] \end{aligned}$$
(6)

where \(K_h\) and \(K_v\) indicate horizontal and vertical gradient operators, respectively. Here, considering that (5) calculates the pixel-level statistical error between the interpolated patches and the ground-truth patches, we take the pixel variations as the basis for patch classification. In theory, LR depth patches with similar gradient variations between adjacent-pixel pairs are likely to share similar linear mappings in LR-to-HR conversions.

For demonstration, let us take a \(2\times 2\) LR depth patch as an example, which is

$$\begin{aligned} P = \left[ \begin{array}{cc} p_{1,1} & p_{1,2}\\ p_{2,1} & p_{2,2} \end{array}\right] . \end{aligned}$$
(7)

The edge orientation is determined in terms of the edge strength and edge direction. Both operators \(K_h\) and \(K_v\) are applied to obtain the horizontal and vertical edge strengths, as

$$\begin{aligned} \begin{array}{c} g_h = K_h *P \\ g_v = K_v *P, \end{array} \end{aligned}$$
(8)

where \(*\) indicates the convolution operator, and \(g_h\) and \(g_v\) are the horizontal and vertical gradients, respectively. Then, the edge strength and edge direction can be computed as

$$\begin{aligned} \begin{array}{c} S = \sqrt{g_h^2 + g_v^2} \\ \phi = \tan ^{-1}(\frac{g_h}{g_v})+\frac{\pi }{2} \end{array} \end{aligned}$$
(9)

where S indicates the edge strength and \(\phi \) is the edge direction for the given LR depth patch.

To correctly distinguish the edges from the homogeneous parts of depth images, we set a threshold T to constrain the edge strength. When the edge strength is lower than T, the corresponding regions in the depth patch are regarded as homogeneous regions. Then,

$$\begin{aligned} S = \left\{ \begin{array}{cc} S & \text {if } S > T \\ 0 & \text {otherwise} \end{array}\right. \end{aligned}$$
(10)

The edge direction in (9) takes values from 0 to \(2\pi \). Note that each edge direction and its opposite direction can be regarded as the same edge orientation. Therefore, we map the direction information to the range \([0,\pi ]\), obtaining

$$\begin{aligned} \hat{\phi }=\left\{ \begin{array}{cc} \phi & 0 \le \phi< \pi \\ \phi -\pi & \pi \le \phi < 2\pi \end{array}\right. \end{aligned}$$
(11)

Finally, the edge orientation feature can be represented using the following formula

$$\begin{aligned} \mathbf {\Phi } = S e^{j\hat{\phi }}. \end{aligned}$$
(12)

The feature points calculated for the given LR depth patches are then clustered into different classes using K-means [8].
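As an illustration of (6)–(12) and the clustering step, the sketch below computes the edge-orientation feature of a \(2\times 2\) LR patch and clusters the features with K-means. How the convolution outputs of \(K_h\) and \(K_v\) are reduced to a single \(g_h\) and \(g_v\) per patch is not stated explicitly, so the mean of the adjacent-pixel differences is used here as an assumption; the threshold `T` and the class count `U` are likewise illustrative values.

```python
import numpy as np
from sklearn.cluster import KMeans

def edge_orientation_feature(patch, T=5.0):
    """Edge-orientation feature (12) of a 2x2 LR depth patch,
    returned as a real 2-vector (Re, Im) for K-means."""
    p = patch.astype(np.float64)
    g_h = np.mean(p[:, 0] - p[:, 1])           # K_h = [1 -1]: left minus right
    g_v = np.mean(p[0, :] - p[1, :])           # K_v = [1; -1]: top minus bottom
    S = np.hypot(g_h, g_v)                     # edge strength (9)
    phi = np.arctan2(g_h, g_v) + np.pi / 2     # edge direction (9)
    phi = np.mod(phi, 2.0 * np.pi)             # bring into [0, 2*pi)
    if S <= T:                                 # homogeneous region (10)
        S = 0.0
    if phi >= np.pi:                           # fold opposite directions (11)
        phi -= np.pi
    z = S * np.exp(1j * phi)                   # complex feature (12)
    return np.array([z.real, z.imag])

def cluster_patches(patches, U=16, T=5.0, seed=0):
    """Cluster LR patches into U edge-orientation classes with K-means."""
    feats = np.stack([edge_orientation_feature(p, T) for p in patches])
    km = KMeans(n_clusters=U, n_init=10, random_state=seed).fit(feats)
    return km.labels_, km.cluster_centers_
```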

2.2 Depth Super-Resolution in View Synthesis

The view synthesis technique is often employed to generate extra virtual viewpoints in a 3D video system [12]. In this framework, depth images are used to describe the distance between the camera and the objects in a scene. Based on the depth information, the virtual view images are synthesized by applying DIBR [5]. Consequently, depth images are only a sort of supplementary data for view synthesis rather than independent image data. The quality of the depth images does not linearly affect the quality of the synthesized view images, and the relation varies according to the corresponding texture image information, as mentioned in [21, 22]. Thereby, in the learning-based depth super-resolution problem, the goal of the regression should consider the role of depth images in view synthesis. Instead, the distortion of the synthesized view introduced by the possible depth distortion in the super-resolution process can be integrated, which is written as

$$\begin{aligned} \begin{array}{lll} SSD & = & \sum |V-\tilde{V}|^2 \\ & = & \sum |f_w(C,D) - f_w(C,\tilde{D})|^2, \end{array} \end{aligned}$$
(13)

where C and V indicate the texture image and its synthesized virtual view, respectively. D is the ground-truth full-size depth image, and \(\tilde{D}\) denotes the corresponding interpolated HR depth image. The virtual view is synthesized from C and D by the pre-defined warping function \(f_w\).
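As a rough illustration of (13), the sketch below uses a toy one-dimensional horizontal shift as the warping function \(f_w\), moving each texture pixel by a disparity proportional to its depth value. The real DIBR warping in VSRS performs sub-pixel rendering, occlusion handling and hole filling, so this is only a conceptual approximation.

```python
import numpy as np

def warp_view(C, D, alpha):
    """Toy horizontal warp f_w: shift each pixel by alpha * depth.
    No z-buffering or hole filling; later writes simply overwrite earlier ones."""
    H, W = C.shape
    V = np.zeros((H, W), dtype=np.float64)
    xs = np.arange(W)
    for y in range(H):
        tx = np.clip(np.round(xs - alpha * D[y]).astype(int), 0, W - 1)
        V[y, tx] = C[y, xs]
    return V

def ssd_between_views(C, D, D_tilde, alpha):
    """SSD of (13): distortion between views rendered with D and with D_tilde."""
    V = warp_view(C.astype(np.float64), D, alpha)
    V_tilde = warp_view(C.astype(np.float64), D_tilde, alpha)
    return np.sum((V - V_tilde) ** 2)
```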

Based on (5) and (13), the goal of the learning-based depth super-resolution problem can be expressed as a view synthesis distortion minimization problem, as

$$\begin{aligned} \begin{array}{lll} J & = & \sum \nolimits _{M} |V-\tilde{V}|^2 \\ & = & \sum \nolimits _{M} |f_w(C,D) - f_w(C,\tilde{D})|^2, \\ & \text {where} & \tilde{D} = M\cdot d \end{array} \end{aligned}$$
(14)

here, M denotes the learned mapping functions.

To further simplify this distortion, following [22], (13) can be approximately written as

$$\begin{aligned} \begin{array}{lll} SSD & = & \sum \nolimits _{\forall (x,y)} |f_w(C,D) - f_w(C,\tilde{D})|^2 \\ & \approx & \sum \nolimits _{\forall (x,y)} | C_{x,y} - C_{x-\triangle p(x,y),y} |^2 , \end{array} \end{aligned}$$
(15)

where \((x,y)\) represents the pixel position, and \(\triangle p\) denotes the translational rendering position, which has been proven to be proportional to the depth image error

$$\begin{aligned} \triangle p(x,y) = \alpha \cdot (D_{x,y} - \tilde{D}_{x,y}), \end{aligned}$$
(16)

where \(\alpha \) is a proportional coefficient determined by the following equation

$$\begin{aligned} \alpha = \frac{f\cdot L}{255} \cdot \left( \frac{1}{Z_{near}}-\frac{1}{Z_{far}}\right) \end{aligned}$$
(17)

here, f is the focal length and L is the baseline between the current view and the synthesized view. \(Z_{near}\) and \(Z_{far}\) are the values of the nearest and the farthest depth of the scene, respectively. Therefore, (14) can be further simplified according to [22] as

$$\begin{aligned} J \approx \sum \limits _{\forall (x,y)} \left[ |\triangle p(x,y)|\frac{|C_{x,y}-C_{x-1,y}|+|C_{x,y}-C_{x+1,y}|}{2}\right] ^2. \end{aligned}$$
(18)
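The following sketch computes \(\alpha\) from (17) and the approximated view-synthesis distortion of (16)–(18) directly from the texture gradients; handling the image borders with `np.roll` (wrap-around) is a simplification.

```python
import numpy as np

def alpha_coefficient(f, L, z_near, z_far):
    """Proportionality coefficient (17) between depth error and rendering offset."""
    return (f * L / 255.0) * (1.0 / z_near - 1.0 / z_far)

def approx_synthesis_distortion(C, D, D_tilde, alpha):
    """Approximated view-synthesis distortion J of (18)."""
    C = C.astype(np.float64)
    dp = alpha * (D.astype(np.float64) - D_tilde.astype(np.float64))   # delta p, (16)
    left = np.abs(C - np.roll(C, 1, axis=1))     # |C(x,y) - C(x-1,y)|
    right = np.abs(C - np.roll(C, -1, axis=1))   # |C(x,y) - C(x+1,y)|
    A = 0.5 * (left + right)                     # local horizontal texture gradient
    return np.sum((np.abs(dp) * A) ** 2)
```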

Finally, to learn the linear mapping from the LR examples to the HR examples for depth images, (4) can be rewritten based on (18) as

$$\begin{aligned} \begin{array}{ll} M_j = \arg \min \limits _{M_j} & \Vert \left[ \alpha (D_i^j - M_j d_i^j)\right] \frac{|C_{x,y}-C_{x-1,y}|+|C_{x,y}-C_{x+1,y}|}{2} \Vert ^2_2\\ & + \lambda \Vert M_j\Vert _F^2 , \end{array} \end{aligned}$$
(19)

where \(D_i^j\) denotes the HR depth patches belonging to class j, and \(d_i^j\) denotes the corresponding LR depth patches of class j. This is known as multivariate regression, and according to [38], this optimization problem can be approximately solved as

$$\begin{aligned} M_j = \alpha ^2 A^TA D_i^j{d_i^j}^T \left( d_i^j{d_i^j}^T+\lambda \mathbf {I}\right) ^{-1}, \end{aligned}$$
(20)

where \(A = \frac{|C_{x,y}-C_{x-1,y}|+|C_{x,y}-C_{x+1,y}|}{2}\) and \(\mathbf {I}\) is the identity matrix. Based on (20), the linear mapping can be learned off-line and used to reconstruct the HR patches of class j. The complete training process is summarized in Algorithm 1.

Algorithm 1. Training of the class-dependent linear mappings.
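A sketch of the closed-form update (20) used in the training stage. The texture weight A is treated here as a diagonal matrix of per-pixel gradient weights over the HR patch positions, and how A is aggregated over the patches of one class is left implicit in the text, so this reduction is an assumption.

```python
import numpy as np

def learn_weighted_mapping(D_j, d_j, A_diag, alpha, lam=1.0):
    """Closed-form class mapping following (20).

    D_j    : (n, N_j) array of vectorized ground-truth HR depth patches of class j.
    d_j    : (m, N_j) array of the corresponding vectorized LR depth patches.
    A_diag : length-n vector of texture-gradient weights
             (|C(x,y)-C(x-1,y)| + |C(x,y)-C(x+1,y)|) / 2 at the HR patch positions.
    """
    m = d_j.shape[0]
    A = np.diag(np.asarray(A_diag, dtype=np.float64))    # diagonal weighting (assumption)
    G = d_j @ d_j.T + lam * np.eye(m)                    # regularized LR Gram matrix
    return (alpha ** 2) * (A.T @ A) @ D_j @ d_j.T @ np.linalg.inv(G)
```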

Based on the learned linear mappings \(M_j\) for each class j, the given LR depth images for testing are first divided into a set of LR patches of size \(2\times 2\). Then, using (9) and (12), the edge-orientation feature of each patch \(\mathbf {\Phi _p}\) is calculated and matched with the cluster centers \(\mathbf {\Phi _c}\). The matching procedure for the edge-orientation class can be described as searching for the minimal distance between the given LR depth patch and each cluster center, with the distance metric

$$\begin{aligned} d = \sin (|\mathbf {\Phi _p} - \mathbf {\Phi _c}|), \end{aligned}$$
(21)

which is based on the sine of the local angular distance. Finally, the corresponding linear mapping can be found. The super-resolution phase is summarized in Algorithm 2.

Algorithm 2. Super-resolution of LR depth images using the learned mappings.
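A minimal sketch of this super-resolution phase, reusing the `edge_orientation_feature` helper sketched in Sect. 2.1. The non-overlapping \(2\times 2\) patch layout, the row-major patch vectorization and the literal use of the sine distance (21) in the 2-D feature space are assumptions for illustration.

```python
import numpy as np

def super_resolve(lr_depth, mappings, centers, patch=2, scale=2, T=5.0):
    """Interpolate an LR depth image with the learned class mappings (Algorithm 2).

    mappings : list of M_j, each of shape ((patch*scale)**2, patch**2).
    centers  : K-means cluster centres in the same 2-D feature space as
               edge_orientation_feature().
    """
    H, W = lr_depth.shape
    hr = np.zeros((H * scale, W * scale), dtype=np.float64)
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            p = lr_depth[y:y + patch, x:x + patch]
            feat = edge_orientation_feature(p, T)
            d = np.sin(np.linalg.norm(centers - feat, axis=1))   # distance (21)
            j = int(np.argmin(d))                                 # best-matching class
            hp = mappings[j] @ p.reshape(-1, 1)                   # LR -> HR via M_j
            hr[y * scale:(y + patch) * scale,
               x * scale:(x + patch) * scale] = hp.reshape(patch * scale, patch * scale)
    return hr
```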

3 Experimental Results

In this section, the proposed depth super-resolution method is compared with three other depth super-resolution methods: a filter-based method, the joint bilateral up-sampling algorithm (JBU) [15]; a guidance-information-assisted method, the color-based depth up-sampling method (CBU) [32]; and a learning-based method, the edge-guided depth super-resolution method (EDU) [34]. To train the linear mappings, the depth images from 17 image pairs in the Middlebury Stereo dataset [25] are used. Each image pair consists of two views (left and right views, with the corresponding texture and depth image pairs) taken under several different illuminations and exposures. For testing, realistic depth images from the MPEG Standardization Test Dataset are used to evaluate the performance of depth super-resolution, including “Newspaper”, “Balloons”, “Kendo”, “Dancer”, “Poznan\(\_\)hall2” and “Poznan\(\_\)street”. The details of the test sequences are shown in Table 1. For both training and testing, the depth images are down-sampled with a scale factor of 2 using the “Bicubic” filter. The results are evaluated in PSNR for quality assessment. To evaluate the view synthesis performance, the given depth images from two different views are first down-sampled and then up-sampled using different depth super-resolution methods, prior to view synthesis. The standard software VSRS 3.5 [26] is employed to generate the synthesized views using the interpolated depth images and the corresponding texture images. Moreover, the ground-truth depth images are used as reference.
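A sketch of the evaluation protocol just described: down-sampling by a factor of 2 with a bicubic filter and measuring PSNR against the ground truth. OpenCV's bicubic resize is assumed here as the “Bicubic” filter, and the peak value is taken as 255.

```python
import cv2
import numpy as np

def downsample_bicubic(depth, factor=2):
    """Down-sample a depth image by `factor` with bicubic interpolation."""
    h, w = depth.shape
    return cv2.resize(depth, (w // factor, h // factor), interpolation=cv2.INTER_CUBIC)

def psnr(reference, test, peak=255.0):
    """PSNR in dB between the ground truth and the up-sampled depth (or synthesized view)."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```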

Table 1. Details of test dataset.

For quantitative evaluation, we first evaluate the depth super-resolution results on the test dataset. Table 2 lists the objective quality of depth super-resolution for each view in the test dataset. As reported in Table 2, the objective quality of the proposed depth super-resolution is limited, because the designed target function (19) does not aim to minimize the distortion between the up-sampled depth images and the ground-truth ones, as in [34]. However, the PSNR values of the up-sampled depth images obtained with the proposed method are still close to those of the other baselines. When evaluating the synthesized view quality, as shown in Table 3, the proposed depth super-resolution method performs much better than the other three methods. Compared with JBU and EDU, which both utilize edge information to guide depth super-resolution without employing color information, the average PSNR gain in synthesis quality is nearly 2 dB. Considering the synthesis distortion as in (18), the color information should be taken into account. Thereby, the CBU method shows good performance in synthesis quality, but its average PSNR is still about 1.2 dB lower than that of the proposed method.

Table 2. Objective quality of depth super-resolution.
Table 3. Objective quality of synthesized views by using interpolated depth images with scale factor 2.
Fig. 2. The comparison of visual results of depth images: (a) Newspaper [34]; (b) Newspaper by proposed; (c) Balloons [15]; (d) Balloons by proposed; (e) Kendo [32]; (f) Kendo by proposed; (g) Dancer [34]; (h) Dancer by proposed; (i) Poznan_hall2 [32]; (j) Poznan_hall2 by proposed; (k) Poznan_street [15]; (l) Poznan_street by proposed.

Fig. 3. The comparison of visual results of synthesized view for Newspaper: (a) EDU [34]; (b) Proposed depth super-resolution.

Fig. 4. The comparison of visual results of synthesized view for Balloons: (a) EDU [34]; (b) Proposed depth super-resolution.

Fig. 5. The comparison of visual results of synthesized view for Dancer: (a) EDU [34]; (b) Proposed depth super-resolution.

We also evaluate our proposed method visually in Figs. 2, 3, 4 and 5. The depth visual results are shown in Fig. 2. Note that not all depth images interpolated by the baseline methods are shown in Fig. 2 due to space limitations. Referring to Table 2, we select several depth images generated by the baseline methods to compare with those generated by our proposed method. Visually, the proposed method focuses on the transition regions between foreground objects and the background, which means that not all edges are preserved in the super-resolution process. In comparison, JBU [15], CBU [32] and EDU [34] introduce edge guidance information from the texture images or the depth image itself to optimize the depth super-resolution; thereby, the texture/depth edges are sharpened. Moreover, Figs. 3, 4 and 5 show the views synthesized using the interpolated depth images, and some details are shown with zoomed cropped regions. To clearly distinguish the differences among the synthesized views, we select the visual results based on Table 3. The EDU [34] method has the best objective quality, so the subjective comparison is mainly between EDU [34] and our proposed method. The red circles in Figs. 3, 4 and 5 highlight the comparison regions between the EDU method [34] and the proposed method.

4 Conclusion

In this paper, we present a depth super-resolution method based on the linear mapping relations between HR and LR depth patch pairs. Motivated by the idea that depth images are not directly watched by viewers but are used to assist different vision tasks, we convert the traditional super-resolution problem into a view-synthesis-driven depth super-resolution optimization. We design an edge-orientation-feature-based learning method to learn the possible linear mappings, and interpolate the LR depth image to its HR version by utilizing the learned mappings. On a realistic test dataset, our proposed method generates synthesized views with competitive quality in terms of PSNR compared to the other depth super-resolution methods.