
1 Introduction

In recent years, light field imaging [5] has become one of the most widely used methods for capturing the 3D appearance of a scene. It measures the spatial and angular variations in the intensity of light [18]. In the early years, light field capture required expensive hardware such as multi-camera arrays [30]. Although commercial and industrial light field cameras such as Lytro [6] and RayTrix [1] have recently been introduced, they still suffer from limited sensor resolution, which hampers their widespread appeal and adoption: light field cameras must trade off spatial resolution against angular resolution.

In this paper, we propose a highly accurate multi-view super-resolution method based on the light-field system introduced in [28]. The key idea of the proposed synthesis is to combine a CNN with a patch-based method so that the correspondence between images in different views is fully exploited. The patch-based method finds textural similarity between the high-resolution image and the low-resolution image. However, in challenging scenes such as those with large parallax, the correspondence between the high-resolution and low-resolution images is much weaker. As a result, the patch-based method cannot handle object edges and specularities well, and produces ghosting and blurring. Instead of relying on correspondence between views, VDSR (Very Deep convolutional networks for Super-Resolution [19]) super-resolves an image in a single view and fully extracts the features of the low-resolution image, so it can compensate for the shortcomings of the patch-based method. Our method combines the two and takes advantage of both: it exploits the similarity between views while also handling challenging scenes effectively.

Our key technical contributions are: (a) the proposed method can handle challenging scenes, especially those with large parallax; (b) the method is more accurate than existing methods; (c) the super-resolution process is simple and effective, resulting in a lower time cost. Experimental results demonstrate that the proposed method markedly improves the quality of the reconstructed high-resolution light field.

2 Related Work

Light fields [5] provide an additional angular dimension, enabling various visual applications such as light field display [29] and light field microscopy [21]. Many recent works try to capture or synthesize high-quality light fields from different types of input data. Light field super-resolution, which aims to improve the spatial resolution of light fields, is an active topic among these works.

2.1 Single Image Super-Resolution Method Using CNNs

Image super-resolution using deep convolutional networks was first introduced in [14]. The method differs fundamentally from existing external example-based approaches in that it does not explicitly learn dictionaries [31] or manifolds [4] for modeling the patch space. The Super-Resolution Convolutional Neural Network (SRCNN) is a representative deep-learning-based SR approach and has achieved large improvements in accuracy. Kim et al. [19] proposed a simple yet effective training procedure, called VDSR, that learns only the residuals and outperforms SRCNN in accuracy. We train our model using the algorithm described in [19].

We note that VDSR exploits contextual information spread over large image regions and outperforms patch-based algorithms at object edges and in regions of complex texture. However, VDSR operates on a single image and cannot fully use the information available in different views.

2.2 Light Field Super-Resolution

The main disadvantage of a single-camera light field is its low spatial resolution. Several methods have recently been proposed to restore the high-frequency information.

Spatial Super-Resolution. To increase the spatial resolution, Bishop et al. [7] proposed estimating both a high-resolution depth map and the light field in a Bayesian framework under a Lambertian textural prior. Patch-matching-based techniques are widely used in image processing, for example in texture synthesis [15], image completion [25], denoising [8], deblurring [12] and image super-resolution [16, 17]. Wanner and Goldluecke [28] introduced a hybrid imaging system using a patch-based algorithm. Cho et al. [11] explicitly model the calibration pipeline of Lytro cameras and propose a learning-based interpolation method to obtain higher spatial resolution. However, the quality of the recovered light field images is not as good as that of the input high-resolution images: spatial high-frequency details are lost in the super-resolved images.

Angular Super-Resolution. To reconstruct novel views from sparse angular samples, some methods require the input to follow a specific pattern, or to be captured in a carefully designed way. For example, the work by Levin and Durand [20] takes a 3D focal stack sequence and reconstructs the light field using a prior based on the dimensionality gap. Shi et al. [24] leverage sparsity in the continuous Fourier spectrum to reconstruct a dense light field from a 1D set of viewpoints. Marwah et al. [23] propose a dictionary-based approach to reconstruct light fields from a coded 2D projection.

2.3 Hybrid Imaging

The idea of hybrid imaging was proposed in the context of motion deblurring [3], where a low-resolution high-speed video camera co-located with a high-resolution still camera was used to deblur the blurred images. Building on that work, hybrid imaging has found utility in several applications. Cao et al. [9] propose a co-located hybrid imaging system consisting of an RGB video camera and a low-resolution multi-spectral camera to produce high-resolution multi-spectral video. Another example is the virtual view synthesis system proposed by Tola et al. [26], which uses four regular video cameras and a time-of-flight sensor. They show that adding the time-of-flight camera yields better-quality virtual views than a camera array of similar sparsity alone. Wang et al. [27] introduced another light-field attachment, combining a DSLR with eight surrounding low-quality cameras. They improve the accuracy of the super-resolved images, but the synthesis algorithm is complex and requires several iterations, which limits its speed.

Accordingly, our method integrates the patch-based method with VDSR, exploiting the advantages of both techniques.

3 Proposed Method

This section introduces the proposed patch-based method integrated with convolutional networks for super-resolution of the side view images. The configuration is outlined in Fig. 1. The basic idea is to use the patch-based method to correct the errors of the image super-resolved by VDSR.

We consider an input of two images: a high-resolution image (the reference image, denoted Ref) and a low-resolution image (denoted Src). The two images show the same scene from two different views, and the distance between the two views is 10 pixels in the light field.

Fig. 1. Overview of the proposed patch-based method integrated with a CNN.

3.1 Compute Initial Error

In this step, we aim to compute an error map representing the error of the image super-resolved by VDSR. The scaling factor in our experimental setup is large, and single-image super-resolution alone cannot handle such a large factor, so we bring in the reference high-resolution image.

By down-sampling Ref by a factor of N, we obtain the image \(R_{low}\), which is the same size as Src. Note that the factor N is the ratio of the size of Ref to the size of Src. We then super-resolve \(R_{low}\) by a factor of N using VDSR and denote the result \(R_{high}\). The initial error map is obtained by subtracting \(R_{high}\) from Ref:

$$\begin{aligned} R_{error}=Ref-R_{high} \end{aligned}$$
(1)

The very deep convolutional network (VDSR) is inspired by [13]. This residual-learning network converges much faster than a standard CNN and gives a significant boost in performance.

We note that the residual learning inside the network does not conflict with our error map between the high-resolution and low-resolution images: residual learning is not the means by which we obtain the target results, but simply a more efficient way to train the network.
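To make this step concrete, here is a minimal Python sketch, assuming a `vdsr_super_resolve(img, scale)` wrapper around a trained VDSR model (a hypothetical name, not from the authors' code) and bicubic resampling via OpenCV.

```python
import cv2
import numpy as np

def compute_initial_error(ref, vdsr_super_resolve, n):
    """Compute R_error = Ref - R_high (Eq. 1) at the reference view."""
    h, w = ref.shape[:2]
    # Down-sample Ref by the factor N to the resolution of Src.
    r_low = cv2.resize(ref, (w // n, h // n), interpolation=cv2.INTER_CUBIC)
    # Super-resolve R_low back to the original size (hypothetical VDSR wrapper).
    r_high = vdsr_super_resolve(r_low, scale=n)
    # The error map captures the high-frequency detail VDSR fails to recover.
    return ref.astype(np.float32) - r_high.astype(np.float32)
```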

3.2 Patch-Based Estimation

From the first step we have the error map between \(R_{high}\) and Ref. In this step, we apply the patch-based method to this error map (denoted \(R_{error}\)) at the view of Ref. We adopt a patch-match-based super-resolution method that improves on the algorithm in [28]. We first build the dictionary \(D_{error}\) from patches extracted from the error map \(R_{error}\). We then extract patches from \(R_{high}\) to build the dictionary \(D_{high}\). Low-resolution features are computed from each patch in \(D_{high}\) by down-sampling by a factor of N and applying first- and second-order derivative filters. These low-resolution features are stored in the dictionary \(D_{low}\).

Gradient information can be incorporated into patch-matching algorithms to improve accuracy when searching for similar patches. Chang et al. [10] use first- and second-order derivatives as features to facilitate matching. Our PatchMatch-based method likewise uses first- and second-order gradients as the features extracted from the low-resolution patches. The four 1-D gradient filters used to extract the features are:

$$\begin{aligned} {{g}_{1}}=\left[ -1,0,1 \right] , g_2=g_1^T \end{aligned}$$
(2)
$$\begin{aligned} {{g}_{3}}=\left[ 1,0,-2,0,1 \right] , g_4=g_3^T \end{aligned}$$
(3)

where the superscript “T” denotes transpose. For a low-resolution patch l, the filters \(\left\{ {{g}_{1}},{{g}_{2}},{{g}_{3}},{{g}_{4}} \right\} \) are applied and the feature \(f_l\) is represented as the concatenation of the vectorized filter outputs.
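As an illustration, the feature extraction of Eqs. (2) and (3) can be sketched as follows; the function name is ours, and SciPy's `correlate1d` stands in for whatever filtering routine the original implementation uses.

```python
import numpy as np
from scipy.ndimage import correlate1d

G1 = np.array([-1.0, 0.0, 1.0])            # g1: first-order derivative
G3 = np.array([1.0, 0.0, -2.0, 0.0, 1.0])  # g3: second-order derivative

def patch_feature(patch):
    """Concatenate the four vectorized gradient responses into f_l."""
    patch = patch.astype(np.float32)
    responses = [
        correlate1d(patch, G1, axis=1),  # g1 applied along rows
        correlate1d(patch, G1, axis=0),  # g2 = g1^T, along columns
        correlate1d(patch, G3, axis=1),  # g3 along rows
        correlate1d(patch, G3, axis=0),  # g4 = g3^T, along columns
    ]
    return np.concatenate([r.ravel() for r in responses])
```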

To super-resolve Src, the features \(f_{j}\), computed from each patch \(l_j\) of Src, are used for matching. The 9 nearest neighbors in \(D_{low}\) with the smallest \(L_2\) distance from \(f_{j}\) are found. These 9 nearest neighbors in \(D_{low}\) (denoted {\(f_{ref,k}^{j}\)}\(_{k=1}^{9}\)) correspond to 9 HR patches in \(D_{high}\), which in turn map to 9 error patches in \(D_{error}\) (denoted {\(e_{ref,k}^{j}\)}\(_{k=1}^{9}\)). The reconstruction weights, motivated by [28], are then calculated, and the estimated error patch \(\hat{e}_j\) corresponding to \(l_{j}\) is given by:

$$\begin{aligned} \hat{e}_j=\frac{\sum _{k=1}^9 w_{k}\,e_{ref,k}^{j}}{\sum _{k=1}^9 w_{k}},\quad w_{k}=\exp \left( \frac{-{\Vert f_j-f_{ref,k}^j\Vert }^2}{2{\sigma }^2}\right) \end{aligned}$$
(4)

We thus obtain an error image (denoted \(S_{error}\)) at the view of Src as the sum of similarity-weighted error patches from the dictionary \(D_{error}\). We follow the same parameter settings as [28]: the high-resolution patch size is \(64\times 64\), and the low-resolution patch size is determined by the factor N. \(S_{error}\), which indicates the error of the VDSR method at the view of Src, has the same size as the high-resolution image Ref.
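A brute-force sketch of this estimation step is given below, assuming the dictionaries are stored as NumPy arrays (rows of `d_low` are feature vectors and `d_error[i]` is the matching error patch); the names and the exhaustive nearest-neighbor search are illustrative, not the authors' implementation.

```python
import numpy as np

def estimate_error_patch(f_j, d_low, d_error, sigma, k=9):
    """Blend the k nearest error patches with the Gaussian weights of Eq. (4)."""
    # L2 distance from the query feature to every dictionary feature.
    dists = np.linalg.norm(d_low - f_j, axis=1)
    nn = np.argsort(dists)[:k]                        # 9 nearest neighbors
    w = np.exp(-dists[nn] ** 2 / (2.0 * sigma ** 2))  # Eq. (4) weights
    # Similarity-weighted average of the corresponding error patches.
    return np.tensordot(w, d_error[nn], axes=1) / w.sum()
```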

3.3 Integrated Super-Resolution

We now have the error map \(S_{error}\) at the view of Src, which indicates the error of the VDSR method. In this step, we integrate the two methods to fully use the correspondence between the images from different views. First, Src is super-resolved by VDSR; we denote the result \(S_{cnn}\). We then add \(S_{cnn}\) and \(S_{error}\). In this way, the deficiencies of the VDSR super-resolved image are compensated by \(S_{error}\), yielding the final result.
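The integration itself reduces to a single addition, sketched below with the same hypothetical `vdsr_super_resolve` wrapper as before; the clipping range assumes 8-bit images.

```python
import numpy as np

def integrated_super_resolve(src, s_error, vdsr_super_resolve, n):
    """Final result: VDSR output at the Src view plus the estimated error map."""
    s_cnn = vdsr_super_resolve(src, scale=n).astype(np.float32)
    # S_error restores the detail that single-view VDSR cannot recover.
    return np.clip(s_cnn + s_error, 0.0, 255.0)
```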

Here we explain why we compute \(S_{error}\) during the synthesis. The patch-based method finds textural similarity between the HR image and the LR image, and the super-resolved image is a sum of HR patches. In scenes with large parallax, the edges of objects differ considerably between views, so the patch-based method produces blur; specularities also cannot be restored well. VDSR can compensate for these shortcomings using single-view information. We combine the two methods to take advantage of both.

4 Experimental Results

We evaluate the performance of our proposed method for side views and dense light field rendering on the Stanford light field dataset [2] in several different scenes, including challenging ones with complex textures, specularity and large parallax.

Table 1. PSNR results at super-resolution scale \({\times }4\).
Table 2. PSNR results at super-resolution scale \(\times 8\).
Fig. 2. Super-resolution comparison between the three methods at scale \(\times \)4. From top to bottom: (a) ground truth, (b) VDSR, (c) patch-based method, (d) our method.

4.1 Experiment Setup

For the Stanford dataset, we select 9 views from each light field with a layout similar to the light-field attachment. To make the scene challenging, we select the side view image at distance \(d =10\) (in 8-adjacency) from the central view. We evaluate our method at two different scales, \(\times \)4 and \(\times \)8. The input low-resolution side view images are obtained by down-sampling each image by these two factors, and the original high-resolution images serve as ground truth. For patch-based super-resolution, we follow the same setup as [28]. For VDSR, we set the initial training parameters as in [13]. Finally, we also test on several microscope light field datasets, e.g. the one provided by Lin et al. [22].

Fig. 3. Super-resolution comparison between the three methods at scale \(\times \)8. From top to bottom: (a) ground truth, (b) VDSR, (c) patch-based method, (d) our method.

4.2 Super-Resolution Results

We evaluate our method on all light fields in the dataset [2]. The PSNR values of the patch-based super-resolution images, the VDSR images and our super-resolution images for several scenes are listed in Tables 1 and 2. The PSNRs of our method are higher than those of the patch-based method and VDSR at both scales. This is because our method fully uses the correspondence between the images in different views and takes advantage of both kinds of synthesis.
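For reference, the PSNR used in Tables 1, 2 and 3 can be computed with the standard definition for 8-bit images; this is a generic sketch, not code from the paper.

```python
import numpy as np

def psnr(result, ground_truth, peak=255.0):
    """Peak signal-to-noise ratio in dB between a result and its ground truth."""
    diff = result.astype(np.float64) - ground_truth.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```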

Figures 2 and 3 illustrate some super-resolution patches cropped from the simulations. The patches from our method clearly contain better high-frequency details than those of the patch-based method: the patch-based method produces blurring, which our method alleviates.

Results on the microscope light fields are presented in Fig. 4. The top row shows the Cells dataset, which is highly cluttered; the bottom row shows the fly compound eye. These two typical microscope light fields are unstructured and dim, and the super-resolved results of our method are more similar to the ground truth (Table 3).

Fig. 4. Super-resolution comparison between the three methods on the microscope light fields (Cells and Eye) at scale \(\times \)4. From left to right: (a) ground truth, (b) VDSR, (c) patch-based method, (d) our method.

Table 3. PSNR results at super-resolution scale \(\times 4\) on the microscope light field datasets.

The run-time of our proposed algorithm is about 3 min per picture. The algorithm was implemented in C++ without optimization on a fourth-generation Intel i7 processor with 32 GB of RAM. Compared to the synthesis in [27], this is much faster.

5 Conclusion

In this work, we proposed a highly accurate multi-view super-resolution method for super-resolving images captured by a light field system. The core of our method is a combination of a patch-based algorithm and a convolutional neural network. Our method is more accurate than existing methods on challenging scenes containing complex texture, specularity and large parallax, while costing less time. Experimental results demonstrate that the proposed method markedly improves the quality of the reconstructed high-resolution light field.

In the future, we would like to exploit the natural properties of the light field to reach better super-resolved results. We would also like to extend the method to further applications, such as depth estimation and image sequence interpolation.