1 Introduction

Single image super-resolution (SISR) aims to recover a high-resolution (HR) image from an input low-resolution (LR) image via linear or nonlinear models. The SR problem arises in many practical applications, such as medical imaging and video applications [13, 21]. It is a classical problem in low-level computer vision and has attracted a great deal of research attention. In recent years, numerous approaches have been proposed to solve this problem. In general, SR algorithms can be divided into three categories: interpolation-based methods [8, 12], reconstruction-based methods [4, 14], and learning-based methods [16, 17, 21].

Interpolation-based methods [8, 12], such as bilinear and bicubic interpolation, are efficient but tend to generate oversmoothed images. Another class of SR approaches is based on reconstruction [4, 14]. These methods estimate an HR image by enforcing reasonable assumptions or prior knowledge on it. However, they do not reconstruct high-frequency image details very well [1].

The third and currently most popular category is the learning-based methods. These approaches usually assume that the high-frequency details lost in LR images can be predicted from information learned on a training set, which consists of a large number of LR and HR patch pairs. They attempt to capture the co-occurrence prior between LR and HR image patches. Inspired by compressed sensing, Yang et al. [21] adopted sparse representation to solve the SR problem. Timofte et al. [16] proposed an anchored neighborhood regression (ANR) method, which learns a sparse dictionary and uses its atoms as anchors for ridge regression; its refined variant, A+ [17], instead takes the neighborhood of each sparse dictionary atom from the full training pool of samples. Deep learning has also been adopted for SR: Dong et al. [5] proposed a super-resolution convolutional neural network (SRCNN), and Kim et al. [11] presented a very deep network for super-resolution.

According to the way training examples are extracted, learning-based SR methods can be split into two classes: one uses an external database of natural images [3, 5, 11, 16, 17, 20,21,22], and the other uses a database obtained from the input LR image itself [2, 6, 7].

External example-based methods assume that the mapping model between LR and HR image patches can be learned from an external database; most of the methods above belong to this class. Internal example-based methods assume that patches in a natural image tend to redundantly recur many times inside the image, both within the same scale and across different scales [7]. Bevilacqua et al. [2] generated a double pyramid of recursively scaled and interpolated images, thus building a dictionary from the input LR image itself.

External example-based and internal example-based SR methods both have their own advantages and disadvantages; for example, some features of medical endoscopic images cannot be well represented by the widely used training sets. Jointly training the model can therefore yield better medical image super-resolution results. Wang et al. [18] defined two loss functions, one using sparse-coding-based external examples and one using epitomic matching based on internal examples. Timofte [15] proposed a method that fuses A+ [17] and CSCN [19] into a new image feature and applies the anchor strategy for SR. However, both works adopt two different SR strategies and fuse the reconstructed HR image patches. In this paper, we propose a novel joint SR method that adaptively integrates the merits of both external- and internal-based SR. Moreover, we fuse the mapping matrices already in the training phase, obtaining fusion matrices.

The remainder of the paper is organized as follows: Sect. 2 details the universal fusion strategy for SR, Sect. 3 shows the experimental results, and Sect. 4 concludes.

2 Proposed Method

In external example-based SR methods, we cannot guarantee that every input image patch can be matched and expressed by a limited external database. When dealing with textures that are missing from the external database, the SR results may be oversmoothed and contain serious noise. The internal strategy can handle this situation, but it does not perform well when the image contains patches that rarely recur. It is therefore reasonable to jointly learn SR from external and internal examples.

However, there are many different SR methods. If two different methods are combined, it is hard to tell whether the final improvement comes from the two different SR approaches themselves or from the combination of the two example selection strategies. To reach a general conclusion, we adopt the same strategy, A+ [17], for both external and internal examples to obtain a joint SR model. In this way, any improvement depends only on the combination of samples.

2.1 Training Model

We adopt the same training strategy as A+ to obtain the mapping matrix for each anchor point.

In the external example-based A+ method, we apply K-SVD to learn a sparse dictionary \(\mathbf {A}_e\). Each atom of the dictionary is regarded as an anchor point. For each anchor point, we search for its nearest neighbors in the training set to construct a sub-dictionary pair \(\{\mathbf {D}_{He}^{ke}, \mathbf {D}_{Le}^{ke}\}_{ke=0}^{N_e}\).

As for the internal example-based A+ method, we adopt the double pyramid method to build the internal database. As shown in Fig. 1, we regard the input LR image \(\mathbf {Y}\) as an HR training image. The other HR training images are generated by scaling down the LR input image \(\mathbf {Y}\) with small factors \(p_i\), so the HR training set is denoted \(\{ \mathbf {Y}_H^i\}_{i=0}^{N_s}\), where \(N_s\) is the number of generated HR images. The LR training set is constructed by scaling down each HR image with factor s, the same factor used in the reconstruction step. We also rotate and flip the input LR image for data augmentation. From these images we then construct an HR and LR patch set for training. With the training set obtained, a sparse dictionary \(\mathbf {A}_i\) is learned by K-SVD in the same way. For each anchor point in the sparse dictionary, we also construct a sub-dictionary pair \(\{\mathbf {D}_{Hi}^{ki}, \mathbf {D}_{Li}^{ki}\}_{ki=0}^{N_i}\), where \(N_i\) is the number of anchor points in the internal model.
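As a concrete illustration, the double-pyramid construction can be sketched in NumPy as follows. This is a minimal sketch under our own assumptions: the bilinear resampler, the function names, and the default parameters are illustrative and not taken from the paper's Matlab implementation.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Simple bilinear resampling of a 2-D array."""
    in_h, in_w = img.shape
    # Map output sample positions back into input coordinates.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def build_pyramid(lr_img, n_scales=4, step=0.95, s=3):
    """Build HR/LR training pairs from the input image itself.
    Each HR level is the input scaled by step**i (i = 0 keeps the
    input itself); its LR partner is that level further downscaled
    by the SR factor s, matching the reconstruction step."""
    hr_set, lr_set = [], []
    for i in range(n_scales + 1):
        p = step ** i
        h = max(int(round(lr_img.shape[0] * p)), s)
        w = max(int(round(lr_img.shape[1] * p)), s)
        hr = bilinear_resize(lr_img, h, w)
        lr = bilinear_resize(hr, max(h // s, 1), max(w // s, 1))
        hr_set.append(hr); lr_set.append(lr)
    return hr_set, lr_set
```

Rotated and flipped copies of the input would be fed through the same routine for data augmentation.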

Fig. 1. The strategy of generating the training set from the input image.

2.2 Mapping Model

In this paper, we adopt ridge regression to learn the mapping matrices. Taking the external example-based method as an example, the regression is formulated as:

$$\begin{aligned} \mathbf {w} = \arg \min _{\mathbf {w}} {\Vert {\mathbf {y}}_l - {\mathbf {D}}_{Le}^{ke} \mathbf {w} \Vert }_2^2 + \lambda {\Vert \mathbf {w} \Vert }_2^2, \end{aligned}$$
(1)

where \(\mathbf {y}_l\) is an input LR patch, \({\mathbf {D}}_{Le}^{ke}\) is the corresponding sub-dictionary of \(\mathbf {y}_l\), and the index ke depends on the distance between the anchor points and the LR patch \(\mathbf {y}_l\). \(\mathbf {w}\) is the representation of \(\mathbf {y}_l\) on the sub-dictionary \(\mathbf {D}_{Le}^{ke}\).

Equation 1 has a closed-form solution:

$$\begin{aligned} \mathbf {w} = {( {{\mathbf {D}}_{Le}^{ke}}^T {\mathbf {D}}_{Le}^{ke} + \lambda \mathbf {I} )}^{-1} {{\mathbf {D}}_{Le}^{ke}}^T \mathbf {y}_l, \end{aligned}$$
(2)

Thus, we can obtain the corresponding HR image patch \(\mathbf {y}_h\) by applying the same coefficients to the HR sub-dictionary \(\mathbf {D}_{He}^{ke}\):

$$\begin{aligned} \mathbf {y}_h = \mathbf {D}_{He}^{ke} \mathbf {w}, \end{aligned}$$
(3)

Substituting Eq. (2) into Eq. (3), we obtain the mapping matrix \(\mathbf {P}_e^{ke}\):

$$\begin{aligned} \mathbf {P}_e^{ke} = \mathbf {D}_{He}^{ke} {( {{\mathbf {D}}_{Le}^{ke}}^T {\mathbf {D}}_{Le}^{ke} + \lambda \mathbf {I} )}^{-1} {{\mathbf {D}}_{Le}^{ke}}^T, \end{aligned}$$
(4)

The mapping matrices \(\{\mathbf {P}_i^{ki}\}_{ki=0}^{N_i}\) in the internal example-based method can be computed in the same way.
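Equations (2)-(4) can be illustrated with a short NumPy sketch; the function and variable names here are ours:

```python
import numpy as np

def mapping_matrix(D_h, D_l, lam=0.01):
    """Precompute P = D_h (D_l^T D_l + lam*I)^{-1} D_l^T  (Eq. 4).
    D_h, D_l: HR/LR sub-dictionaries for one anchor (atoms as columns)."""
    d = D_l.shape[1]
    # Solve against D_l^T instead of forming an explicit inverse.
    return D_h @ np.linalg.solve(D_l.T @ D_l + lam * np.eye(d), D_l.T)
```

Because \(\mathbf {P}_e^{ke}\) depends only on the sub-dictionaries and \(\lambda \), it can be precomputed offline for every anchor point; at test time each patch costs a single matrix-vector product \(\mathbf {y}_h = \mathbf {P}_e^{ke} \mathbf {y}_l\), equivalent to solving Eq. (2) and applying Eq. (3).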

2.3 Fusion Model and Image SR Reconstruction

In this stage, the input LR image is divided into overlapped image patches \(\{\mathbf {y}_i\}_{i=0}^{N}\); the underlying HR image patches are denoted \(\{\mathbf {x}_i\}_{i=0}^{N}\). Once we have the mapping matrices \(\{\mathbf {P}_e^{ke}\}_{ke=0}^{N_e}\) and \(\{\mathbf {P}_i^{ki}\}_{ki=0}^{N_i}\), we fuse them based on the distances between the input LR patch and the anchor points in \(\mathbf {A}_e\) and \(\mathbf {A}_i\), respectively.

We denote by \(d_e\) and \(d_i\) the distances between the LR input patch \(\mathbf {y}_i\) and the nearest anchor points in \(\mathbf {A}_e\) and \(\mathbf {A}_i\), respectively. In this paper, cosine similarity is chosen as the distance metric, so the greater the value, the closer the match. We attempt two joint strategies.

The first one, which we call the nearest strategy, compares \(d_e\) with \(d_i\) for each input LR patch \(\mathbf {y}_i\). If \(d_e\) is greater than \(d_i\), the anchor point from the external example-based model is closer than the internal one, and we choose the external mapping matrix \(\mathbf {P}_e^{ke}\):

$$\begin{aligned} \mathbf {P}^k =\left\{ \begin{array}{ll} \mathbf {P}_e^{ke} &{} \text {if } d_e > d_i \\ \mathbf {P}_i^{ki} &{} \text {otherwise,} \end{array} \right. \end{aligned}$$
(5)
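A minimal NumPy sketch of the nearest strategy (Eq. 5), assuming anchors are stored as columns of \(\mathbf {A}_e\) and \(\mathbf {A}_i\); the names are illustrative:

```python
import numpy as np

def cosine_sim(y, A):
    """Cosine similarity between patch feature y and each anchor (column of A)."""
    return (A.T @ y) / (np.linalg.norm(A, axis=0) * np.linalg.norm(y) + 1e-12)

def nearest_fusion(y, A_e, A_i, P_e, P_i):
    """Eq. (5): keep the mapping matrix whose anchor is most similar to y.
    P_e, P_i: lists of precomputed mapping matrices, one per anchor."""
    s_e, s_i = cosine_sim(y, A_e), cosine_sim(y, A_i)
    ke, ki = int(np.argmax(s_e)), int(np.argmax(s_i))
    return P_e[ke] if s_e[ke] > s_i[ki] else P_i[ki]
```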

The other is the weighted strategy. According to the distances \(d_e\) and \(d_i\), we assign different weights to the two mapping matrices \(\mathbf {P}_e^{ke}\) and \(\mathbf {P}_i^{ki}\):

$$\begin{aligned} \mathbf {P}^k = w_1 \mathbf {P}_e^{ke} + w_2 \mathbf {P}_i^{ki}, \end{aligned}$$
(6)

where \(w_1\) and \(w_2\) are weights that balance the two mapping matrices. Since larger values of \(d_e\) and \(d_i\) indicate closer anchors, the weight of the closer anchor should be larger: if \(d_e\) is greater than \(d_i\), then \(w_1\) should be greater than \(w_2\). We apply a simple weighted strategy to our model:

$$\begin{aligned} w_1 = \frac{d_e}{d_e + d_i}, \quad w_2 = \frac{d_i}{d_e + d_i}, \end{aligned}$$
(7)
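The weighted strategy can be sketched as follows. The normalized weights \(w_1 = d_e/(d_e+d_i)\) used here are one simple choice satisfying the constraints above (larger similarity gives larger weight, and the weights sum to one); they are an assumption on our part, not necessarily the paper's exact weighting.

```python
import numpy as np

def fuse_weighted(P_e_ke, P_i_ki, d_e, d_i):
    """Blend the two selected mapping matrices by their cosine
    similarities d_e, d_i to the input patch (assumed normalized weights)."""
    w1 = d_e / (d_e + d_i)
    w2 = d_i / (d_e + d_i)
    return w1 * P_e_ke + w2 * P_i_ki
```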

Once the fusion mapping matrix is obtained, we use it directly to reconstruct the underlying HR image patch \(\mathbf {x}_i\):

$$\begin{aligned} \mathbf {x}_i = \mathbf {P}^k \mathbf {y}_i, \end{aligned}$$
(8)

The desired HR image \(\mathbf {X}\) is reconstructed by merging all the HR image patches \(\{{\mathbf {x}_i}\}_{i=0}^N\) and averaging the overlapping regions between adjacent patches.
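The merging step amounts to a straightforward accumulate-and-average over patch positions; a minimal NumPy illustration (names are ours):

```python
import numpy as np

def merge_patches(patches, positions, out_shape):
    """Place each reconstructed HR patch at its (row, col) position,
    accumulate overlapping contributions, and average by per-pixel counts."""
    acc = np.zeros(out_shape)
    cnt = np.zeros(out_shape)
    for patch, (r, c) in zip(patches, positions):
        h, w = patch.shape
        acc[r:r + h, c:c + w] += patch
        cnt[r:r + h, c:c + w] += 1
    return acc / np.maximum(cnt, 1)   # avoid division by zero in uncovered pixels
```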

3 Experimental Results

In this section, we first compare the proposed method with the external A+ method and the internal A+ method to evaluate the validity of the fusion strategy. We also compare it with several representative SISR methods, including the external-based methods ScSR [21], Zeyde's [22] and A+ [17], the internal-based method SelfEx [9], and the deep-learning-based method SRCNN [5]. All experiments are carried out in the Matlab (R2016a) environment. For a fair comparison, the external example-based methods are all trained on the 91-image dataset [21]. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used to evaluate the quality of SR reconstruction; the results are listed in Tables 1 and 3. We use three test sets (Set5, Set14 and B100) for SR evaluation.

3.1 Implementation Details

We convert the RGB color space into the YCbCr color space, apply the proposed algorithm on the luminance channel (Y), and up-sample the color channels (Cb, Cr) by interpolation, since human vision is much more sensitive to luminance changes. The magnification factor is 3. The size of the LR and HR image patches is \(5 \times 5\) with an overlap of 4 pixels. The features of LR images are the first- and second-order derivatives of the patches. The features of HR images are the residuals between the ground truth and the interpolated LR images, representing the lost high-frequency details. The number of generated HR images, \(N_s\), is 19, which means there are 20 HR images in total (including the input image itself). We also perform data augmentation for training: we rotate the image to 64 angles, with a 5.625\(^\circ \) difference between adjacent angles. The size of the sparse dictionary in the external part, \(N_e\), is 2048; the size in the internal part, \(N_i\), is 1024. The regularization parameter \(\lambda \) is set to 0.01.
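For illustration, the first- and second-order derivative features can be computed with small 1-D filters applied along rows and columns. The kernels below ([-1,0,1] and [1,0,-2,0,1]) are the common choice in ANR/A+ pipelines and are an assumption here, since the paper does not specify them:

```python
import numpy as np

def gradient_features(img):
    """Four derivative feature maps per pixel: horizontal/vertical
    first-order gradients and second-order (Laplacian-like) responses."""
    f1 = np.array([-1., 0., 1.])
    f2 = np.array([1., 0., -2., 0., 1.])
    def filt_rows(im, k):
        # Convolve each row with kernel k, keeping the original length.
        return np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, im)
    gx = filt_rows(img, f1)        # horizontal 1st derivative
    gy = filt_rows(img.T, f1).T    # vertical 1st derivative
    lx = filt_rows(img, f2)        # horizontal 2nd derivative
    ly = filt_rows(img.T, f2).T    # vertical 2nd derivative
    return np.stack([gx, gy, lx, ly])
```

In the A+ pipeline these per-pixel responses are vectorized per \(5 \times 5\) patch and typically compressed by PCA before dictionary learning.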

Fig. 2. Results of medical endoscopic images.

3.2 Quality Evaluation

Table 1 shows the average performance of fusion using the two different strategies. Compared with the external A+ and internal A+ SR methods, both joint strategies improve the SISR results, indicating the effectiveness of the fusion. The nearest strategy outperforms the weighted strategy, so we use it in the rest of the experiments.

Table 1. Average performance in PSNR and SSIM using nearest strategy and weighted strategy on BSD100. Up-scaling factor: 3
Table 2. Comparison on PSNR with different methods on test images Set5. Upscale factor: 3.
Table 3. Benchmark SISR results. Average PSNR/SSIM for scale factor \(\times 3\) on datasets Set14 and BSD100. Bold indicates the best performance.

Table 2 shows the PSNR results on Set5. Our method achieves the best performance on most test images. We also compare the proposed method (with the nearest strategy) with some state-of-the-art SR methods on Set14 and BSD100. Table 3 shows the average PSNR and SSIM results for up-scaling factor 3. Our method outperforms the external, internal, and deep-learning-based methods on all datasets. It also achieves the best average SSIM, indicating that our reconstructed results are structurally most similar to the ground truth. We also collect some medical endoscopic images for visual comparison, as shown in Fig. 2: our method (with the nearest strategy) recovers more visually pleasing results with fewer artifacts, more accurate details and sharper edges.

4 Conclusion

External-based and internal-based super-resolution methods both have their own advantages. This paper studies strategies for jointly learning the two kinds of methods and proposes a universal fusion strategy for super-resolution. We use the same strategy as A+ [17] to obtain external and internal sub-dictionaries. Then, we use the nearest strategy and the weighted strategy to fuse the external and internal mapping matrices, and the high-resolution image is reconstructed with the fused mapping model. The experiments demonstrate the effectiveness of our strategy.