Modified non-local means for super-resolution of hybrid videos

https://doi.org/10.1016/j.cviu.2017.11.010

Highlights

  • Adaptive decaying factor for computing average weight in non-local means is proposed.

  • Predefined and exhaustive searching windows adapt to image characteristics.

  • Analysis of local difference shows its influence on the algorithm’s performance.

Abstract

Hybrid videos, which contain periodic low-resolution (LR) frames and high-resolution (HR) guide frames, are widely used to improve bandwidth efficiency and to balance the tradeoff between spatial and temporal resolution. Super-resolution (SR) algorithms are needed to refine the LR frames, and non-local means (NLM) is a promising candidate. NLM replaces every pixel with a weighted average of its neighbors based on the non-local self-similarity between pixels. However, the fixed decaying factor of NLM cannot accommodate regions with distinct characteristics in LR frames, and the fixed neighborhood, or so-called searching window, fails to balance the competing requirements of low computation and high video quality. In this paper, we propose novel criteria to choose these parameters adaptively. The decaying factor is defined by the patch difference of a pixel and guides NLM toward relevant pixels. Two methods are proposed to determine the neighborhood size: a predefined method inspired by motion estimation and an exhaustive method that searches progressively enlarged neighborhoods. Bilateral adjacent HR guide frames are used to handle the occlusion problem. We also analyze the defined patch difference at the pixel, frame, and sequence levels and reveal its influence on the algorithm. Experimental results verify the validity of the proposed method.

Introduction

Recently, hybrid video, also called reversed-complexity video coding (Brandi et al., 2008) or inconsistent scalable video streaming (Mahfoodh et al., 2015), i.e., a low-resolution (LR) video with periodic high-resolution (HR) frames, has been studied for different reasons, applied in distinct scenarios, and enhanced by several super-resolution (SR) algorithms. On the one hand, from the perspective of video compression, the use of hybrid video (see Fig. 1) not only reduces the video's data size and improves bandwidth efficiency but also reduces encoding complexity. Multi-view mixed-resolution video originates from the same purpose (Garcia, Dorea, de Queiroz, 2012, Jin, Tillo, Yao, Xiao, 2015, Li, Li, Fu, Niu, Long, 2016, Li, Li, Fu, Zhong, 2016, Richter, Seiler, Schnurrer, Kaup, 2015). Mukherjee et al. adopted this concept and proposed a resolution-reduction-based coding mode for existing codecs, in which computational complexity was transferred from the encoder side to the decoder side (Mukherjee et al., 2007). Brandi et al. directly proposed the use of hybrid video for the purpose of data-size reduction (Brandi et al., 2008).

On the other hand, the tradeoff between spatial and temporal resolution also motivates hybrid video (Ben-Ezra, Nayar, 2003, Tai, Du, Brown, Lin, 2010). For a pixel to be detected, an image sensor needs a minimum exposure time to accumulate sufficient irradiance. Provided that the sensor size is constant, the footprint of every pixel on the sensor shrinks as the spatial resolution increases, which means the exposure time must be prolonged to accumulate the same amount of irradiance on the reduced pixel footprint. Hybrid cameras can therefore simultaneously capture periodic HR snapshots at a low rate and LR frames at a high frame rate. Commercial cameras such as the Canon EOS 500D and Sony HDR-SR11 support this mode (Ancuti, Ancuti, Bekaert, 2010, Basavaraja, Bopardikar, Velusamy, 2010).

To enhance the LR frames of a hybrid video, Brandi et al. proposed a motion-estimation-based SR method in which high-frequency (HF) components of the HR frames were used to recover those of the LR frames (Brandi et al., 2008). Song et al. further used hierarchical motion estimation to obtain motion vectors as accurate as possible and employed example-based SR when motion estimation failed (Song et al., 2011). Mahfoodh et al. utilized quadtree-based motion estimation and incorporated their algorithm into VP9 spatial SVC (Mahfoodh, Mukherjee, Radha, 2015, Mukherjee, Bankoski, Grange, Han, Koleszar, Wilkins, Xu, Bultje, 2013). All of the aforementioned algorithms recovered an LR patch using only one HR patch derived from motion estimation, which limited their performance. Thus, Hung et al. developed an example-based SR method that searches and combines multiple HR patches from codebooks derived from key frames to super-resolve an LR patch (Hung et al., 2012). Bevilacqua et al. recovered an LR patch by taking sparse combinations of patches found in the adjacent HR frames (Bevilacqua et al., 2013).

The existing SR algorithms can be classified into three categories, namely interpolation-based, reconstruction-based, and example-based.

Basic interpolation-based methods such as bilinear or bicubic interpolation rely on the smoothness assumption of natural images, but they tend to blur the resulting images, especially at edges. More advanced approaches belong to a class of visually oriented interpolation techniques, including edge-directed, content-adaptive, and wavelet-based methods (Allebach, Wong, 1996, Li, Orchard, 2001, Wang, Ward, 2007, Zhang, Wu, 2006). However, video sequences refined by interpolation suffer from perceived loss of detail in textured regions because interpolation cannot estimate HF information.
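As a concrete illustration of the simplest category above, the following is a minimal bilinear upscaler written from scratch; the function name and parameters are ours, not the paper's, and grayscale float images are assumed. Each output pixel is a distance-weighted average of its four nearest input pixels, which is exactly why edges and textures come out smoothed.

```python
import numpy as np

def bilinear_upscale(img, factor=2):
    """Bilinear interpolation: each output pixel is a distance-weighted
    average of the four nearest input pixels (illustrative sketch)."""
    h, w = img.shape
    H, W = h * factor, w * factor
    ys = np.linspace(0, h - 1, H)          # output sample positions in input coords
    xs = np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)         # clamp at the border
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]                # vertical fractional offsets
    wx = (xs - x0)[None, :]                # horizontal fractional offsets
    return ((1 - wy) * (1 - wx) * img[np.ix_(y0, x0)]
            + (1 - wy) * wx * img[np.ix_(y0, x1)]
            + wy * (1 - wx) * img[np.ix_(y1, x0)]
            + wy * wx * img[np.ix_(y1, x1)])
```

Because the interpolant is piecewise linear, it reproduces linear ramps exactly but can never create frequency content above what the LR grid carries, which matches the detail-loss behavior described above.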

Reconstruction-based methods use subpixel shifts among several LR images of the same scene taken from multiple viewpoints. By estimating these shifts, pixels are rearranged into an HR grid and combined to complete an HR image (Bose, Ahuja, 2006, Farsiu, Robinson, Elad, Milanfar, 2004, Takeda, Farsiu, Milanfar, 2007). Iterative back projection (IBP) recovers a final HR image by projecting the reconstruction error between the LR and intermediate HR images back to the HR image iteratively (Gan, Cui, Chen, Zhu, 2013, Zhang, Liu, Li, Zhou, Zhang, 2016). Maximum a posteriori probability (MAP) methods utilize Bayesian statistical properties of images and adopt prior information such as total variation, Tikhonov regularization (Fu et al., 2016a), and non-local prior (Zhang et al., 2012) to stabilize the solution. However, due to the limited information available, reconstruction-based methods hit a bottleneck in improving the recovered image quality.
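The iterative back projection (IBP) idea mentioned above can be sketched as follows; this is an illustrative toy with a simple block-mean operator standing in for the blur-plus-decimation model and a nearest-neighbour upsampler as the back-projection kernel, not the cited authors' exact formulation.

```python
import numpy as np

def ibp(lr, factor=2, iters=20, step=1.0):
    """Iterative back projection (illustrative): refine an HR estimate by
    projecting the LR-domain reconstruction error back onto it."""
    # Initial HR guess: nearest-neighbour upsampling of the LR image.
    hr = np.kron(lr, np.ones((factor, factor)))
    h, w = lr.shape
    for _ in range(iters):
        # Simulate re-acquisition: block mean stands in for blur + decimation.
        down = hr.reshape(h, factor, w, factor).mean(axis=(1, 3))
        err = lr - down                                  # LR-domain error
        # Back-project the error onto the HR grid and update the estimate.
        hr += step * np.kron(err, np.ones((factor, factor)))
    return hr
```

With this simple operator pair the loop converges quickly: after convergence, downsampling the HR estimate reproduces the observed LR image, which is exactly the consistency constraint IBP enforces.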

Example-based algorithms use known HR images to build a database consisting of pairs of LF and HF information in a training phase (Freeman, Jones, Pasztor, 2001, Timofte, Rothe, Van Gool, 2016, Wang, Gao, Zhang, Li, 2016, Yin, Gao, Cai, 2015). The established database then guides the inference stage to search for a matching HR block for every block in the LR image. Under the paradigm of learning, other algorithms adopt techniques such as convolutional neural networks (CNN) (Kim et al., 2016), sparse representation (Polatkan et al., 2015), and neighborhood embedding (Chang et al., 2004) to address SR problems. These algorithms achieve state-of-the-art performance. For video SR, motion estimation plays a key role in compensating for the motion between consecutive frames (Brandi, de Queiroz, Mukherjee, 2008, Hung, de Queiroz, Brandi, de Oliveira, Mukherjee, 2012, Song, Jeong, Choi, 2011). Liu and Sun proposed a Bayesian framework for video SR that simultaneously estimates motion, blur kernel, and noise level (Liu and Sun, 2014); however, the iterative procedure caused a heavy computational burden. Liao et al. addressed this with a non-iterative method based on deep draft-ensemble learning (Liao et al., 2015). Kappeler et al. used explicit adaptive motion compensation as a preprocessing step before video frames were fed into a CNN framework (Kappeler et al., 2016). Other example-based video SR methods can be found in Dai et al. (2017); Huang et al. (2017); Shi et al. (2016).

Apart from the concept of learning, non-local self-similarity is another important concept for addressing image processing problems. The basic idea originates from the observation that similar image patches usually recur within a natural image (Buades et al., 2005), its derivatives (Gilboa, Osher, 2008, Zhang, Burger, Bresson, Osher, 2010), or even its sparse coding coefficients (Dong et al., 2013). Buades et al. first proposed the non-local means (NLM) filter for image denoising (Buades et al., 2005). Inspired by this idea, Dabov et al. grouped image patches into three-dimensional (3D) stacks according to the non-local self-similarity between them and devised the block-matching and 3D filtering (BM3D) algorithm (Dabov et al., 2007). Some researchers also incorporated non-locality into a variational framework and proposed non-local total variation, which was widely applied to image inpainting (Gilboa and Osher, 2008), motion estimation (Werlberger et al., 2010), and image SR (Dong, Zhang, Shi, Li, 2013, Ren, He, Nguyen, 2017).

Protter et al. first generalized NLM to SR from the viewpoint of error-energy minimization (Protter et al., 2009). Basavaraja et al. combined the work in Brandi et al. (2008) and Protter et al. (2009) to compute the HF part of a pixel using NLM (Basavaraja et al., 2010). Lengyel et al. incorporated illuminance and gradient information into the similarity comparison and reduced the number of averaged pixels by thresholding (Lengyel et al., 2014).

The classical NLM algorithm has two major steps. First, it compares the similarity between a pixel and its neighbors and assigns weights to those neighbors; the weight is a decreasing function of the Euclidean distance between the patches surrounding the two pixels. Second, NLM replaces every pixel with a weighted average of its neighbors. To adapt the algorithm to SR tasks, the second step is altered so that only the HF part of a pixel is computed, as a weighted average of its neighbors' HF parts. The derived HF part is then added to the interpolated LR frame to complete the SR processing.
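The two steps above can be sketched as follows. This is a minimal illustration assuming grayscale float frames and a pixel far enough from the borders that no boundary handling is needed; the function names and default parameter values are ours, not the paper's, and the Gaussian weight with decaying factor `h` is the classical choice.

```python
import numpy as np

def nlm_weights(lr, x, y, search=5, patch=3, h=10.0):
    """Step 1: weight each neighbor of (x, y) by patch similarity.
    The weight decays with the mean squared patch difference, scaled by h."""
    p = patch // 2
    ref = lr[y - p:y + p + 1, x - p:x + p + 1]        # patch around target pixel
    ws, coords = [], []
    for j in range(y - search, y + search + 1):       # searching window
        for i in range(x - search, x + search + 1):
            cand = lr[j - p:j + p + 1, i - p:i + p + 1]
            d2 = np.mean((ref - cand) ** 2)           # patch difference
            ws.append(np.exp(-d2 / h ** 2))           # decaying factor h
            coords.append((j, i))
    ws = np.asarray(ws)
    return ws / ws.sum(), coords                      # normalized weights

def sr_pixel(lr, hf, x, y, **kw):
    """Step 2, SR variant: average the neighbors' HF parts, not intensities.
    The result is later added to the interpolated LR frame."""
    w, coords = nlm_weights(lr, x, y, **kw)
    return sum(wk * hf[j, i] for wk, (j, i) in zip(w, coords))
```

Note that the weights always sum to one, so the averaged HF value stays within the range of the neighboring HF values.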

In this paper, we focus on two major parameters of NLM for video SR: the decaying factor used to compute the weights, and the size of the neighborhood (searching window) within which NLM searches for pixels similar to the target pixel. The fixed decaying factor of NLM cannot accommodate regions with distinct characteristics in an LR frame, and the fixed searching window fails to balance the requirements of low computational complexity and high quality of super-resolved images. Thus, we propose a novel criterion to select the decaying factor adaptively. We also propose two methods to adaptively determine the size of the searching window, namely the predefined searching window (Li et al., 2016b) and the exhaustive searching window (Li et al., 2016d). The predefined method is a preprocessing step implemented before NLM; it is inspired by motion estimation but more efficient to carry out. The exhaustive method is incorporated into the NLM process itself and determines the window size by iteratively searching a progressively enlarged window until the local difference drops below a termination threshold.
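The exhaustive-window idea and the patch-difference-driven decaying factor can be sketched roughly as follows. This is our illustrative reading, not the paper's exact criterion: the function names, the growth schedule, and the specific mapping from patch difference to decaying factor are all assumptions, and boundary handling is omitted.

```python
import numpy as np

def adaptive_window(lr, x, y, patch=3, start=2, step=2, max_r=8, tol=25.0):
    """Grow the searching window until the local difference (smallest patch
    difference inside the current window) drops below `tol` (illustrative)."""
    p = patch // 2
    ref = lr[y - p:y + p + 1, x - p:x + p + 1]
    for r in range(start, max_r + 1, step):           # progressively enlarge
        local_diff = min(
            float(np.mean((ref - lr[j - p:j + p + 1, i - p:i + p + 1]) ** 2))
            for j in range(y - r, y + r + 1)
            for i in range(x - r, x + r + 1)
            if (j, i) != (y, x))                      # exclude the center pixel
        if local_diff <= tol:                         # a good match was found
            return r
    return max_r                                      # give up at the cap

def adaptive_h(patch_diff, h_min=5.0, scale=1.0):
    """Tie the decaying factor to the local patch difference (illustrative):
    larger differences in a region call for a larger, more tolerant h."""
    return max(h_min, scale * np.sqrt(patch_diff))
```

The design intent mirrors the text: textured regions with good nearby matches keep a small window (cheap), while regions without close matches keep expanding up to a cap (better quality at higher cost).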

The remainder of the paper is organized as follows. Section 2 discusses hybrid videos and the basic NLM algorithm; we also define patch difference, local difference, and global difference in this section. Section 3 explains the proposed algorithm. Section 4 analyzes the defined patch difference at three levels, namely the pixel, frame, and sequence levels, and reveals its influence on NLM. Section 5 presents the experimental results. Section 6 concludes the paper.

Section snippets

Hybrid video

In the image and video acquisition process, an image is usually degraded by several processes such as blurring, decimation, and noise corruption, i.e.,

Y = DBX + n,

where X is the ground-truth image of the actual scene, Y is the degraded image, B stands for blurring, D stands for decimation, and n is usually independent Gaussian noise.
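The degradation model Y = DBX + n can be simulated with a minimal sketch like the one below; the box blur matched to the decimation factor is our simplifying assumption (B and D then fold into a single block mean), not the paper's specific operators.

```python
import numpy as np

def degrade(x, factor=2, noise_std=1.0, seed=0):
    """Illustrative Y = D B X + n: B is a box blur matched to the decimation
    factor, D keeps one sample per block (together: a block mean), and n is
    i.i.d. Gaussian noise."""
    h, w = x.shape
    h, w = h - h % factor, w - w % factor             # crop to a whole number of blocks
    blocks = x[:h, :w].reshape(h // factor, factor, w // factor, factor)
    y = blocks.mean(axis=(1, 3))                      # B then D in one step
    rng = np.random.default_rng(seed)
    return y + rng.normal(0.0, noise_std, y.shape)    # + n
```

Under this model, the LR frames of a hybrid video are degraded observations while the periodic HR guide frames are (close to) the ground truth X, which is what makes them useful as guides.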

For the hybrid video shown in Fig. 1, the basic LR frames and periodic HR frames follow distinct degradation models. The periodic HR frames are the ground-truth images

Bilateral video super-resolution

NLM searches the neighborhood of a center pixel to find pixels similar to it. The method can be considered a coarse and implicit motion estimator.

Generally, a single frame in a video sequence can be divided into background and foreground objects. The background is usually stable or moves slowly and it is enough to super-resolve the background pixels in an LR frame using a forward or backward HR frame. However, the foreground object may move fast so that an object around the boundary moves in
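The bilateral idea above, i.e., drawing on both the forward and the backward HR guide frame so that a region occluded in one can still be recovered from the other, can be sketched roughly as follows. The selection rule (keep whichever guide patch matches the LR patch better) and the function name are our illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def pick_guide(lr_patch, fwd_patch, bwd_patch):
    """Bilateral guide selection (illustrative): compare the LR patch with
    co-located patches in the forward and backward HR guide frames and keep
    the better match, so occlusion in one guide frame is not fatal."""
    d_fwd = float(np.mean((lr_patch - fwd_patch) ** 2))   # patch difference, forward
    d_bwd = float(np.mean((lr_patch - bwd_patch) ** 2))   # patch difference, backward
    if d_fwd <= d_bwd:
        return 'forward', fwd_patch
    return 'backward', bwd_patch
```

This matches the intuition in the text: background pixels are usually well served by either guide frame, while a fast-moving foreground object near an occlusion boundary needs whichever guide frame still shows the region.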

Analysis of patch difference

The patch difference strongly influences the performance of the NLM algorithm, and analyzing it, together with the concepts defined from it, leads to an insightful understanding of the mechanics of the algorithm. Thus, in this section, we analyze patch difference at three levels, namely the pixel, frame, and sequence levels. Unless otherwise stated, the period T of the HR frames of a hybrid video is 6; that is, every seventh frame from the first one is HR

Experimental results

In this section, we show the experimental results of the proposed and compared methods. All of these methods have been tested on 13 video sequences with different characteristics including Ballroom, Foreman, Mobile, News, Hall, Flower, Container, Waterfall, Coastguard, Mother-Daughter, Crowd, Exit, and Vassar. The compared methods include bilinear interpolation (BI), TNLM (Buades et al., 2005), DWSR (Basavaraja et al., 2010), DWSR with adaptive decaying factor (FHL), DWSR with predefined

Conclusion

The NLM algorithm has a very promising application in SR tasks. However, the traditional NLM algorithms suffer from two main drawbacks, i.e., the fixed decaying factor and searching window. The fixed decaying parameter is unfit for regions with different characteristics. It tends to blur the relatively flat regions in the image, resulting in perceived loss of detail. On the other hand, the fixed searching window leads to mismatches between pixels, causing unbearable degradation of the video. In

Acknowledgment

This work was supported by the Natural Science Foundation of China (61671126).

References (57)

  • Y. Li et al.

    Bilateral video super-resolution using non-local means with adaptive parameters

    Proc. IEEE International Conference on Image Processing

    (2016)
  • K. Zhang et al.

    Single image super-resolution with non-local means and steering kernel regression

    IEEE Trans. Image Process.

    (2012)
  • J. Allebach et al.

    Edge-directed interpolation

    Proc. IEEE International Conference on Image Processing

    (1996)
  • C. Ancuti et al.

    Video super-resolution using high quality photographs

    Proc. IEEE International Conference on Acoustic, Speech, and Signal Processing

    (2010)
  • S.V. Basavaraja et al.

    Detail warping based video super-resolution using image guides

    Proc. IEEE International Conference on Image Processing

    (2010)
  • M. Ben-Ezra et al.

    Motion deblurring using hybrid imaging

    Proc. IEEE Conference on Computer Vision and Pattern Recognition

    (2003)
  • M. Bevilacqua et al.

    Video super-resolution via sparse combinations of key-frame patches in a compression context

    Proc. Picture Coding Symposium

    (2013)
  • N.K. Bose et al.

    Superresolution and noise filtering using moving least squares

    IEEE Trans. Image Process.

    (2006)
  • F. Brandi et al.

    Super-resolution of video using key frames and motion estimation

    Proc. IEEE International Conference on Image Processing

    (2008)
  • A. Buades et al.

    A non-local algorithm for image denoising

    Proc. IEEE Conference on Computer Vision and Pattern Recognition

    (2005)
  • H. Chang et al.

    Super-resolution through neighbor embedding

    Proc. IEEE Conference on Computer Vision and Pattern Recognition

    (2004)
  • B.-T. Choi et al.

    New frame rate up-conversion using bi-directional motion estimation

    IEEE Trans. Consum. Electron.

    (2000)
  • K. Dabov et al.

    Image denoising by sparse 3-D transform-domain collaborative filtering

    IEEE Trans. on Image Process.

    (2007)
  • Q. Dai et al.

    Sparse representation-based multiple frame video super-resolution

    IEEE Trans. Image Process.

    (2017)
  • W. Dong et al.

    Nonlocally centralized sparse representation for image restoration

    IEEE Trans. Image Process.

    (2013)
  • C.E. Duchon

    Lanczos filtering in one and two dimensions

    J. Appl. Meteorol.

    (1979)
  • S. Farsiu et al.

    Fast and robust multiframe super resolution

    IEEE Trans. Image Process.

    (2004)
  • W.T. Freeman et al.

    Example-based super-resolution

    IEEE Comput. Graph. Appl.

    (2001)
  • Z. Fu et al.

    Frequency domain based super-resolution method for mixed-resolution multiview images

    J. Syst. Eng. Electron.

    (2016)
  • Z. Fu et al.

    Adaptive luminance adjustment and neighborhood spreading strength information based view synthesis

    J. Syst. Eng. Electron.

    (2016)
  • Z. Gan et al.

    Adaptive joint nonlocal means denoising back projection for image super resolution

    Proc. IEEE International Conference on Image Processing

    (2013)
  • D.C. Garcia et al.

    Super resolution for multiview images using depth information

    IEEE Trans. Circuits Syst. Video Technol.

    (2012)
  • G. Gilboa et al.

    Nonlocal operators with applications to image processing

    Multiscale Model. Simul.

    (2008)
  • Y. Huang et al.

    Bidirectional recurrent convolutional networks for multi-frame super-resolution

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)
  • E.M. Hung et al.

    Video super-resolution using codebooks derived from key-frames

    IEEE Trans. Circuits Syst. Video Technol.

    (2012)
  • Z. Jin et al.

    Virtual view assisted video super-resolution and enhancement

    IEEE Trans. Circuits Syst. Video Technol.

    (2015)
  • A. Kappeler et al.

    Super-resolution of compressed videos using convolutional neural networks

    Proc. IEEE International Conference on Image Processing

    (2016)
  • J. Kim et al.

    Accurate image super-resolution using very deep convolutional networks

    Proc. IEEE Conference on Computer Vision and Pattern Recognition

    (2016)