Modified non-local means for super-resolution of hybrid videos
Introduction
Recently, hybrid video, that is, a low-resolution (LR) video with periodic high-resolution (HR) frames, also known as reversed-complexity video coding (Brandi et al., 2008) or inconsistent scalable video streaming (Mahfoodh et al., 2015), has been studied from different perspectives, applied in distinct scenarios, and enhanced by several super-resolution (SR) algorithms. On the one hand, from the perspective of video compression, the use of hybrid video (see Fig. 1) can not only reduce a video's data size and improve the efficiency of bandwidth usage but also reduce encoding complexity. Multi-view mixed-resolution video originates from the same purpose (Garcia, Dorea, & de Queiroz, 2012; Jin, Tillo, Yao, & Xiao, 2015; Li, Li, Fu, Niu, & Long, 2016; Li, Li, Fu, & Zhong, 2016; Richter, Seiler, Schnurrer, & Kaup, 2015). Mukherjee et al. adopted this concept and proposed a resolution-reduction-based coding mode for existing codecs in which computational complexity was transferred from the encoder side to the decoder side (Mukherjee et al., 2007). Brandi et al. directly proposed the use of hybrid video for the purpose of data-size reduction (Brandi et al., 2008).
On the other hand, the tradeoff between spatial and temporal resolution also motivates the use of hybrid video (Ben-Ezra & Nayar, 2003; Tai, Du, Brown, & Lin, 2010). For a pixel to be detected, an image sensor needs a minimum exposure time to accumulate sufficient irradiance. Provided that the sensor size is constant, the footprint of every pixel on the sensor shrinks as the image spatial resolution increases, which means the exposure time must be prolonged to accumulate the same amount of irradiance on a reduced pixel footprint. Hybrid cameras can simultaneously capture periodic HR snapshots at a low rate and LR frames at a high frame rate. Commercial cameras such as the Canon EOS 500D and Sony HDR-SR11 support this application (Ancuti, Ancuti, & Bekaert, 2010; Basavaraja, Bopardikar, & Velusamy, 2010).
To enhance the LR frames of a hybrid video, Brandi et al. proposed a motion-estimation-based SR method in which the high-frequency (HF) components of the HR frames were used to recover those of the LR frames (Brandi et al., 2008). Song et al. further used hierarchical motion estimation to obtain motion vectors as accurate as possible and employed example-based SR when motion estimation failed (Song et al., 2011). Mahfoodh et al. utilized quadtree-based motion estimation and incorporated their algorithm into VP9 spatial SVC (Mahfoodh, Mukherjee, & Radha, 2015; Mukherjee, Bankoski, Grange, Han, Koleszar, Wilkins, Xu, & Bultje, 2013). All of the aforementioned algorithms recovered an LR patch using only one HR patch derived from motion estimation, which limited their performance. Thus, Hung et al. developed an example-based SR method that searches for and combines multiple HR patches in codebooks derived from key frames to super-resolve an LR patch (Hung et al., 2012). Bevilacqua et al. recovered an LR patch by taking sparse combinations of patches found in the adjacent HR frames (Bevilacqua et al., 2013).
The existing SR algorithms can be classified into three categories, namely interpolation-based, reconstruction-based, and example-based.
Basic interpolation-based methods such as bilinear or bicubic interpolation rely on the smoothness assumption of natural images, but they tend to blur the resulting images, especially at edges. More advanced approaches belong to a class of visually oriented interpolation techniques, including edge-directed, content-adaptive, and wavelet-based methods (Allebach & Wong, 1996; Li & Orchard, 2001; Wang & Ward, 2007; Zhang & Wu, 2006). However, video sequences refined by interpolation suffer from a perceived loss of detail in texture regions because interpolation is unable to estimate HF information.
Reconstruction-based methods use subpixel shifts among several LR images of the same scene taken from multiple viewpoints. By estimating these shifts, pixels are rearranged onto an HR grid and combined to form an HR image (Bose & Ahuja, 2006; Farsiu, Robinson, Elad, & Milanfar, 2004; Takeda, Farsiu, & Milanfar, 2007). Iterative back projection (IBP) recovers the final HR image by iteratively projecting the reconstruction error between the LR and intermediate HR images back onto the HR image (Gan, Cui, Chen, & Zhu, 2013; Zhang, Liu, Li, Zhou, & Zhang, 2016). Maximum a posteriori probability (MAP) methods utilize Bayesian statistical properties of images and adopt prior information such as total variation, Tikhonov regularization (Fu et al., 2016a), and the non-local prior (Zhang et al., 2012) to stabilize the solution. However, due to the limited information available, reconstruction-based methods hit a bottleneck in improving the quality of the recovered image.
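The IBP idea described above can be sketched in a few lines. This is a generic illustration, not the formulation of any cited paper: the `upsample` and `downsample` callables stand in for the (method-specific) degradation operators, and the function name is our own.

```python
import numpy as np

def back_project(Y, upsample, downsample, n_iter=10):
    """Generic IBP loop: refine an HR estimate X so that downsample(X)
    reproduces the observed LR image Y.  `upsample`/`downsample` are
    placeholders for the true degradation operators."""
    X = upsample(Y)                      # initial HR estimate
    for _ in range(n_iter):
        err = Y - downsample(X)          # reconstruction error in the LR domain
        X = X + upsample(err)            # project the error back to the HR grid
    return X
```

With nearest-neighbor upsampling (`np.kron(x, np.ones((2, 2)))`) and direct decimation (`x[::2, ::2]`) as the operators, the loop drives the LR-domain error toward zero.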
Example-based algorithms use known HR images to build a database consisting of pairs of low-frequency (LF) and HF information in a training phase (Freeman, Jones, & Pasztor, 2001; Timofte, Rothe, & Van Gool, 2016; Wang, Gao, Zhang, & Li, 2016; Yin, Gao, & Cai, 2015). The established database then guides the learning phase to search for a matching HR block for every block in the LR image. Under the learning paradigm, other algorithms adopt techniques such as convolutional neural networks (CNNs) (Kim et al., 2016), sparse representation (Polatkan et al., 2015), and neighborhood embedding (Chang et al., 2004) to address SR problems. These algorithms achieve state-of-the-art performance. For video SR, motion estimation plays a key role in compensating for the motion between consecutive video frames (Brandi, de Queiroz, & Mukherjee, 2008; Hung, de Queiroz, Brandi, de Oliveira, & Mukherjee, 2012; Song, Jeong, & Choi, 2011). Liu and Sun proposed a Bayesian framework for video SR that simultaneously estimates the motion, blur kernel, and noise level (Liu and Sun, 2014). However, its iterative procedure caused a heavy computational burden, so Liao et al. addressed the problem with a non-iterative method based on deep draft-ensemble learning (Liao et al., 2015). Kappeler et al. used explicit adaptive motion compensation to preprocess video frames before feeding them into a CNN framework (Kappeler et al., 2016). Other example-based video SR methods can be found in Dai et al. (2017), Huang et al. (2017), and Shi et al. (2016).
Apart from the concept of learning, non-local self-similarity is another important concept for addressing image processing problems. The basic idea originates from the observation that similar image patches usually recur within a natural image (Buades et al., 2005), its derivatives (Gilboa & Osher, 2008; Zhang, Burger, Bresson, & Osher, 2010), or even its sparse coding coefficients (Dong et al., 2013). Buades et al. first proposed the non-local means (NLM) filter for image denoising (Buades et al., 2005). Inspired by this idea, Dabov et al. assigned image patches to three-dimensional (3D) groups according to their non-local self-similarity and devised the block-matching and 3D filtering (BM3D) method (Dabov et al., 2007). Some researchers also incorporated non-locality into a variational framework and proposed non-local total variation, which has been widely applied to image inpainting (Gilboa and Osher, 2008), motion estimation (Werlberger et al., 2010), and image SR (Dong, Zhang, Shi, & Li, 2013; Ren, He, & Nguyen, 2017).
Protter et al. first generalized NLM to SR from the viewpoint of error-energy minimization (Protter et al., 2009). Basavaraja et al. combined the work in Brandi et al. (2008) and Protter et al. (2009) to compute the HF part of a pixel using NLM (Basavaraja et al., 2010). Lengyel et al. incorporated illuminance and gradient information into the similarity comparison and reduced the number of averaged pixels by thresholding (Lengyel et al., 2014).
The classical NLM algorithm has two major steps. First, it compares the similarity between a pixel and its neighbors and assigns weights to those neighbors; the weight is a decreasing function of the Euclidean distance between the patches surrounding the two pixels. Second, NLM replaces every pixel with a weighted average of its neighbors. To adapt the algorithm to SR tasks, the second step is altered to compute only the HF part of a pixel as a weighted average of its neighbors' HF parts. The derived HF part is then added to the interpolated LR frame to complete the SR processing.
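The two steps above can be sketched for a single pixel as follows. This is a minimal illustration of the NLM-for-SR idea with a fixed decaying factor `h` and a fixed search window; the function name, array layout, and the Gaussian weight `exp(-d²/h²)` are our illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def nlm_sr_pixel(lr_up, hr_lf, hf, y, x, patch=3, search=7, h=10.0):
    """Estimate the HF value at (y, x) of the interpolated LR frame.

    lr_up : interpolated LR frame (2-D float array)
    hr_lf : LF band of the HR reference frame, on the same grid
    hf    : HF band of the HR reference frame (HR frame minus hr_lf)
    h     : decaying factor controlling how fast weights fall off
    """
    r, s = patch // 2, search // 2
    H, W = lr_up.shape
    target = lr_up[y - r:y + r + 1, x - r:x + r + 1]
    num = den = 0.0
    for j in range(max(r, y - s), min(H - r, y + s + 1)):
        for i in range(max(r, x - s), min(W - r, x + s + 1)):
            cand = hr_lf[j - r:j + r + 1, i - r:i + r + 1]
            d2 = np.mean((target - cand) ** 2)   # patch difference
            w = np.exp(-d2 / (h * h))            # weight from decaying factor
            num += w * hf[j, i]
            den += w
    return num / den   # HF estimate; added to lr_up[y, x] to super-resolve it
```

Small `h` concentrates the average on near-identical patches; large `h` flattens the weights, which is exactly the behavior the adaptive decaying factor proposed later seeks to control per region.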
In this paper, we focus on two major parameters of NLM for video SR: the decaying factor used to compute the weights, and the size of the neighborhood (searching window) within which NLM searches for pixels similar to the target pixel. A fixed decaying factor cannot satisfy regions of distinct characteristics in an LR frame, and a fixed searching window fails to balance the competing requirements of low computational complexity and high super-resolved image quality. Thus, we propose a novel criterion to select the decaying factor adaptively. We also propose two methods to adaptively determine the size of the searching window, namely the predefined searching window (Li et al., 2016b) and the exhaustive searching window (Li et al., 2016d). The predefined method is a preprocessing step carried out before NLM; it is inspired by motion estimation but more efficient to execute. The exhaustive method is incorporated into the NLM process itself and determines the window size by iteratively searching progressively enlarged windows until the local difference drops below a termination threshold.
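The control flow of the exhaustive method can be sketched abstractly. Here `diff_fn(size)` is a placeholder for the local difference obtained with a window of the given size, and all names, step sizes, and bounds are illustrative assumptions rather than the paper's parameters.

```python
def exhaustive_window(diff_fn, start=3, step=2, max_size=21, tol=1e-2):
    """Enlarge the search window progressively until the local
    difference returned by diff_fn(size) falls below `tol`."""
    size = start
    while size <= max_size:
        if diff_fn(size) < tol:
            return size      # window now contains a sufficiently good match
        size += step         # otherwise enlarge and search again
    return max_size          # give up growing at the cap
```

The termination test bounds the computation: slowly moving background pixels stop at small windows, while fast-moving foreground pixels are allowed to search wider.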
The remainder of this paper is organized as follows. Section 2 discusses hybrid videos and the basic NLM algorithm; we also define the patch difference, local difference, and global difference in this section. Section 3 explains the proposed algorithm. Section 4 analyzes the defined patch difference at three levels, namely the pixel, frame, and sequence levels, and reveals its influence on NLM. Section 5 presents the experimental results. Section 6 concludes the paper.
Section snippets
Hybrid video
In the image and video acquisition process, an image is usually degraded by several processes such as blurring, decimation, and noise corruption, i.e.,

Y = DBX + n,

where X is the ground-truth image of the actual scene, Y is the degraded image, B stands for blurring, D stands for decimation, and n is usually independent Gaussian noise.
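The degradation model can be simulated directly; the sketch below is a minimal illustration with assumed operators (a 3x3 box filter for B, direct subsampling for D, and an arbitrary noise level), not the paper's specific choices.

```python
import numpy as np

def degrade(X, factor=2, sigma_n=0.01, seed=0):
    """Simulate Y = D B X + n: blur, decimate, then add Gaussian noise."""
    # B: 3x3 box blur via shifted sums on an edge-padded copy
    k = np.ones((3, 3)) / 9.0
    pad = np.pad(X, 1, mode='edge')
    B = sum(k[i, j] * pad[i:i + X.shape[0], j:j + X.shape[1]]
            for i in range(3) for j in range(3))
    D = B[::factor, ::factor]                          # D: decimation
    n = np.random.default_rng(seed).normal(0.0, sigma_n, D.shape)
    return D + n                                       # n: Gaussian noise
```

In a hybrid video, the LR frames follow this full model while the periodic HR frames bypass it, which is exactly the asymmetry the SR algorithm exploits.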
For the hybrid video shown in Fig. 1, the basic LR frames and periodic HR frames follow distinct degradation models. The periodic HR frames are the ground-truth images
Bilateral video super-resolution
NLM searches in the neighborhood of a center pixel to find pixels similar to it. The method can be considered a coarse and implicit motion estimator.
Generally, a single frame in a video sequence can be divided into background and foreground objects. The background is usually stable or moves slowly and it is enough to super-resolve the background pixels in an LR frame using a forward or backward HR frame. However, the foreground object may move fast so that an object around the boundary moves in
Analysis of patch difference
The patch difference has a strong influence on the performance of the NLM algorithm. Analyzing the patch difference and the concepts defined from it leads to an insightful understanding of the algorithm's mechanics. Thus, in this section, we analyze the patch difference at three levels, namely the pixel, frame, and sequence levels. Unless otherwise stated, the period T of the HR frames of a hybrid video is 6. That is, every seventh frame from the first one is HR
Experimental results
In this section, we present the experimental results of the proposed and compared methods. All methods were tested on 13 video sequences with different characteristics: Ballroom, Foreman, Mobile, News, Hall, Flower, Container, Waterfall, Coastguard, Mother-Daughter, Crowd, Exit, and Vassar. The compared methods include bilinear interpolation (BI), TNLM (Buades et al., 2005), DWSR (Basavaraja et al., 2010), DWSR with adaptive decaying factor (FHL), DWSR with predefined
Conclusion
The NLM algorithm has very promising applications in SR tasks. However, traditional NLM algorithms suffer from two main drawbacks: a fixed decaying factor and a fixed searching window. The fixed decaying factor is unfit for regions with different characteristics; it tends to blur the relatively flat regions of the image, resulting in a perceived loss of detail. The fixed searching window, on the other hand, leads to mismatches between pixels, causing unbearable degradation of the video. In
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Grant 61671126).
References (57)
- et al., Bilateral video super-resolution using non-local means with adaptive parameters, Proc. IEEE International Conference on Image Processing (2016)
- et al., Single image super-resolution with non-local means and steering kernel regression, IEEE Trans. Image Process. (2012)
- et al., Edge-directed interpolation, Proc. IEEE International Conference on Image Processing (1996)
- et al., Video super-resolution using high quality photographs, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (2010)
- et al., Detail warping based video super-resolution using image guides, Proc. IEEE International Conference on Image Processing (2010)
- et al., Motion deblurring using hybrid imaging, Proc. IEEE Conference on Computer Vision and Pattern Recognition (2003)
- et al., Video super-resolution via sparse combinations of key-frame patches in a compression context, Proc. Picture Coding Symposium (2013)
- et al., Superresolution and noise filtering using moving least squares, IEEE Trans. Image Process. (2006)
- et al., Super-resolution of video using key frames and motion estimation, Proc. IEEE International Conference on Image Processing (2008)
- et al., A non-local algorithm for image denoising, Proc. IEEE Conference on Computer Vision and Pattern Recognition (2005)
- Super-resolution through neighbor embedding, Proc. IEEE Conference on Computer Vision and Pattern Recognition
- New frame rate up-conversion using bi-directional motion estimation, IEEE Trans. Consum. Electron.
- Image denoising by sparse 3-D transform-domain collaborative filtering, IEEE Trans. Image Process.
- Sparse representation-based multiple frame video super-resolution, IEEE Trans. Image Process.
- Nonlocally centralized sparse representation for image restoration, IEEE Trans. Image Process.
- Lanczos filtering in one and two dimensions, J. Appl. Meteorol.
- Fast and robust multiframe super resolution, IEEE Trans. Image Process.
- Example-based super-resolution, IEEE Comput. Graph. Appl.
- Frequency domain based super-resolution method for mixed-resolution multiview images, J. Syst. Eng. Electron.
- Adaptive luminance adjustment and neighborhood spreading strength information based view synthesis, J. Syst. Eng. Electron.
- Adaptive joint nonlocal means denoising back projection for image super resolution, Proc. IEEE International Conference on Image Processing
- Super resolution for multiview images using depth information, IEEE Trans. Circuits Syst. Video Technol.
- Nonlocal operators with applications to image processing, Multiscale Model. Simul.
- Bidirectional recurrent convolutional networks for multi-frame super-resolution, IEEE Trans. Pattern Anal. Mach. Intell.
- Video super-resolution using codebooks derived from key-frames, IEEE Trans. Circuits Syst. Video Technol.
- Virtual view assisted video super-resolution and enhancement, IEEE Trans. Circuits Syst. Video Technol.
- Super-resolution of compressed videos using convolutional neural networks, Proc. IEEE International Conference on Image Processing
- Accurate image super-resolution using very deep convolutional networks, Proc. IEEE Conference on Computer Vision and Pattern Recognition