
Signal Processing

Volume 189, December 2021, 108274

ESKN: Enhanced selective kernel network for single image super-resolution

https://doi.org/10.1016/j.sigpro.2021.108274

Highlights

  • An enhanced selective kernel module (ESKM) and a symmetric connection scheme (SCS) are proposed for high quality SISR.

  • We integrate a re-calibration process to adjust the weights for different filters, so that insignificant but important features can be better characterized.

  • We utilize a spatial attention module to adjust the low-level features for more consistent/stable fusion with the corresponding high-level features.

Abstract

For single image super-resolution (SISR), one recent research direction is to build an effective multi-scale context extraction pipeline via parallel convolutional streams. Although very competitive SR performance has been achieved, effective solutions for extracting and integrating multi-scale context remain under-explored. We propose an enhanced selective kernel module (ESKM) to address this challenging problem and build a network that achieves high-quality SISR. The key idea of the proposed ESKM is to perform self-learned, filter-oriented weight re-calibration to better extract insignificant but important features that are critical for high-accuracy SISR. Moreover, we replace the Softmax operation with Sigmoid for more flexible weight learning and remove the dimension reduction/expansion component to build a direct correspondence between channels and their weights. We also design a symmetric connection scheme (SCS) to better fuse the hierarchical features extracted at different convolutional stages. More specifically, the low-level features are adjusted via a spatial attention module to achieve more effective fusion with the high-level semantic features. We then stack multiple ESKMs via the SCS to build our new network, named Enhanced Selective Kernel Network (ESKN). Extensive experimental results demonstrate the effectiveness of the proposed ESKN model, which outperforms state-of-the-art SISR methods in terms of both restoration quality and network complexity.

Introduction

Single image super-resolution (SISR) aims at reconstructing a high-resolution (HR) image with abundant details and textures from its low-resolution (LR) version. It provides an effective technique to increase the spatial resolution of optical sensors and thus has attracted considerable attention from both the academic and industrial communities. Recently, many machine learning-based SISR algorithms have been developed. However, SISR remains a challenging ill-posed problem, as one specific LR input can correspond to many possible HR versions and the mapping space is too vast to explore.

In the past few years, convolutional neural networks (CNNs) have achieved great performance in SISR, and a recent research direction toward better super-resolution is to design more sophisticated modules or pipelines. In particular, extracting multi-scale context via parallel convolutional streams has been actively studied [1], [2], [3], and such methods have achieved very competitive SR performance. However, most of these approaches employ a simple concatenation layer followed by a convolutional layer to linearly aggregate the features coming from multiple branches [4]. Such linear aggregation may result in networks with insufficient adaptation capacity. The selective kernel module (SKM) was proposed to integrate multi-scale features by computing input-specific importance scores via a channel attention mechanism and a Softmax operation [4]. However, the channel-wise weights are computed from the extracted features after the global average pooling (GAP) operation, so the standard channel attention mechanism cannot satisfactorily boost insignificant but important features. Moreover, Softmax imposes a competition between the multi-scale branches, which might be sub-optimal for SISR tasks.
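To make this baseline concrete, the following PyTorch-style sketch illustrates SKM-style fusion of two parallel branches as described above (GAP descriptor, reduction/expansion layers, and Softmax across branches). The kernel sizes, channel width, and reduction ratio are illustrative assumptions, not the exact configuration of [4].

    import torch
    import torch.nn as nn

    class SKFusion(nn.Module):
        """Minimal selective-kernel-style fusion of two parallel branches.

        Channel-wise importance scores are computed from a globally pooled
        descriptor, squeezed/expanded, and normalized with Softmax across the
        branches, so the branches compete for each channel.
        """
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
            self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)
            self.gap = nn.AdaptiveAvgPool2d(1)
            self.reduce = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True))
            # one 1x1 expansion per branch produces that branch's channel scores
            self.expand3 = nn.Conv2d(channels // reduction, channels, 1)
            self.expand5 = nn.Conv2d(channels // reduction, channels, 1)

        def forward(self, x):
            f3, f5 = self.branch3(x), self.branch5(x)
            s = self.reduce(self.gap(f3 + f5))             # fused global descriptor
            w = torch.stack([self.expand3(s), self.expand5(s)], dim=0)
            w = torch.softmax(w, dim=0)                    # competition across branches
            return w[0] * f3 + w[1] * f5                   # weighted fusion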

Deeper networks have been proven effective in SISR. However, as the depth increases, the network usually suffers from training difficulty and limited performance gain [5]. The underlying cause of this phenomenon is a lack of long-term memory [6], i.e., high-level features from later layers do not retain low-level information from earlier layers in the pipeline. Many methods have been developed to overcome this issue by forwarding low-level features to subsequent layers via skip connections [1], [5], [6], [7]. To avoid neglecting useful low-level features, a simple and natural strategy is to pass the features extracted in every layer to the end of the network, as proposed in [1], [7]. However, concatenating all these features at the end of the pipeline produces a large amount of redundant information, and the computational cost of the subsequent convolution increases significantly with the number of modules. In addition, most skip connection-based methods integrate low-level and high-level features by directly adding/concatenating them together, neglecting the difference between local information (low-level) and semantic information (high-level). In fact, a pixel in the high-level features corresponds to a region of pixels in the low-level features.

To tackle these critical issues, we first design an enhanced feature extraction module (i.e., ESKM) based on the selective kernel module (SKM) [4]. Since the channel-wise weights are calculated from the extracted features after applying the global average pooling (GAP) operation, the weights corresponding to insignificant but important features might be very small. Some of the learned filters extract important local structures (e.g., textures or details); these filters and their corresponding channels are important whether the associated features carry strong or weak signals. The key idea of the proposed ESKM is to integrate a filter-oriented weight re-calibration process that computes extra weights for different filters, so that insignificant but important features (e.g., textures or details), which are critical for high-accuracy SISR, can be better extracted. To further improve the performance of ESKM, we replace the Softmax function with Sigmoid and remove the dimension reduction/expansion to preserve important information for the subsequent channel attention-based feature re-calibration. This enhanced selective kernel module (ESKM) allows our SISR model to generate highly discriminative features for high-quality SISR by emphasizing the important features. Secondly, we propose a novel connection scheme, named the symmetric connection scheme (SCS), which adds low-level features to the corresponding high-level features at the symmetric position. Since the spatial information encoded in the low-level features is very different from the semantic information encoded in the high-level features, simple addition may cause an unstable training process. As an effective remedy, the low-level features with rich spatial information are first adjusted by a spatial attention module before being added to the extracted high-level features, which further emphasizes their consistency and enables effective fusion of hierarchical features. Compared with existing connection schemes, SCS makes better use of the spatial information encoded in low-level features, improves the gradient flow, and supports more consistent fusion of hierarchical features. Thus, it reduces the training difficulty of the network and enhances the super-resolution results. By stacking a sequence of ESKMs via the SCS, we propose an enhanced selective kernel network (ESKN) to better extract and aggregate multi-scale features and ease the training difficulty. Fig. 1 shows the architecture of the proposed ESKN.
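The sketch below (same notation and assumptions as the SKM sketch above) shows how the ESKM changes described in this paragraph alter the fusion step: Sigmoid gates instead of Softmax, no dimension reduction/expansion, and an extra filter-oriented re-calibration term. Since this excerpt does not spell out how the filter-oriented weights are computed, they are modelled here as learnable per-filter parameters, which is an assumption rather than the paper's exact design.

    import torch
    import torch.nn as nn

    class ESKMFusion(nn.Module):
        """Sketch of ESKM-style fusion (not the exact paper module).

        Differences from the SKM sketch:
          * Sigmoid gates instead of Softmax, so the branches do not compete.
          * No reduction/expansion: a single FC-equivalent 1x1 conv maps the
            pooled descriptor directly to one weight per channel.
          * An extra filter-oriented re-calibration term; its exact computation
            is not given in this excerpt, so it is modelled as a learnable
            per-filter (per-channel) parameter added before the Sigmoid.
        """
        def __init__(self, channels):
            super().__init__()
            self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
            self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)
            self.gap = nn.AdaptiveAvgPool2d(1)
            self.score3 = nn.Conv2d(channels, channels, 1)  # single FC-equivalent layer
            self.score5 = nn.Conv2d(channels, channels, 1)
            # assumed filter-oriented re-calibration weights (one scalar per filter)
            self.recal3 = nn.Parameter(torch.zeros(1, channels, 1, 1))
            self.recal5 = nn.Parameter(torch.zeros(1, channels, 1, 1))

        def forward(self, x):
            f3, f5 = self.branch3(x), self.branch5(x)
            s = self.gap(f3 + f5)
            w3 = torch.sigmoid(self.score3(s) + self.recal3)  # independent gates
            w5 = torch.sigmoid(self.score5(s) + self.recal5)
            return w3 * f3 + w5 * f5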

This work has the following three main contributions.

  • We design an enhanced selective kernel module (ESKM) based on SKM. The most significant improvement is to integrate a re-calibration process that adjusts the weights for different filters, so that insignificant but important features (e.g., textures or details) can be better characterized. Moreover, we replace the Softmax operation with a Sigmoid operation for more flexible weight learning, and remove the dimension reduction/expansion component to build a direct correspondence between channels and their weights. This ESKM can better adaptively fuse the multi-scale features extracted from multiple branches with different kernel sizes.

  • To better utilize the hierarchical features extracted by ESKMs, we design a symmetric connection scheme (SCS) that passes the low-level features via skip connections and fuses them with the high-level features at the symmetric position. Different from existing skip connection-based designs, which directly add/concatenate low-level and high-level features, we add an extra spatial attention module to adjust the low-level features for more consistent and stable fusion with the corresponding high-level features (see the sketch after this list). This connection scheme makes better use of low-level features, improves the gradient flow, and hence enhances the network performance.

  • By integrating ESKMs using the SCS, we compose a compact but powerful network (ESKN) for high-quality SISR. The proposed ESKN model shows superior performance over state-of-the-art SISR methods [7], [8] on multiple benchmark datasets, achieving more accurate image restoration with fewer parameters.
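As a rough illustration of the symmetric connection scheme in the second contribution above, the sketch below gates low-level features with a simple spatial attention mask before adding them to the high-level features at the symmetric position. The gate design and the fusion helper are illustrative assumptions, not the paper's exact layout.

    import torch
    import torch.nn as nn

    class SpatialGate(nn.Module):
        """Simple spatial attention used to adjust low-level features before
        they are added to the symmetric high-level features (illustrative
        design; the paper's exact spatial attention layout may differ)."""
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

        def forward(self, low):
            avg = low.mean(dim=1, keepdim=True)          # channel-wise average map
            mx, _ = low.max(dim=1, keepdim=True)         # channel-wise max map
            mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
            return low * mask                            # spatially re-weighted low-level features

    def fuse_symmetric(low, high, gate):
        """Add a spatially gated low-level feature to the high-level feature
        extracted at the symmetric (mirrored) stage of the module stack."""
        return high + gate(low)

In the full model, each low-level feature would be paired with the high-level feature produced at the mirrored stage of the pipeline, so the gated addition simply replaces the plain identity skip used in direct-addition schemes.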

The remainder of this paper is organized as follows. We first review related deep learning-based SISR methods in Section 2. Section 3 then elaborates the key components of our ESKN model and the implementation settings. We evaluate our model and conduct qualitative and quantitative comparisons with state-of-the-art methods in Section 4, and conclude the paper in Section 5.


Related work

Over the past decades, developing effective SISR techniques to reconstruct an HR image from its corresponding single LR version has attracted extensive attention from both the academic and industrial communities. Recently, CNN-based approaches have achieved state-of-the-art performance in SISR; therefore, we mainly focus on reviewing CNN-based methods.

Approach

Fig. 1 shows the pipeline of our proposed ESKN model. ESKN is composed of three sub-networks: an initial feature extraction sub-network (IFENet) to learn feature maps from the low-resolution input I_LR, a feature mapping sub-network (FMNet) to transform low-level features into high-level ones, and a reconstruction sub-network (RNet) to reconstruct the super-resolved high-resolution image I_SR. The core of ESKN is the FMNet, which is designed to learn more informative features for SR. The FMNet is
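Since this snippet is truncated, the following skeleton only reflects the three-sub-network layout named above (IFENet, FMNet, RNet); the block internals, module count, global skip connection, and pixel-shuffle upsampler are assumptions for illustration, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class ESKNSkeleton(nn.Module):
        """High-level skeleton of the IFENet -> FMNet -> RNet layout."""
        def __init__(self, channels=64, n_modules=8, scale=2):
            super().__init__()
            self.ifenet = nn.Conv2d(3, channels, 3, padding=1)        # IFENet: shallow feature extraction
            # FMNet: in the real model these blocks would be ESKMs linked by the SCS
            self.fmnet = nn.Sequential(*[
                nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                              nn.ReLU(inplace=True))
                for _ in range(n_modules)])
            self.rnet = nn.Sequential(                                # RNet: reconstruction/upsampling
                nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
                nn.PixelShuffle(scale))

        def forward(self, i_lr):
            shallow = self.ifenet(i_lr)
            deep = self.fmnet(shallow)
            return self.rnet(deep + shallow)   # global skip connection is an assumption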

Datasets and metrics

Training. Like most recent SISR methods, we use 800 training images from the DIVerse 2K resolution image dataset (DIV2K) [40] to train our ESKN. In each training batch, 16 LR RGB patches of size 48×48 and the corresponding HR patches are randomly cropped. They are then randomly augmented by horizontal or vertical flips and 90° rotations. We pre-process all images by subtracting the mean RGB value of the DIV2K dataset.
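A minimal sketch of this patch sampling and augmentation pipeline is given below; it assumes images are already loaded as float32 HWC arrays in [0, 1], and the DIV2K mean shown is a commonly quoted approximation that should be treated as a placeholder.

    import random
    import numpy as np

    # Placeholder DIV2K RGB mean (values in [0, 1]); substitute the exact dataset mean.
    DIV2K_MEAN = np.array([0.4488, 0.4371, 0.4040], dtype=np.float32)

    def sample_patch_pair(lr, hr, scale, lr_size=48):
        """Randomly crop an aligned LR/HR patch pair and apply random
        horizontal/vertical flips, 90-degree rotations, and mean subtraction."""
        h, w, _ = lr.shape
        x = random.randint(0, w - lr_size)
        y = random.randint(0, h - lr_size)
        lr_patch = lr[y:y + lr_size, x:x + lr_size]
        hr_patch = hr[y * scale:(y + lr_size) * scale,
                      x * scale:(x + lr_size) * scale]
        if random.random() < 0.5:                           # horizontal flip
            lr_patch, hr_patch = lr_patch[:, ::-1], hr_patch[:, ::-1]
        if random.random() < 0.5:                           # vertical flip
            lr_patch, hr_patch = lr_patch[::-1], hr_patch[::-1]
        k = random.randint(0, 3)                            # random 90-degree rotation
        lr_patch, hr_patch = np.rot90(lr_patch, k), np.rot90(hr_patch, k)
        return lr_patch - DIV2K_MEAN, hr_patch - DIV2K_MEAN  # mean subtraction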

Testing. Five commonly used public benchmark datasets

Conclusions

We present a novel enhanced selective kernel network (ESKN) for single image super-resolution. Specifically, we design an enhanced selective kernel module (ESKM) by revising the selective kernel module (SKM), introduced in [4] for image classification. For ESKM, we (1) introduce a new self-learned filter-oriented weight, (2) use Sigmoid to avoid the unnecessary competition introduced by the Softmax operation in SKM, and (3) simplify the two FC layers to one to avoid hyper-parameter tuning and

CRediT authorship contribution statement

Zewei He: Conceptualization, Methodology, Software, Writing – original draft. Guizhong Fu: Software, Data curation. Yanpeng Cao: Writing – review & editing, Formal analysis, Funding acquisition. Yanlong Cao: Supervision, Project administration. Jiangxin Yang: Supervision, Project administration. Xin Li: Conceptualization, Writing – original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (2020YFB1711400) and the National Natural Science Foundation of China (52075485).

References (42)

  • Y. Cao et al., Fast and accurate single image super-resolution via an energy-aware improved deep residual network, Signal Process. (2019)
  • J. Li et al., Multi-scale residual network for image super-resolution, ECCV (2018)
  • Z. He et al., MRFN: multi-receptive-field network for fast and accurate single image super-resolution, IEEE Trans. Multimedia (2020)
  • Y. Hu et al., Single image super-resolution via cascaded multi-scale cross network, arXiv preprint (2018)
  • X. Li et al., Selective kernel networks, CVPR (2019)
  • Y. Tai et al., MemNet: a persistent memory network for image restoration, ICCV (2017)
  • Y. Zhang et al., Residual dense network for image super-resolution, CVPR (2018)
  • B. Lim et al., Enhanced deep residual networks for single image super-resolution, CVPR Workshop (2017)
  • C. Dong et al., Learning a deep convolutional network for image super-resolution, ECCV (2014)
  • C. Dong et al., Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • J. Kim et al., Accurate image super-resolution using very deep convolutional networks, CVPR (2016)
  • J. Kim et al., Deeply-recursive convolutional network for image super-resolution, CVPR (2016)
  • Y. Tai et al., Image super-resolution via deep recursive residual network, CVPR (2017)
  • C. Dong et al., Accelerating the super-resolution convolutional neural network, ECCV (2016)
  • W. Shi et al., Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, CVPR (2016)
  • W.-S. Lai et al., Deep Laplacian pyramid networks for fast and accurate super-resolution, CVPR (2017)
  • W.-S. Lai et al., Deep Laplacian pyramid networks for fast and accurate super-resolution, IEEE Trans. Pattern Anal. Mach. Intell. (2019)
  • C. Ledig et al., Photo-realistic single image super-resolution using a generative adversarial network, CVPR (2017)
  • T. Tong et al., Image super-resolution using dense skip connections, ICCV (2017)
  • K. He et al., Deep residual learning for image recognition, CVPR (2016)
  • R. Timofte et al., NTIRE 2017 challenge on single image super-resolution: methods and results, CVPR Workshop (2017)