Full length article
Multi-scale attention network for image super-resolution

https://doi.org/10.1016/j.jvcir.2021.103300

Highlights

  • This work addresses image super-resolution using fewer parameters and calculations.

  • Multi-scale cross block for capturing multi-scale and multi-level features.

  • Multi-path wide-activated attention block for mutual communication among channels.

Abstract

Convolutional neural networks (CNNs) have demonstrated clear advantages in super-resolution. However, many CNN-based methods require large models to achieve superior performance, making them difficult to deploy in practical settings with limited memory. To efficiently balance model complexity and performance, we propose a multi-scale attention network (MSAN) built by cascading multiple multi-scale attention blocks (MSAB), each of which integrates a multi-scale cross block (MSCB) and a multi-path wide-activated attention block (MWAB). Specifically, MSCB first connects three parallel convolutions with different dilation rates hierarchically to aggregate features at different levels and scales. MWAB then splits the channel features from MSCB into three portions to further improve performance. Rather than being treated equally and independently, each portion is responsible for a specific function, enabling internal communication among channels. Experimental results show that our MSAN outperforms most state-of-the-art methods while using relatively few parameters and Mult-Adds.

Introduction

For decades, image super-resolution (SR) has been applied to reconstruct high-resolution (HR) images from low-resolution (LR) images. The task is an inherently ill-posed inverse problem, since multiple HR images can correspond to the same LR image. Many SR methods have been proposed to address this problem, including interpolation-based [1], reconstruction-based [2], and learning-based [3], [4] methods.

Deep learning has recently played a significant role in image SR and has delivered impressive reconstruction performance thanks to its robust nonlinear mapping. Dong et al. [5] first constructed a three-layer convolutional neural network (CNN) for SR, achieving better performance than conventional SR algorithms. Kim et al. [6] presented a very deep convolutional network for SR (VDSR) by increasing the network depth to 20 layers, which yielded a significant improvement. Since then, progressively more CNN-based approaches have been developed to further improve SR results. Increasing the network depth can indeed enhance reconstruction accuracy; however, it inevitably introduces more parameters and heavier computational costs. For example, the enhanced deep SR network (EDSR) [7] and the residual dense network (RDN) [8] were very deep networks with 43M and 22M parameters, respectively. The residual channel attention network (RCAN) [9] has more than 400 convolutional layers and about 15.59M parameters. All these approaches exhibited strong SR performance, but their excessive parameter counts hinder practical application.

To reduce model parameters, some architectures turned to recursive mechanisms and lightweight designs. The deeply recursive convolutional network (DRCN) [10] and the deep recursive residual network (DRRN) [11] applied a set of recursive blocks to decrease the number of network parameters. Although the recursive strategy demonstrated favorable results with fewer parameters, it still incurred a large memory footprint. More recently, lightweight methods were developed to reach a better trade-off between performance and model size. The cascading residual network (CARN) [12] adopted a cascading residual architecture with group convolution to reduce parameters, but it sacrificed accuracy. Chu et al. [13] used neural architecture search to find lightweight networks; nevertheless, performance is limited by the constraints of the search space. Despite these promising results, there is still much room to improve the trade-off between reconstruction accuracy and model capacity.

Moreover, broadening the network width to enlarge the receptive field is also an effective way to boost performance. Li et al. [14] built a multi-scale residual network (MSRN) to obtain image features at different scales. Hu et al. [15] proposed a deep cascaded multi-scale cross network (CMSC) with multi-scale cross modules to fuse multi-scale information. Both MSRN and CMSC adopted convolution kernels of different sizes to extract multi-scale information. However, the use of large convolution kernels dramatically increases model size.

Additionally, attention mechanisms have been widely used in various computer vision tasks [16], [17]. They allow networks to focus on more valuable features, thereby enhancing representational capability. Motivated by [16], Zhang et al. [9] designed a channel attention mechanism in RCAN that rescales channel-wise features. The attention-based DenseNet with residual deconvolution (ADRD) [18] and the residual feature aggregation network (RFANet) [19] learned spatial context with spatial attention modules. In particular, channel and spatial attention were jointly combined in the residual attention SR network (SRRAM) [20] and the multi-path adaptive modulation network (MAMNet) [21], exploring both inter- and intra-channel relationships. Nevertheless, because the attention mechanism models interdependencies across all channels, directly applying it to image SR introduces unnecessary parameters and computations. How to build a compact network that balances performance and capacity remains to be explored.
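To make the channel-rescaling idea concrete, the following is a minimal PyTorch sketch of squeeze-and-excitation style channel attention in the spirit of [16], [9]; the reduction ratio and the 1x1 convolutions are illustrative assumptions, not the exact settings of those works.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative sketch).

    Global average pooling 'squeezes' each channel to a scalar descriptor;
    two 1x1 convolutions with a bottleneck then produce per-channel weights
    that rescale the input feature map.
    """
    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumed value
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # squeeze: B x C x 1 x 1
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                 # excitation: weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.body(x)                           # rescale channel-wise features
```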

To alleviate the above issues, we propose a lightweight and efficient multi-scale attention network (MSAN), whose overall architecture is depicted in Fig. 2. MSAN consists of a shallow feature extraction part, a sequence of multi-scale attention blocks (MSAB), and a reconstruction part. The MSABs are adaptively cascaded to infer informative features in a coarse-to-fine manner. Each MSAB is primarily composed of a multi-scale cross block (MSCB) and a multi-path wide-activated attention block (MWAB). MSCB connects three parallel convolutions with different dilation rates hierarchically; this connection effectively extracts multi-scale and multi-level features while expanding the receptive field. In particular, dilated convolutions replace large-kernel convolutions, reducing the number of model parameters. To yield more expressive feature representations, MWAB splits the features produced by MSCB into three uneven portions, each of which is in charge of a particular function. These functions establish mutual communication among channels and facilitate diversified feature outputs. Moreover, they operate on partial rather than whole channels, further decreasing the number of parameters and calculations. Compared with state-of-the-art SR networks, our MSAN demonstrates higher performance as well as lower computation time, as shown in Fig. 1.
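As an illustration of the MSCB idea described above, the PyTorch sketch below connects three dilated 3x3 convolutions and fuses their outputs; the dilation rates, the exact hierarchical wiring, and the 1x1 fusion convolution are assumptions made for this sketch rather than the settings defined in Section 3.

```python
import torch
import torch.nn as nn

class MSCBSketch(nn.Module):
    """Rough sketch of a multi-scale cross block (MSCB).

    Three 3x3 convolutions with different dilation rates see different
    receptive fields; each branch also receives the previous branch's output
    (a hierarchical connection), and a 1x1 convolution fuses the concatenated
    branch outputs. All specific choices here are illustrative assumptions.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 3)                            # assumed dilation rates
        ])
        self.act = nn.ReLU(inplace=True)
        self.fuse = nn.Conv2d(3 * channels, channels, 1)  # assumed 1x1 fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs, prev = [], 0
        for conv in self.branches:
            prev = self.act(conv(x + prev))               # each branch sees the input plus the previous level
            outs.append(prev)
        return x + self.fuse(torch.cat(outs, dim=1))      # local residual connection
```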

In summary, our main contributions are threefold:

  • We propose a lightweight yet efficient MSAN that utilizes a set of MSABs for more accurate image SR. Thanks to the MSAB design combining MSCB and MWAB, we obtain better results with far fewer parameters and Mult-Adds.

  • We propose an MSCB, based on hierarchical connections among three parallel convolutions, that can fully learn multi-scale and multi-level features. The three parallel convolutions with different dilation rates effectively enlarge the receptive field while reducing parameter overhead.

  • We design an MWAB to achieve internal communication among channel features, further improving performance and speeding up training. The channel features are divided into three uneven portions, each of which goes through a distinctive pathway, i.e., original spatial (OS), spatial attention (SA), or channel attention (CA); a rough sketch of this split follows the list.
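The minimal PyTorch sketch below illustrates the MWAB idea of splitting wide-activated features into OS, SA, and CA paths; the split ratio, the expansion factor, and the particular attention implementations are assumptions for illustration, not the exact design given in Section 3.

```python
import torch
import torch.nn as nn

class MWABSketch(nn.Module):
    """Rough sketch of a multi-path wide-activated attention block (MWAB).

    A wide-activation step expands then shrinks the channels; the result is
    split unevenly into three groups that pass through an identity (original
    spatial, OS), a spatial attention (SA), and a channel attention (CA) path.
    The split ratio, expansion factor, and attention designs are assumptions.
    """
    def __init__(self, channels: int = 48, expansion: int = 4):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * expansion, 1)        # wide activation
        self.shrink = nn.Conv2d(channels * expansion, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)
        # assumed uneven split: half OS, a quarter SA, the rest CA
        self.splits = (channels // 2, channels // 4,
                       channels - channels // 2 - channels // 4)
        c_sa, c_ca = self.splits[1], self.splits[2]
        self.sa = nn.Sequential(nn.Conv2d(c_sa, 1, 3, padding=1), nn.Sigmoid())  # spatial mask
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(c_ca, c_ca, 1), nn.Sigmoid())          # channel weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.shrink(self.act(self.expand(x)))
        f_os, f_sa, f_ca = torch.split(feat, self.splits, dim=1)
        out = torch.cat([f_os, f_sa * self.sa(f_sa), f_ca * self.ca(f_ca)], dim=1)
        return x + out                                                    # residual connection
```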

The rest of this study is organized as follows: Section 2 reviews the CNN-based and attention-based methods in SR tasks. Section 3 describes our proposed MSAN in detail. Section 4 provides model analysis and comparative experiments with other methods. We conclude our work in Section 5.

Section snippets

Single image super-resolution

CNN approaches based on deep learning have become a key technology for addressing image SR tasks. As pioneers, Dong et al. [5] introduced a three-layer CNN, named SRCNN, to learn the non-linear mapping between LR and HR images. Inspired by this strategy, Kim et al. [6] stacked 20 convolutional layers to increase the network depth and achieved a significant performance improvement. Later on, a series of CNN-based works was mostly devoted to deepening the network to boost reconstruction performance.

Overall network architecture

The overall architecture of our proposed MSAN is shown in Fig. 2. Similar to most SR network architectures [6], [20], MSAN can be partitioned into three parts: (1) a shallow feature extraction part, (2) a chained stack of MSABs, and (3) a reconstruction part. We denote $I_{LR} \in \mathbb{R}^{H \times W \times C}$ as the LR input image and $I_{HR} \in \mathbb{R}^{rH \times rW \times C}$ as the corresponding HR output image, where $H$ and $W$ are the height and width of the input image, respectively, $r$ is the scale factor, and $C$ denotes the number of channels.
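A skeleton of this three-part pipeline might look as follows in PyTorch; the sub-pixel (pixel shuffle) reconstruction layer, the feature width, the block count, and the plain residual block standing in for the MSAB are all placeholder assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def plain_block(channels: int) -> nn.Module:
    """Stand-in for an MSAB (the real block would combine an MSCB and an MWAB)."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, 3, padding=1),
    )

class MSANSketch(nn.Module):
    """Skeleton of the three-part MSAN pipeline (a sketch, not the authors' code)."""
    def __init__(self, scale: int = 2, in_ch: int = 3, feats: int = 48, n_blocks: int = 8):
        super().__init__()
        self.head = nn.Conv2d(in_ch, feats, 3, padding=1)                        # shallow feature extraction
        self.body = nn.ModuleList([plain_block(feats) for _ in range(n_blocks)]) # chained MSABs (stand-ins)
        self.tail = nn.Sequential(                                               # reconstruction part
            nn.Conv2d(feats, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),                                              # assumed sub-pixel upsampling
        )

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        feat = self.head(lr)
        res = feat
        for blk in self.body:
            res = res + blk(res)          # each block progressively refines the features
        return self.tail(feat + res)      # global residual, then upsample H x W -> rH x rW

# Usage: a x2 model maps a 1 x 3 x 48 x 48 LR patch to 1 x 3 x 96 x 96.
if __name__ == "__main__":
    print(MSANSketch(scale=2)(torch.randn(1, 3, 48, 48)).shape)  # torch.Size([1, 3, 96, 96])
```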

Datasets and metrics

Following recent methods [9], [20], we train our proposed MSAN on 800 high-quality images from the DIV2K [32] dataset. For testing, we evaluate on five standard benchmark datasets: Set5 [33], Set14 [34], B100 [35], Urban100 [36], and Manga109 [37]. To demonstrate the superiority of our MSAN, we adopt three common degradation models: bicubic (BI), blur-downscale (BD), and downscale-noise (DN). The SR results are evaluated by peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
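For reference, PSNR is computed as 10·log10(MAX²/MSE). The NumPy sketch below follows the common SR convention of cropping a border (often equal to the scale factor) before evaluation; whether this paper uses exactly that protocol is an assumption here.

```python
import numpy as np

def psnr(sr: np.ndarray, hr: np.ndarray, crop: int = 0, max_val: float = 255.0) -> float:
    """PSNR between a super-resolved image and its ground truth.

    Images are H x W x C arrays with values in [0, max_val]. `crop` pixels are
    removed from each border before computing the mean squared error.
    """
    if crop > 0:
        sr = sr[crop:-crop, crop:-crop]
        hr = hr[crop:-crop, crop:-crop]
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                       # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)    # PSNR in dB
```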

Conclusion

In this study, we propose an efficient and lightweight MSAN for single image SR. We implement a sequence of MSABs as the backbone of MSAN to progressively refine diversified information. Within each MSAB, the efficient combination of MSCB and MWAB greatly benefits SR performance while also accelerating training. Specifically, the MSCB is designed to hierarchically connect parallel dilated convolutions to capture features at different scales and levels as well as to enlarge the receptive field.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 51979085, 61903124).

References (45)

  • Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention...
  • J. Kim, et al., Deeply-recursive convolutional network for image super-resolution
  • Y. Tai, et al., Image super-resolution via deep recursive residual network
  • N. Ahn, B. Kang, K.A. Sohn, Fast, accurate, and lightweight super-resolution with cascading residual network, in:...
  • X. Chu, et al., Fast, accurate and lightweight super-resolution with neural architecture search
  • J. Li, F. Fang, K. Mei, G. Zhang, Multi-scale residual network for image super-resolution, in: Proceedings of the...
  • Y. Hu, et al., Single image super-resolution via cascaded multi-scale cross network (2018)
  • J. Hu, L. Shen, G. Sun, S. Albanie, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer...
  • S. Woo, J. Park, J.Y. Lee, I.S. Kweon, CBAM: Convolutional block attention module, in: Proceedings of the European...
  • Z. Li, Image super-resolution using attention based DenseNet with residual deconvolution (2019)
  • J. Liu, W. Zhang, Y. Tang, J. Tang, G. Wu, Residual feature aggregation network for image super-resolution, in:...
  • J.H. Kim, et al., RAM: Residual attention module for single image super-resolution (2018)

    This paper has been recommended for acceptance by Zicheng Liu.
