Skip to main content

Multi-granularity Transformer for Image Super-Resolution

  • Conference paper
  • First Online:
Computer Vision – ACCV 2022 (ACCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13843))

Included in the following conference series:

  • 992 Accesses

Abstract

Recently, transformers have made great success in computer vision. Thus far, most of those works focus on high-level tasks, e.g., image classification and object detection, and fewer attempts were made to solve low-level problems. In this work, we tackle image super-resolution. Specifically, transformer architectures with multi-granularity transformer groups are explored for complementary information interaction, to improve the accuracy of super-resolution. We exploit three transformer patterns, i.e., the window transformers, dilated transformers and global transformers. We further investigate the combination of them and propose a Multi-granularity Transformer (MugFormer). Specifically, the window transformer layer is aggregated with other transformer layers to compose three transformer groups, namely, Local Transformer Group, Dilated Transformer Group and Global Transformer Group, which efficiently aggregate both local and global information for accurate reconstruction. Extensive experiments on five benchmark datasets demonstrate that our MugFormer performs favorably against state-of-the-art methods in terms of both quantitative and qualitative results.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: a video vision transformer. arXiv preprint arXiv:2103.15691 (2021)

  2. Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In: Proceedings of the British Machine Vision Conference (2012)

    Google Scholar 

  3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13

    Chapter  Google Scholar 

  4. Chen, B., et al.: GLiT: neural architecture search for global and local image transformer. In: Proceedings of the IEEE International Conference on Computer Vision (2021)

    Google Scholar 

  5. Chen, H., et al.: Pre-trained image processing transformer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12299–12310 (2021)

    Google Scholar 

  6. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018)

    Article  Google Scholar 

  7. Chu, X., et al.: Twins: revisiting spatial attention design in vision transformers. In: Proceedings of the Advances in Neural Information Processing Systems (2021)

    Google Scholar 

  8. Chu, X., et al.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)

  9. Dai, T., Cai, J., Zhang, Y., Xia, S.T., Zhang, L.: Second-order attention network for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11065–11074 (2019)

    Google Scholar 

  10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  11. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 184–199. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_13

    Chapter  Google Scholar 

  12. Dong, W., Zhang, L., Shi, G., Li, X.: Nonlocally centralized sparse representation for image restoration. IEEE Trans. Image Process. 22(4), 1620–1630 (2012)

    Article  MathSciNet  MATH  Google Scholar 

  13. Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. In: Proceedings of the International Conference on Learning Representation (2021)

    Google Scholar 

  14. Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 15(12), 3736–3745 (2006)

    Article  MathSciNet  Google Scholar 

  15. Haris, M., Shakhnarovich, G., Ukita, N.: Task-driven super resolution: object detection in low-resolution images. arXiv preprint arXiv:1803.11316 (2018)

  16. Haris, M., Shakhnarovich, G., Ukita, N.: Deep back-projection networks for super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1664–1673 (2018)

    Google Scholar 

  17. Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206 (2015)

    Google Scholar 

  18. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654 (2016)

    Google Scholar 

  19. Li, Y., Mao, H., Girshick, R., He, K.: Exploring plain vision transformer backbones for object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. LNCS, vol 13669, pp. 280–296. Springer, Cham. https://doi.org/10.1007/978-3-031-20077-9_17

  20. Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: SwinIR: image restoration using swin transformer. arXiv preprint arXiv:2108.10257 (2021)

  21. Lim, B., Son, S., Kim, H., Nah, S., Mu Lee, K.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, pp. 136–144 (2017)

    Google Scholar 

  22. Liu, D., Wen, B., Fan, Y., Loy, C.C., Huang, T.S.: Non-local recurrent network for image restoration. In: Proceedings of the Advance in Neural Information Processing Systems (2018)

    Google Scholar 

  23. Liu, J., Zhang, W., Tang, Y., Tang, J., Wu, G.: Residual feature aggregation network for image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2359–2368 (2020)

    Google Scholar 

  24. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)

    Google Scholar 

  25. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2, pp. 416–423. IEEE (2001)

    Google Scholar 

  26. Matsui, Y., et al.: Sketch-based manga retrieval using manga109 dataset. J. Multimed. Tools Appl. 76(20), 21811–21838 (2017)

    Article  Google Scholar 

  27. Mei, Y., Fan, Y., Zhou, Y.: Image super-resolution with non-local sparse attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3517–3526 (2021)

    Google Scholar 

  28. Mei, Y., Fan, Y., Zhou, Y., Huang, L., Huang, T.S., Shi, H.: Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5690–5699 (2020)

    Google Scholar 

  29. Niu, B., et al.: Single image super-resolution via a holistic attention network. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 191–207. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_12

    Chapter  Google Scholar 

  30. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)

  31. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28

    Chapter  Google Scholar 

  32. Shi, W., et al.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883 (2016)

    Google Scholar 

  33. Timofte, R., Agustsson, E., Van Gool, L., Yang, M.H., Zhang, L.: NTIRE 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, pp. 114–125 (2017)

    Google Scholar 

  34. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: Proceedings of the International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)

    Google Scholar 

  35. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

    Google Scholar 

  36. Wang, H., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: Max-deeplab: end-to-end panoptic segmentation with mask transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5463–5474 (2021)

    Google Scholar 

  37. Wang, L., Li, D., Zhu, Y., Tian, L., Shan, Y.: Dual super-resolution learning for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3774–3783 (2020)

    Google Scholar 

  38. Wang, W., et al.: PVTv 2: improved baselines with pyramid vision transformer. arXiv preprint arXiv:2106.13797 (2021)

  39. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE International Conference on Computer Vision (2021)

    Google Scholar 

  40. Wang, W., et al.: Scene text image super-resolution in the wild. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 650–666. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_38

    Chapter  Google Scholar 

  41. Wang, W., Yao, L., Chen, L., Cai, D., He, X., Liu, W.: CrossFormer: a versatile vision transformer based on cross-scale attention. arXiv preprint arXiv:2108.00154 (2021)

  42. Wang, X., et al.: ESRGAN: enhanced super-resolution generative adversarial networks. In: Proceedings of the European Conference on Computer Vision Workshop (2018)

    Google Scholar 

  43. Wang, Z., Cun, X., Bao, J., Liu, J.: UFormer: a general U-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106 (2021)

  44. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203 (2021)

  45. Yang, J., et al.: Focal self-attention for local-global interactions in vision transformers. In: Proceedings of the Advances in Neural Information Processing Systems (2021)

    Google Scholar 

  46. Zamir, S.W., et al.: Learning enriched features for real image restoration and enhancement. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 492–511. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58595-2_30

    Chapter  Google Scholar 

  47. Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: Boissonnat, J.-D., Chenin, P., Cohen, A., Gout, C., Lyche, T., Mazure, M.-L., Schumaker, L. (eds.) Curves and Surfaces 2010. LNCS, vol. 6920, pp. 711–730. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27413-8_47

    Chapter  Google Scholar 

  48. Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European Conference on Computer Vision, pp. 286–301 (2018)

    Google Scholar 

  49. Zhang, Y., Li, K., Li, K., Zhong, B., Fu, Y.: Residual non-local attention networks for image restoration. In: Proceedings of the International Conference on Learning Representation (2019)

    Google Scholar 

  50. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2472–2481 (2018)

    Google Scholar 

  51. Zhao, D., Li, J., Li, H., Xu, L.: Hybrid local-global transformer for image dehazing. arXiv preprint arXiv:2109.07100 (2021)

  52. Zhao, H., Kong, X., He, J., Qiao, Yu., Dong, C.: Efficient image super-resolution using pixel attention. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12537, pp. 56–72. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-67070-2_3

    Chapter  Google Scholar 

  53. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)

Download references

Acknowledgement

The research was partially supported by the Natural Science Foundation of China, No. 62106036, and the Fundamental Research Funds for the Central University of China, DUT21RC(3)026.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xu Jia .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1780 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Zhuge, Y., Jia, X. (2023). Multi-granularity Transformer for Image Super-Resolution. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13843. Springer, Cham. https://doi.org/10.1007/978-3-031-26313-2_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-26313-2_9

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26312-5

  • Online ISBN: 978-3-031-26313-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics