Skip to main content

Pyramid Swin Transformer for Multi-task: Expanding to More Computer Vision Tasks

  • Conference paper
  • First Online:
Advanced Concepts for Intelligent Vision Systems (ACIVS 2023)

Abstract

We presented the Pyramid Swin Transformer, a versatile and efficient architecture tailored for object detection and image classification. This time we applied it to a wider range of tasks, such as object detection, image classification, semantic segmentation, and video recognition tasks. Our architecture adeptly captures local and global contextual information by employing more shift window operations and integrating diverse window sizes. The Pyramid Swin Transformer for Multi-task is structured in four stages, each consisting of layers with varying window sizes, facilitating a robust hierarchical representation. Different numbers of layers with distinct windows and window sizes are utilized at the same scale. Our architecture has been extensively evaluated on multiple benchmarks, including achieving 85.4% top-1 accuracy on ImageNet for image classification, 51.6 \(AP^{box}\) with Mask R-CNN and 54.3 \(AP^{box}\) with Cascade Mask R-CNN on COCO for object detection, 49.0 mIoU on ADE20K for semantic segmentation, and 83.4% top-1 accuracy on Kinetics-400 for video recognition. The Pyramid Swin Transformer for Multi-task outperforms state-of-the-art models in all tasks, demonstrating its effectiveness, adaptability, and scalability across various vision tasks. This breakthrough in multi-task learning architecture opens the door to new research and applications in the field.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 59.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 74.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Ali, A., et al.: XCiT: cross-covariance image transformers. Adv. Neural. Inf. Process. Syst. 34, 20014–20027 (2021)

    Google Scholar 

  2. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: a video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846 (2021)

    Google Scholar 

  3. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML, vol. 2, p. 4 (2021)

    Google Scholar 

  4. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018)

    Google Scholar 

  5. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13

    Chapter  Google Scholar 

  6. Contributors, M.: MMSegmentation: OpenMMlab semantic segmentation toolbox and benchmark (2020). https://github.com/open-mmlab/mmsegmentation

  7. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

    Google Scholar 

  8. Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  9. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)

    Google Scholar 

  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

    Google Scholar 

  11. Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

  12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  13. Li, Y., et al.: MViTv 2: improved multiscale vision transformers for classification and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4804–4814 (2022)

    Google Scholar 

  14. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

    Chapter  Google Scholar 

  15. Liu, Z., et al.: Swin transformer v2: scaling up capacity and resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12009–12019 (2022)

    Google Scholar 

  16. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)

  17. Liu, Z., et al.: Video swin transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3202–3211 (2022)

    Google Scholar 

  18. Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3163–3172 (2021)

    Google Scholar 

  19. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)

    Article  MathSciNet  Google Scholar 

  20. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International conference on machine learning, pp. 10347–10357. PMLR (2021)

    Google Scholar 

  21. Wang, C., Endo, T., Hirofuchi, T., Ikegami, T.: Pyramid swin transformer: different-size windows swin transformer for image classification and object detection. In: Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 583–590 (2023). https://doi.org/10.5220/0011675800003417

  22. Wang, W., et al.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)

    Google Scholar 

  23. Wang, W., et al.: PVT v2: improved baselines with pyramid vision transformer. Comput. Vis. Media 8(3), 415–424 (2022)

    Article  Google Scholar 

  24. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)

    Google Scholar 

  25. Zhang, P., et al.: Multi-scale vision longformer: a new vision transformer for high-resolution image encoding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2998–3008 (2021)

    Google Scholar 

  26. Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vision 127, 302–321 (2019)

    Article  Google Scholar 

Download references

Acknowledgements

This work was partly supported by JSPS KAKENHI Grant Number 20H04165.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Chenyu Wang .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Wang, C., Endo, T., Hirofuchi, T., Ikegami, T. (2023). Pyramid Swin Transformer for Multi-task: Expanding to More Computer Vision Tasks. In: Blanc-Talon, J., Delmas, P., Philips, W., Scheunders, P. (eds) Advanced Concepts for Intelligent Vision Systems. ACIVS 2023. Lecture Notes in Computer Science, vol 14124. Springer, Cham. https://doi.org/10.1007/978-3-031-45382-3_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-45382-3_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45381-6

  • Online ISBN: 978-3-031-45382-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics