Pyramid Swin Transformer for Multi-task: Expanding to More Computer Vision Tasks

Wang, Chenyu; Endo, Toshio; Hirofuchi, Takahiro; Ikegami, Tsutomu

doi:10.1007/978-3-031-45382-3_5

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14124))

Included in the following conference series:

International Conference on Advanced Concepts for Intelligent Vision Systems

202 Accesses

Abstract

We presented the Pyramid Swin Transformer, a versatile and efficient architecture tailored for object detection and image classification. This time we applied it to a wider range of tasks, such as object detection, image classification, semantic segmentation, and video recognition tasks. Our architecture adeptly captures local and global contextual information by employing more shift window operations and integrating diverse window sizes. The Pyramid Swin Transformer for Multi-task is structured in four stages, each consisting of layers with varying window sizes, facilitating a robust hierarchical representation. Different numbers of layers with distinct windows and window sizes are utilized at the same scale. Our architecture has been extensively evaluated on multiple benchmarks, including achieving 85.4% top-1 accuracy on ImageNet for image classification, 51.6 \(AP^{box}\) with Mask R-CNN and 54.3 \(AP^{box}\) with Cascade Mask R-CNN on COCO for object detection, 49.0 mIoU on ADE20K for semantic segmentation, and 83.4% top-1 accuracy on Kinetics-400 for video recognition. The Pyramid Swin Transformer for Multi-task outperforms state-of-the-art models in all tasks, demonstrating its effectiveness, adaptability, and scalability across various vision tasks. This breakthrough in multi-task learning architecture opens the door to new research and applications in the field.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 74.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Ali, A., et al.: XCiT: cross-covariance image transformers. Adv. Neural. Inf. Process. Syst. 34, 20014–20027 (2021)
Google Scholar
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: a video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846 (2021)
Google Scholar
Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML, vol. 2, p. 4 (2021)
Google Scholar
Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018)
Google Scholar
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chapter Google Scholar
Contributors, M.: MMSegmentation: OpenMMlab semantic segmentation toolbox and benchmark (2020). https://github.com/open-mmlab/mmsegmentation
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Google Scholar
Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Li, Y., et al.: MViTv 2: improved multiscale vision transformers for classification and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4804–4814 (2022)
Google Scholar
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Chapter Google Scholar
Liu, Z., et al.: Swin transformer v2: scaling up capacity and resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12009–12019 (2022)
Google Scholar
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
Liu, Z., et al.: Video swin transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3202–3211 (2022)
Google Scholar
Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3163–3172 (2021)
Google Scholar
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
Article MathSciNet Google Scholar
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International conference on machine learning, pp. 10347–10357. PMLR (2021)
Google Scholar
Wang, C., Endo, T., Hirofuchi, T., Ikegami, T.: Pyramid swin transformer: different-size windows swin transformer for image classification and object detection. In: Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 583–590 (2023). https://doi.org/10.5220/0011675800003417
Wang, W., et al.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
Google Scholar
Wang, W., et al.: PVT v2: improved baselines with pyramid vision transformer. Comput. Vis. Media 8(3), 415–424 (2022)
Article Google Scholar
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Google Scholar
Zhang, P., et al.: Multi-scale vision longformer: a new vision transformer for high-resolution image encoding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2998–3008 (2021)
Google Scholar
Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vision 127, 302–321 (2019)
Article Google Scholar

Download references

Acknowledgements

This work was partly supported by JSPS KAKENHI Grant Number 20H04165.

Author information

Authors and Affiliations

Tokyo Institute of Technology, Tokyo, Japan
Chenyu Wang & Toshio Endo
National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
Chenyu Wang, Takahiro Hirofuchi & Tsutomu Ikegami

Authors

Chenyu Wang
View author publications
You can also search for this author in PubMed Google Scholar
Toshio Endo
View author publications
You can also search for this author in PubMed Google Scholar
Takahiro Hirofuchi
View author publications
You can also search for this author in PubMed Google Scholar
Tsutomu Ikegami
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Chenyu Wang .

Editor information

Editors and Affiliations

DGA TA, Toulouse, France
Jaques Blanc-Talon
University of Auckland, Auckland, New Zealand
Patrice Delmas
Ghent University, Ghent, Belgium
Wilfried Philips
University of Antwerp, Wilrijk, Belgium
Paul Scheunders

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wang, C., Endo, T., Hirofuchi, T., Ikegami, T. (2023). Pyramid Swin Transformer for Multi-task: Expanding to More Computer Vision Tasks. In: Blanc-Talon, J., Delmas, P., Philips, W., Scheunders, P. (eds) Advanced Concepts for Intelligent Vision Systems. ACIVS 2023. Lecture Notes in Computer Science, vol 14124. Springer, Cham. https://doi.org/10.1007/978-3-031-45382-3_5

Download citation

DOI: https://doi.org/10.1007/978-3-031-45382-3_5
Published: 14 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45381-6
Online ISBN: 978-3-031-45382-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Pyramid Swin Transformer for Multi-task: Expanding to More Computer Vision Tasks