Abstract
We previously introduced the Pyramid Swin Transformer, a versatile and efficient architecture for object detection and image classification. In this work, we extend it to a wider range of tasks: image classification, object detection, semantic segmentation, and video recognition. The architecture captures both local and global contextual information by employing additional shifted-window operations and integrating diverse window sizes. The Pyramid Swin Transformer for Multi-task is structured in four stages, each consisting of layers with varying window sizes, yielding a robust hierarchical representation; different numbers of layers with distinct windows and window sizes are used at the same scale. We evaluate the architecture extensively on multiple benchmarks, achieving 85.4% top-1 accuracy on ImageNet for image classification, 51.6 \(AP^{box}\) with Mask R-CNN and 54.3 \(AP^{box}\) with Cascade Mask R-CNN on COCO for object detection, 49.0 mIoU on ADE20K for semantic segmentation, and 83.4% top-1 accuracy on Kinetics-400 for video recognition. The Pyramid Swin Transformer for Multi-task outperforms state-of-the-art models on all of these tasks, demonstrating its effectiveness, adaptability, and scalability across vision tasks, and opening the door to new research and applications in multi-task learning.
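The four-stage layout described above can be sketched in a few lines. The following is a minimal, illustrative sketch only: the per-stage window sizes and the 4-pixel patch size are assumptions for the example, not the paper's exact configuration. It shows how, at each stage, layers with different window sizes partition the same feature map into non-overlapping attention windows, and how patch merging halves the resolution between stages.

```python
# Illustrative layout of a four-stage hierarchy with per-stage window sizes,
# in the spirit of the Pyramid Swin Transformer. Window sizes and patch size
# below are assumed example values, not the published configuration.

def stage_plan(feat_size, window_sizes):
    """For one stage, return (num_windows, tokens_per_window) per window size."""
    plan = []
    for w in window_sizes:
        assert feat_size % w == 0, "each window size must tile the feature map"
        n_windows = (feat_size // w) ** 2   # non-overlapping square windows
        plan.append((n_windows, w * w))     # attention is computed over w*w tokens
    return plan

def pyramid_plan(input_size=224, patch_size=4,
                 stage_windows=((7, 14), (7, 14), (7,), (7,))):
    """Window layout for all four stages; resolution halves between stages."""
    feat = input_size // patch_size          # 56x56 tokens for a 224x224 input
    plans = []
    for windows in stage_windows:
        plans.append(stage_plan(feat, windows))
        feat //= 2                           # patch-merging downsample
    return plans

plans = pyramid_plan()
# Stage 1 (56x56): 7x7 windows -> 64 windows of 49 tokens each,
#                  14x14 windows -> 16 windows of 196 tokens each.
# Stage 4 (7x7):   a single 7x7 window covers the whole map (global context).
```

Mixing several window sizes at the same scale is what lets one stage attend both locally (small windows) and more globally (large windows) without changing resolution.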
References
Ali, A., et al.: XCiT: cross-covariance image transformers. Adv. Neural. Inf. Process. Syst. 34, 20014–20027 (2021)
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: a video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846 (2021)
Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML, vol. 2, p. 4 (2021)
Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
MMSegmentation Contributors: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark (2020). https://github.com/open-mmlab/mmsegmentation
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Kay, W., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Li, Y., et al.: MViTv2: improved multiscale vision transformers for classification and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4804–4814 (2022)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, Z., et al.: Swin transformer v2: scaling up capacity and resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12009–12019 (2022)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
Liu, Z., et al.: Video swin transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3202–3211 (2022)
Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3163–3172 (2021)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)
Wang, C., Endo, T., Hirofuchi, T., Ikegami, T.: Pyramid swin transformer: different-size windows swin transformer for image classification and object detection. In: Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 583–590 (2023). https://doi.org/10.5220/0011675800003417
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
Wang, W., et al.: PVT v2: improved baselines with pyramid vision transformer. Comput. Vis. Media 8(3), 415–424 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Zhang, P., et al.: Multi-scale vision longformer: a new vision transformer for high-resolution image encoding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2998–3008 (2021)
Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vision 127, 302–321 (2019)
Acknowledgements
This work was partly supported by JSPS KAKENHI Grant Number 20H04165.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wang, C., Endo, T., Hirofuchi, T., Ikegami, T. (2023). Pyramid Swin Transformer for Multi-task: Expanding to More Computer Vision Tasks. In: Blanc-Talon, J., Delmas, P., Philips, W., Scheunders, P. (eds) Advanced Concepts for Intelligent Vision Systems. ACIVS 2023. Lecture Notes in Computer Science, vol 14124. Springer, Cham. https://doi.org/10.1007/978-3-031-45382-3_5
Print ISBN: 978-3-031-45381-6
Online ISBN: 978-3-031-45382-3