SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding

Xiao, Han; Zheng, Wenzhao; Zuo, Sicheng; Gao, Peng; Zhou, Jie; Lu, Jiwen

doi:10.1007/978-3-031-72624-8_3

Han Xiao¹³,¹⁴, Wenzhao Zheng¹³,¹⁵, Sicheng Zuo¹³, Peng Gao¹⁴, Jie Zhou¹³ & Jiwen Lu¹³

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15071)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Vision transformers have demonstrated promising results and become core components in many tasks. Most existing works focus on context feature extraction and incorporate spatial information through additional positional embedding. However, they only consider the local positional information within each image token and cannot effectively model the global spatial relations of the underlying scene. To address this challenge, we propose an efficient vision transformer architecture, SpatialFormer, with explicit spatial understanding for generalizable image representation learning. Specifically, we accompany the image tokens with adaptive spatial tokens to represent the context and spatial information respectively. We initialize the spatial tokens with positional encoding to introduce general spatial priors and augment them with learnable embeddings to model adaptive spatial information. For better generalization, we employ a decoder-only overall architecture and propose a bilateral cross-attention block for efficient interactions between context and spatial tokens. SpatialFormer learns transferable image representations with explicit scene understanding, where the output spatial tokens can further serve as enhanced initial queries for task-specific decoders for better adaptations to downstream tasks. Extensive experiments on image classification, semantic segmentation, and 2D/3D object detection tasks demonstrate the efficiency and transferability of the proposed SpatialFormer architecture. Code is available at https://github.com/Euphoria16/SpatialFormer.
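The abstract describes two mechanisms that a small example can make concrete: spatial tokens initialized from a positional encoding plus a learnable embedding, and a bilateral cross-attention block in which context and spatial tokens exchange information. The following is a minimal sketch under stated assumptions, not the authors' implementation (see the linked repository for that). The abstract does not specify the encoding type, token counts, projections, or normalization, so the 1D sinusoidal encoding, the zero-initialized learnable offset, the single-head projection-free attention, the residual connection, and every function name below are illustrative choices.

```python
import math
import random


def sincos_position_encoding(num_tokens, dim):
    """Fixed 1D sinusoidal encoding, one row per token (assumed encoding type)."""
    enc = []
    for pos in range(num_tokens):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        enc.append(row)
    return enc


def init_spatial_tokens(num_tokens, dim, learnable=None):
    """Spatial token = fixed positional prior + learnable offset.

    The offset would be a trained parameter; here it defaults to zeros
    (an assumption made only to keep the sketch deterministic).
    """
    prior = sincos_position_encoding(num_tokens, dim)
    if learnable is None:
        learnable = [[0.0] * dim for _ in range(num_tokens)]
    return [[p + l for p, l in zip(pr, lr)] for pr, lr in zip(prior, learnable)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention with a residual.

    No learned Q/K/V projections are applied (simplification).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        attended = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def bilateral_cross_attention(context, spatial):
    """Each token set queries the other; neither attends to itself."""
    new_context = cross_attend(context, spatial)
    new_spatial = cross_attend(spatial, context)
    return new_context, new_spatial


random.seed(0)
dim = 8
# 16 context tokens stand in for a 4x4 grid of image patch features.
context_tokens = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(16)]
spatial_tokens = init_spatial_tokens(num_tokens=4, dim=dim)

context_out, spatial_out = bilateral_cross_attention(context_tokens, spatial_tokens)
print(len(context_out), len(spatial_out), len(spatial_out[0]))  # 16 4 8
```

In this sketch the attention cost grows with the product of the two token counts rather than the square of the context token count, which is one plausible reading of why such a block could be efficient when spatial tokens are few. The abstract itself does not state this.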

H. Xiao and W. Zheng—Equal contributions.

Acknowledgement

This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFB280690, and in part by the National Natural Science Foundation of China under Grant 62321005, Grant 62336004, and Grant 62125603.

Author information

Authors and Affiliations

Tsinghua University, Beijing, China
Han Xiao, Wenzhao Zheng, Sicheng Zuo, Jie Zhou & Jiwen Lu
Shanghai AI Laboratory, Shanghai, China
Han Xiao & Peng Gao
UC Berkeley, Berkeley, USA
Wenzhao Zheng

Corresponding author

Correspondence to Jiwen Lu.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

About this paper

Cite this paper

Xiao, H., Zheng, W., Zuo, S., Gao, P., Zhou, J., Lu, J. (2025). SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_3

DOI: https://doi.org/10.1007/978-3-031-72624-8_3
Published: 26 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72623-1
Online ISBN: 978-3-031-72624-8
eBook Packages: Computer Science, Computer Science (R0)
