Abstract
Vision transformers have demonstrated promising results and have become core components in many tasks. Most existing works focus on context feature extraction and incorporate spatial information through additional positional embeddings. However, they only consider the local positional information within each image token and cannot effectively model the global spatial relations of the underlying scene. To address this challenge, we propose an efficient vision transformer architecture, SpatialFormer, with explicit spatial understanding for generalizable image representation learning. Specifically, we accompany the image tokens with adaptive spatial tokens to represent the context and spatial information, respectively. We initialize the spatial tokens with positional encoding to introduce general spatial priors and augment them with learnable embeddings to model adaptive spatial information. For better generalization, we employ a decoder-only overall architecture and propose a bilateral cross-attention block for efficient interactions between context and spatial tokens. SpatialFormer learns transferable image representations with explicit scene understanding, and the output spatial tokens can further serve as enhanced initial queries for task-specific decoders, enabling better adaptation to downstream tasks. Extensive experiments on image classification, semantic segmentation, and 2D/3D object detection demonstrate the efficiency and transferability of the proposed SpatialFormer architecture. Code is available at https://github.com/Euphoria16/SpatialFormer.
H. Xiao and W. Zheng—Equal contributions.
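The abstract outlines two architectural ingredients: spatial tokens formed from a positional-encoding prior plus learnable embeddings, and a bilateral cross-attention block in which context (image) tokens and spatial tokens attend to each other. The official implementation is the repository linked above; the snippet below is only a minimal PyTorch sketch of how such a block could look. All module names, token counts, and hyperparameters (BilateralCrossAttention, SpatialTokens, the 1D sinusoidal prior, etc.) are illustrative assumptions, not the paper's actual design.

```python
# Minimal PyTorch sketch (not the authors' code) of the ideas summarized in the
# abstract: spatial tokens initialized from a sinusoidal positional prior plus a
# learnable adaptive embedding, and a bilateral cross-attention block in which
# context and spatial tokens attend to each other. Names are illustrative.
import math
import torch
import torch.nn as nn


def sinusoidal_encoding(num_tokens: int, dim: int) -> torch.Tensor:
    """Standard 1D sine/cosine positional encoding, used here as a spatial prior."""
    position = torch.arange(num_tokens).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    enc = torch.zeros(num_tokens, dim)
    enc[:, 0::2] = torch.sin(position * div_term)
    enc[:, 1::2] = torch.cos(position * div_term)
    return enc


class SpatialTokens(nn.Module):
    """Spatial tokens = fixed sinusoidal prior + learnable adaptive embedding."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.register_buffer("prior", sinusoidal_encoding(num_tokens, dim))
        self.adaptive = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self, batch_size: int) -> torch.Tensor:
        return (self.prior + self.adaptive).unsqueeze(0).expand(batch_size, -1, -1)


class BilateralCrossAttention(nn.Module):
    """One block: context tokens attend to spatial tokens and vice versa."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.ctx_to_spa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spa_to_ctx = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ctx = nn.LayerNorm(dim)
        self.norm_spa = nn.LayerNorm(dim)

    def forward(self, ctx: torch.Tensor, spa: torch.Tensor):
        # Context tokens query the spatial tokens ...
        ctx = ctx + self.ctx_to_spa(self.norm_ctx(ctx), spa, spa, need_weights=False)[0]
        # ... and spatial tokens query the updated context tokens.
        spa = spa + self.spa_to_ctx(self.norm_spa(spa), ctx, ctx, need_weights=False)[0]
        return ctx, spa


if __name__ == "__main__":
    B, N_CTX, N_SPA, D = 2, 196, 49, 256          # assumed token counts and width
    ctx = torch.randn(B, N_CTX, D)                 # image (context) tokens
    spa = SpatialTokens(N_SPA, D)(B)               # spatial tokens with prior
    ctx, spa = BilateralCrossAttention(D)(ctx, spa)
    print(ctx.shape, spa.shape)                    # (2, 196, 256) and (2, 49, 256)
```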
Acknowledgement
This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFB280690, and in part by the National Natural Science Foundation of China under Grant 62321005, Grant 62336004, and Grant 62125603.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xiao, H., Zheng, W., Zuo, S., Gao, P., Zhou, J., Lu, J. (2025). SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72623-1
Online ISBN: 978-3-031-72624-8
eBook Packages: Computer Science, Computer Science (R0)