
Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We present an Encoder-Decoder Attention Transformer, EDAFormer, which consists of the Embedding-Free Transformer (EFT) encoder and an all-attention decoder, both leveraging our Embedding-Free Attention (EFA) structure. The proposed EFA is a novel global context modeling mechanism that focuses on capturing global non-linearity rather than on the specific roles of the query, key, and value. For the decoder, we explore an optimized structure that exploits this globality, which improves semantic segmentation performance. In addition, we propose a novel Inference Spatial Reduction (ISR) method for computational efficiency. Unlike previous spatial reduction attention methods, our ISR method further reduces the key-value resolution at the inference phase, which mitigates the computation-performance trade-off for efficient semantic segmentation. Our EDAFormer achieves state-of-the-art performance with efficient computation compared to existing transformer-based semantic segmentation models on three public benchmarks: ADE20K, Cityscapes, and COCO-Stuff. Furthermore, our ISR method reduces the computational cost by up to 61% with minimal mIoU degradation on the Cityscapes dataset. The code is available at https://github.com/hyunwoo137/EDAFormer.
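To make the two ideas above concrete, here is a minimal PyTorch-style sketch of what attention without query/key/value embedding projections, combined with an inference-time key-value reduction ratio, could look like. This is an illustration under our own assumptions, not the authors' implementation (see the linked repository for the official code): the function name embedding_free_attention and the parameter kv_reduction are hypothetical, and we assume key-value spatial reduction via average pooling.

```python
# A minimal sketch, assuming: (i) the feature map itself serves as query, key,
# and value with no learned embedding projections ("embedding-free"), and
# (ii) the key-value map is spatially reduced by average pooling with a ratio
# that can simply be raised at inference time (the ISR idea). This is NOT the
# authors' code; see https://github.com/hyunwoo137/EDAFormer for the official
# implementation.
import torch
import torch.nn.functional as F


def embedding_free_attention(x: torch.Tensor, kv_reduction: int = 1) -> torch.Tensor:
    """x: (B, C, H, W) feature map; kv_reduction: key-value downsampling ratio."""
    b, c, h, w = x.shape
    # The feature itself acts as the query: (B, HW, C).
    q = x.flatten(2).transpose(1, 2)
    # Spatially reduce the key-value map; a larger ratio means a smaller
    # attention matrix and therefore less computation.
    kv = F.avg_pool2d(x, kv_reduction) if kv_reduction > 1 else x
    kv = kv.flatten(2).transpose(1, 2)  # (B, hw_reduced, C)
    # Scaled dot-product attention without any query/key/value projections.
    attn = torch.softmax(q @ kv.transpose(1, 2) / c**0.5, dim=-1)
    out = attn @ kv  # (B, HW, C)
    return out.transpose(1, 2).reshape(b, c, h, w)


x = torch.randn(1, 64, 32, 32)
y_full = embedding_free_attention(x, kv_reduction=1)  # training-time setting
y_fast = embedding_free_attention(x, kv_reduction=2)  # ISR-style inference:
# the attention matrix shrinks from (HW x HW) to (HW x HW/4), same output shape
assert y_full.shape == y_fast.shape == x.shape
```

Because no projection weights are tied to a fixed key-value resolution in this sketch, raising the reduction ratio at inference requires no retraining, which is plausibly the property that lets the paper's ISR trade a small amount of mIoU for a large cut in computation.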

H. Yu, Y. Cho, B. Kang, S. Moon, and K. Kong contributed equally.



Acknowledgements

This work was supported by Samsung Electronics Co., Ltd (IO201218-08232-01); the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00414230); the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2023-00260091) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation); and the National Supercomputing Center with supercomputing resources including technical support (KSC-2023-CRE-0444).

Author information

Corresponding author

Correspondence to Suk-Ju Kang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 5242 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yu, H., Cho, Y., Kang, B., Moon, S., Kong, K., Kang, S.J. (2025). Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15100. Springer, Cham. https://doi.org/10.1007/978-3-031-72946-1_6


  • DOI: https://doi.org/10.1007/978-3-031-72946-1_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72945-4

  • Online ISBN: 978-3-031-72946-1

  • eBook Packages: Computer Science, Computer Science (R0)
