Abstract
We present an encoder-decoder attention transformer, EDAFormer, which consists of the Embedding-Free Transformer (EFT) encoder and an all-attention decoder, both leveraging our Embedding-Free Attention (EFA) structure. The proposed EFA is a novel global context modeling mechanism that focuses on the global non-linearity of attention rather than on the specific roles of the query, key and value. For the decoder, we explore a structure optimized for capturing global context, which improves semantic segmentation performance. In addition, we propose a novel Inference Spatial Reduction (ISR) method for computational efficiency. Unlike previous spatial reduction attention methods, our ISR method further reduces the key-value resolution at the inference phase, which narrows the computation-performance trade-off for efficient semantic segmentation. Our EDAFormer achieves state-of-the-art performance with efficient computation compared to existing transformer-based semantic segmentation models on three public benchmarks: ADE20K, Cityscapes and COCO-Stuff. Furthermore, our ISR method reduces the computational cost by up to 61% with minimal mIoU degradation on the Cityscapes dataset. The code is available at https://github.com/hyunwoo137/EDAFormer.
H. Yu, Y. Cho, B. Kang, S. Moon and K. Kong—Equal Contribution.
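To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch written from the abstract alone. It is not the authors' implementation (see the linked repository for that), and every function name, shape and default here is an assumption. The first function drops the learned query/key/value projections so that the softmax over raw token similarities supplies the global non-linearity, in the spirit of EFA; the second pools the key-value tokens at test time to cut attention cost while leaving the query (and thus output) resolution untouched, in the spirit of ISR.

```python
# Minimal sketch of the two ideas named in the abstract.
# Written from the abstract alone: these are NOT the authors'
# modules, and all names, shapes and defaults are assumptions.
# Official code: https://github.com/hyunwoo137/EDAFormer.
import torch
import torch.nn.functional as F


def embedding_free_attention(x: torch.Tensor, num_heads: int = 8) -> torch.Tensor:
    """Global attention without learned query/key/value projections.

    The softmax over raw token-to-token similarities supplies the
    global non-linearity, in the spirit of EFA.  x: (B, N, C).
    """
    b, n, c = x.shape
    d = c // num_heads
    h = x.reshape(b, n, num_heads, d).transpose(1, 2)       # (B, H, N, d)
    attn = F.softmax(h @ h.transpose(-2, -1) / d ** 0.5, dim=-1)
    out = attn @ h                                          # (B, H, N, d)
    return out.transpose(1, 2).reshape(b, n, c)


def isr_attention(q: torch.Tensor, kv: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Inference-time key-value reduction, in the spirit of ISR.

    Pooling the key-value map by r at test time shrinks the N x M
    attention matrix, while the query count N (and so the output
    resolution) is unchanged.  q: (B, N, C); kv: (B, C, H, W).
    """
    kv = F.avg_pool2d(kv, kernel_size=r)                    # (B, C, H/r, W/r)
    kv = kv.flatten(2).transpose(1, 2)                      # (B, M, C), M = HW / r^2
    c = q.shape[-1]
    attn = F.softmax(q @ kv.transpose(-2, -1) / c ** 0.5, dim=-1)
    return attn @ kv                                        # (B, N, C)


# Toy shapes: a 64x64 feature map with 256 channels.
x = torch.randn(2, 64 * 64, 256)
y = embedding_free_attention(x)                             # (2, 4096, 256)
z = isr_attention(x, torch.randn(2, 256, 64, 64), r=2)      # 4x fewer key-value tokens
```

With r = 2 the key-value token count drops by a factor of four, so the attention matrix and its matmuls shrink by the same factor, which is the source of the computation savings the abstract reports.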
Acknowledgements
This work was supported by Samsung Electronics Co., Ltd (IO201218-08232-01), by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00414230), by the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2023-00260091) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation) under the MSIT (Ministry of Science and ICT), Korea, and by the National Supercomputing Center with supercomputing resources including technical support (KSC-2023-CRE-0444).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yu, H., Cho, Y., Kang, B., Moon, S., Kong, K., Kang, S.J. (2025). Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol. 15100. Springer, Cham. https://doi.org/10.1007/978-3-031-72946-1_6
DOI: https://doi.org/10.1007/978-3-031-72946-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72945-4
Online ISBN: 978-3-031-72946-1