Abstract
Few-shot semantic segmentation (FSS) methods based on meta-learning strategies have shown promise in extracting instance knowledge from the support set to infer pixel-wise labels in the query set. However, a key challenge in FSS is addressing the spatial inconsistency between the query image and the support image caused by intra-class differences and inter-class similarity. Moreover, existing FSS methods often rely on multiple decoding methods for differentiated pixel-wise matching, leading to semantic inconsistency. To tackle these issues, we propose a similarity aggregation network (SANet), which effectively explores visual correspondence between support and query features while aligning semantic dimensions. Specifically, SANet introduces a mask attention module (MAM) to capture spatial relations between non-local attention features from support features and query features. Additionally, a similarity aggregation module (SAM) is proposed, which uses the multi-head attention mechanism combined with a prior mask to calculate the aggregated similarity between each query pixel and all support pixels, thereby focusing the network on foreground areas. Finally, a feature fusion module (FFM) adaptively fuses features at multiple scales and channels for accurate prediction. Extensive experiments on PASCAL-5i and COCO-20i demonstrate the efficiency and competitiveness of SANet.
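The idea behind the SAM, scoring each query pixel by its attention-weighted similarity to all support pixels and a mask, can be illustrated with a minimal, dependency-free sketch. This is not the SANet implementation: the function name `aggregate_similarity`, the absence of learned projections, the plain channel split into heads, the averaging over heads, and the use of a binary support mask in place of the prior mask are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_similarity(query_feats, support_feats, support_mask, num_heads=2):
    """Return one foreground score in [0, 1] per query pixel (hypothetical helper).

    query_feats:   list of Nq feature vectors (each a list of C floats)
    support_feats: list of Ns feature vectors (each a list of C floats)
    support_mask:  list of Ns values, 1.0 for foreground and 0.0 for background
    """
    channels = len(query_feats[0])
    if channels % num_heads != 0:
        raise ValueError("channel count must be divisible by num_heads")
    head_dim = channels // num_heads
    scale = 1.0 / math.sqrt(head_dim)

    scores_per_pixel = []
    for q in query_feats:
        head_scores = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            # Scaled dot product between this query pixel and every support pixel.
            logits = [
                scale * sum(a * b for a, b in zip(q[lo:hi], s[lo:hi]))
                for s in support_feats
            ]
            weights = softmax(logits)
            # Attention-weighted average of the support mask values.
            head_scores.append(sum(w * m for w, m in zip(weights, support_mask)))
        # Assumption: heads are combined by a simple mean.
        scores_per_pixel.append(sum(head_scores) / num_heads)
    return scores_per_pixel


# Toy example: two support pixels (one foreground, one background)
# and two query pixels that each match one of them.
support_feats = [[4.0, 0.0, 4.0, 0.0], [0.0, 4.0, 0.0, 4.0]]
support_mask = [1.0, 0.0]
query_feats = [[4.0, 0.0, 4.0, 0.0], [0.0, 4.0, 0.0, 4.0]]
scores = aggregate_similarity(query_feats, support_feats, support_mask)
```

In this toy case the first query pixel, which matches the foreground support pixel, receives a score close to 1, while the second receives a score close to 0. A full model would presumably apply learned query/key projections and operate on multi-scale feature maps rather than hand-written vectors.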
Data Availability and Access
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Author information
Contributions
Minrui Ye: Conceptualization, Methodology, Software, Data curation, Writing - Original draft preparation. Tao Zhang: Supervision, Validation, Writing, Project administration.
Ethics declarations
Conflict of Interest/Competing Interests
The authors declare that they have no conflicts of interest or competing interests relevant to the content of this manuscript.
Ethics Approval and Consent to Participate
Not applicable.
Consent for Publication
Consent for publication was obtained from all individuals included in this study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ye, M., Zhang, T. SANet: similarity aggregation and semantic fusion for few-shot semantic segmentation. Appl Intell 55, 119 (2025). https://doi.org/10.1007/s10489-024-05986-x
DOI: https://doi.org/10.1007/s10489-024-05986-x