Abstract
Referring image segmentation has attracted extensive attention in recent years. Previous methods have explored the difficult alignment between visual and textual features, but the interaction between the two modalities remains insufficient, which limits model performance. To this end, we propose a language-aware pixel feature fusion module (LPFFM) based on the self-attention mechanism, which ensures that the features of the two modalities interact sufficiently along both the spatial and channel dimensions. We apply this module from the shallow to the deep layers of the encoder to progressively select visual features relevant to the text. We then propose a second selection mechanism that further retains only the visual features belonging to the target, and design an attention contrastive loss for this mechanism to better suppress irrelevant background information. Building on these components, we propose a multi-scale deep feature selection fusion network (MDSFNet) based on the U-Net architecture. Experimental results show that our method is competitive with previous approaches, improving performance by 2.87%, 3.17%, and 3.81% on the three benchmark datasets RefCOCO, RefCOCO+, and G-Ref, respectively.
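As a rough illustration of the core idea behind language-aware fusion (pixels attending to word features and absorbing a per-pixel language context), the following NumPy sketch shows one plausible form of such a cross-modal attention step. This is not the paper's implementation; all function names, shapes, and the residual-fusion choice are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_aware_fusion(visual, textual):
    """Hypothetical cross-modal fusion: pixels attend to words.

    visual:  (HW, C) flattened pixel features
    textual: (T, C)  word features
    Returns language-aware pixel features of shape (HW, C).
    """
    C = visual.shape[1]
    # Similarity between every pixel and every word, scaled as in self-attention.
    scores = visual @ textual.T / np.sqrt(C)   # (HW, T)
    attn = softmax(scores, axis=-1)            # attention weights over words
    language_context = attn @ textual          # (HW, C) per-pixel language context
    # Residual fusion keeps the original visual signal.
    return visual + language_context

# Illustrative usage with random features (shapes are arbitrary).
vis = np.random.rand(16, 8)   # 4x4 feature map flattened, 8 channels
txt = np.random.rand(5, 8)    # 5 words, 8 channels
fused = language_aware_fusion(vis, txt)   # shape (16, 8)
```

In an encoder such as the one described above, a module of this kind would be applied at several scales, so that shallow layers filter coarse language-relevant regions and deeper layers refine the selection.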
Data Availability
The datasets generated and/or analysed during the current study are available in the COCO 2014 repository, https://cocodataset.org/#download.
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (Nos. U21A20518, 61976086, and 62106071).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest, and they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dai, X., Lin, J., Nai, K. et al. Multiscale deep feature selection fusion network for referring image segmentation. Multimed Tools Appl 83, 36287–36305 (2024). https://doi.org/10.1007/s11042-023-16913-6