
Vision-Aware Language Reasoning for Referring Image Segmentation

Published in Neural Processing Letters (2023)

Abstract

Referring image segmentation is a joint multimodal task that aims to segment the object indicated by a natural language expression from its paired image. However, the diversity of language annotations tends to cause semantic ambiguity, which makes the semantic representation produced by language feature encoding imprecise. Existing methods do not correct the language encoding module, so semantic errors in the language features cannot be remedied during subsequent processing, resulting in semantic deviation. To this end, we propose a vision-aware language reasoning model. Intuitively, the segmentation result can guide the reconstruction of the language features, a process that can be expressed as tree-structured recursion. Specifically, we design a language reasoning encoding module and a mask loopback optimization module to optimize the language encoding tree, with the feature weights of the tree nodes learned through backpropagation. To overcome the problem that traditional attention modules easily introduce noise when matching local language words to visual regions, we use global language prior information to compute the importance of each word and then use these importances to weight the visual region features, which is embodied as a language-aware vision attention module. Experimental results on four benchmark datasets show that the proposed method achieves performance improvements.
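To make the attention mechanism described above concrete, below is a minimal PyTorch sketch of the language-aware vision attention idea: a global sentence-level feature scores the importance of each word, and the importance-weighted language query then re-weights the visual region features. All module, tensor, and dimension names here are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of language-aware vision attention, assuming PyTorch.
# Names and dimensions are hypothetical; the paper's exact architecture
# may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageAwareVisionAttention(nn.Module):
    def __init__(self, lang_dim: int, vis_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.global_proj = nn.Linear(lang_dim, hidden_dim)  # global sentence prior
        self.word_proj = nn.Linear(lang_dim, hidden_dim)    # per-word features
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)      # visual regions

    def forward(self, words, global_feat, regions):
        # words:       (B, T, lang_dim)  per-word language features
        # global_feat: (B, lang_dim)     global sentence-level feature
        # regions:     (B, N, vis_dim)   visual region features (e.g., flattened H*W)

        # 1. Score each word against the global language prior, rather than
        #    letting individual words attend to visual regions directly.
        g = self.global_proj(global_feat).unsqueeze(1)         # (B, 1, H)
        w = self.word_proj(words)                              # (B, T, H)
        word_scores = (g * w).sum(-1) / w.size(-1) ** 0.5      # (B, T)
        word_weights = F.softmax(word_scores, dim=-1)          # word importance

        # 2. Pool word features with their importance weights into one query.
        lang_query = (word_weights.unsqueeze(-1) * w).sum(1)   # (B, H)

        # 3. Re-weight visual region features by their relevance to the query.
        v = self.vis_proj(regions)                             # (B, N, H)
        region_scores = (v * lang_query.unsqueeze(1)).sum(-1)  # (B, N)
        region_weights = torch.sigmoid(region_scores)          # (B, N)
        return regions * region_weights.unsqueeze(-1)          # (B, N, vis_dim)


# Toy usage with random features standing in for real encoder outputs.
attn = LanguageAwareVisionAttention(lang_dim=768, vis_dim=512)
words = torch.randn(2, 12, 768)          # a 12-word referring expression
global_feat = words.mean(dim=1)          # simple stand-in for the global prior
regions = torch.randn(2, 196, 512)       # a 14x14 feature map, flattened
out = attn(words, global_feat, regions)  # -> (2, 196, 512)
```

In the full model, such re-weighted region features would feed the segmentation decoder, and the predicted mask would in turn drive the mask loopback optimization of the language encoding tree.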




Author information


Contributions

FX completed the experiments and the writing; BL provided guidance on the innovation and methods and revised the paper; CZ provided guidance on the methods and revised the paper; LX completed the formalization of the formulas; MP and BL supplemented and improved the experiments. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Bing Luo or Chao Zhang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, F., Luo, B., Zhang, C. et al. Vision-Aware Language Reasoning for Referring Image Segmentation. Neural Process Lett 55, 11313–11331 (2023). https://doi.org/10.1007/s11063-023-11377-z

