
Vision-Aware Language Reasoning for Referring Image Segmentation

Published in Neural Processing Letters (2023)

Abstract

Referring image segmentation is a joint multimodal task that aims to segment the object indicated by a natural language expression from its paired image. However, the diversity of language annotations tends to cause semantic ambiguity, which makes the semantic representation produced by language feature encoding imprecise. Existing methods do not correct the language encoding module, so semantic errors in the language features cannot be remedied during subsequent processing, resulting in semantic deviation. To this end, we propose a vision-aware language reasoning model. Intuitively, the segmentation result can guide the reconstruction of the language features, a process that can be expressed as tree-structured recursion. Specifically, we design a language reasoning encoding module and a mask loopback optimization module to optimize the language encoding tree, with the feature weights of the tree nodes learned through backpropagation. To overcome the problem that traditional attention modules easily introduce noise when matching local language words to visual regions, we use global language prior information to compute the importance of each word and then use these importances to weight the visual region features, which is embodied as a language-aware vision attention module. Experimental results on four benchmark datasets show that the proposed method achieves performance improvements.
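To make the attention mechanism described above concrete, below is a minimal PyTorch sketch of the language-aware vision attention idea: a global sentence-level feature scores the importance of each word, and the importance-weighted language query then re-weights the visual region features. All module, tensor, and dimension names here are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of language-aware vision attention, assuming PyTorch.
# Names and dimensions are hypothetical; the paper's exact architecture
# may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageAwareVisionAttention(nn.Module):
    def __init__(self, lang_dim: int, vis_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.global_proj = nn.Linear(lang_dim, hidden_dim)  # global sentence prior
        self.word_proj = nn.Linear(lang_dim, hidden_dim)    # per-word features
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)      # visual regions

    def forward(self, words, global_feat, regions):
        # words:       (B, T, lang_dim)  per-word language features
        # global_feat: (B, lang_dim)     global sentence-level feature
        # regions:     (B, N, vis_dim)   visual region features (e.g., flattened H*W)

        # 1. Score each word against the global language prior, rather than
        #    letting individual words attend to visual regions directly.
        g = self.global_proj(global_feat).unsqueeze(1)         # (B, 1, H)
        w = self.word_proj(words)                              # (B, T, H)
        word_scores = (g * w).sum(-1) / w.size(-1) ** 0.5      # (B, T)
        word_weights = F.softmax(word_scores, dim=-1)          # word importance

        # 2. Pool word features with their importance weights into one query.
        lang_query = (word_weights.unsqueeze(-1) * w).sum(1)   # (B, H)

        # 3. Re-weight visual region features by their relevance to the query.
        v = self.vis_proj(regions)                             # (B, N, H)
        region_scores = (v * lang_query.unsqueeze(1)).sum(-1)  # (B, N)
        region_weights = torch.sigmoid(region_scores)          # (B, N)
        return regions * region_weights.unsqueeze(-1)          # (B, N, vis_dim)


# Toy usage with random features standing in for real encoder outputs.
attn = LanguageAwareVisionAttention(lang_dim=768, vis_dim=512)
words = torch.randn(2, 12, 768)          # a 12-word referring expression
global_feat = words.mean(dim=1)          # simple stand-in for the global prior
regions = torch.randn(2, 196, 512)       # a 14x14 feature map, flattened
out = attn(words, global_feat, regions)  # -> (2, 196, 512)
```

In the full model, such re-weighted region features would feed the segmentation decoder, and the predicted mask would in turn drive the mask loopback optimization of the language encoding tree.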




Author information


Contributions

FX completed the experiments and the writing; BL provided guidance on the innovation and methods and revised the paper; CZ provided guidance on the methods and revised the paper; LX completed the formalization of the formulas; MP and BL supplemented and improved the experiments. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Bing Luo or Chao Zhang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, F., Luo, B., Zhang, C. et al. Vision-Aware Language Reasoning for Referring Image Segmentation. Neural Process Lett 55, 11313–11331 (2023). https://doi.org/10.1007/s11063-023-11377-z

