
Cross-modal attention guided visual reasoning for referring image segmentation

Published in Multimedia Tools and Applications

Abstract

The goal of referring image segmentation (RIS) is to generate the foreground mask of the object described by a natural language expression. The key to RIS is learning valid multimodal features across the visual and linguistic modalities so that the referred object can be identified accurately. In this paper, a cross-modal attention-guided visual reasoning model for referring segmentation is proposed. First, multi-scale detail is captured by a pyramidal convolution module to enhance the visual representation. Then, the entity words of the referring expression and the relevant image regions are aligned by a cross-modal attention mechanism, so that all the entities mentioned in the expression can be identified. Finally, a fully connected multimodal graph is constructed from the multimodal features and the relationship cues in the expression. Visual reasoning is performed step by step on the graph to highlight the correct entity while suppressing irrelevant ones. Experimental results on four benchmark datasets show that the proposed method improves performance (e.g., +1.13% on UNC, +3.06% on UNC+, +2.1% on G-Ref, and +1.11% on ReferIt). The effectiveness and feasibility of each component are further verified by extensive ablation studies.
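To make the pipeline concrete, the following PyTorch sketch illustrates the two central operations described above: cross-modal attention that aligns words with image regions, and one step of reasoning on a fully connected multimodal graph. It is a minimal illustration under assumed shapes; the module names, the feature dimension (512), and the number of reasoning steps are our assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Attend each spatial location of the visual feature map over the
    words of the expression, yielding language-aware visual features."""
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # queries from visual features
        self.k = nn.Linear(dim, dim)  # keys from word features
        self.v = nn.Linear(dim, dim)  # values from word features

    def forward(self, vis, words):
        # vis: (B, HW, d) flattened visual features; words: (B, T, d)
        q, k, v = self.q(vis), self.k(words), self.v(words)
        attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, HW, T)
        return vis + attn @ v  # residual fusion of attended word features

class GraphReasoningStep(nn.Module):
    """One round of message passing on a fully connected graph whose nodes
    are multimodal features; edge weights come from pairwise affinity."""
    def __init__(self, dim=512):
        super().__init__()
        self.edge = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, nodes):
        # nodes: (B, N, d); affinity: (B, N, N) row-normalised adjacency
        affinity = F.softmax(nodes @ self.edge(nodes).transpose(1, 2), dim=-1)
        return F.relu(nodes + self.update(affinity @ nodes))

# Stepwise reasoning: apply the graph update several times so evidence
# propagates between entities before the segmentation mask is predicted.
vis = torch.randn(2, 26 * 26, 512)   # toy visual features
words = torch.randn(2, 10, 512)      # toy word features
fused = CrossModalAttention()(vis, words)
step = GraphReasoningStep()
for _ in range(3):
    fused = step(fused)
print(fused.shape)                   # torch.Size([2, 676, 512])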



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant 62076246) and the Fundamental Research Funds for the Central Universities (No. 2019JKF426).


Availability of data and material

The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.

Code availability

The source code is available.

Author information

Authors and Affiliations

Authors

Contributions

Wenjing Zhang: Writing - original draft, Conceptualization, Methodology, Software.

Mengnan Hu: Writing - review & editing, Methodology.

Quange Tan: Validation, Visualization.

Qianli Zhou: Supervision, Resources.

Rong Wang: Investigation, Funding acquisition.

Corresponding author

Correspondence to Rong Wang.

Ethics declarations

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, W., Hu, M., Tan, Q. et al. Cross-modal attention guided visual reasoning for referring image segmentation. Multimed Tools Appl 82, 28853–28872 (2023). https://doi.org/10.1007/s11042-023-14586-9
