Abstract
The goal of referring image segmentation (RIS) is to generate the foreground mask of the object described by a natural language expression. The key to RIS is learning valid multimodal features across the visual and linguistic modalities so that the referred object can be identified accurately. In this paper, a cross-modal attention-guided visual reasoning model for referring segmentation is proposed. First, multi-scale detail is captured by a pyramidal convolution module to enhance the visual representation. Then, a cross-modal attention mechanism aligns the entity words of the referring expression with the relevant image regions, so that all entities mentioned in the expression can be identified. Finally, a fully connected multimodal graph is constructed from the multimodal features and the relationship cues in the expression, and visual reasoning is performed step by step on the graph to highlight the correct entity while suppressing irrelevant ones. Experimental results on four benchmark datasets show that the proposed method improves performance (e.g., +1.13% on UNC, +3.06% on UNC+, +2.1% on G-Ref, and +1.11% on ReferIt). The effectiveness and feasibility of each component are further verified by extensive ablation studies.
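The pipeline summarized above — align entity words with image regions via cross-modal attention, then perform stepwise reasoning on a fully connected multimodal graph — can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: the dot-product scoring, feature dimensions, and the two propagation steps are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(words, pixels):
    """Attend each word over all image regions.
    words:  (T, d) linguistic features for T entity words
    pixels: (N, d) flattened visual features (N = H*W)
    Returns (T, d): per-word attended visual context."""
    scores = words @ pixels.T / np.sqrt(words.shape[1])  # (T, N) word-region affinity
    return softmax(scores, axis=1) @ pixels              # weighted sum over regions

def graph_reasoning(nodes, steps=2):
    """Stepwise message passing on a fully connected graph whose
    soft edge weights come from node-feature similarity."""
    for _ in range(steps):
        adj = softmax(nodes @ nodes.T, axis=1)  # (T, T) soft adjacency
        nodes = adj @ nodes                     # propagate multimodal features
    return nodes

rng = np.random.default_rng(0)
words, pixels = rng.normal(size=(5, 16)), rng.normal(size=(64, 16))
multimodal = cross_modal_attention(words, pixels)  # entity-region alignment
refined = graph_reasoning(multimodal)              # reasoning over entities
```

In the paper's setting, the refined node features would be used to score which entity the expression refers to; here the sketch only shows the shape of the computation.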
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant 62076246), Fundamental Research Funds for the Central Universities (No. 2019JKF426).
Availability of data and material
The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.
Code availability
Source codes are available.
Author information
Contributions
Wenjing Zhang: Writing - original draft, Conceptualization, Methodology, Software.
Mengnan Hu: Writing - review & editing, Methodology.
Quange Tan: Validation, Visualization.
Qianli Zhou: Supervision, Resources.
Rong Wang: Investigation, Funding acquisition.
Ethics declarations
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhang, W., Hu, M., Tan, Q. et al. Cross-modal attention guided visual reasoning for referring image segmentation. Multimed Tools Appl 82, 28853–28872 (2023). https://doi.org/10.1007/s11042-023-14586-9