Abstract
The goal of referring image segmentation (RIS) is to generate the foreground mask of the object described by a natural language expression. The key to RIS is learning valid multimodal features across the visual and linguistic modalities so that the referred object can be identified accurately. In this paper, a cross-modal attention-guided visual reasoning model for referring segmentation is proposed. First, multi-scale detail is captured by a pyramidal convolution module to enhance the visual representation. Then, a cross-modal attention mechanism aligns the entity words of the referring expression with the relevant image regions, so that all entities mentioned in the expression can be identified. Finally, a fully connected multimodal graph is constructed from the multimodal features and the relationship cues in the expression, and visual reasoning is performed step by step on the graph to highlight the correct entity while suppressing irrelevant ones. Experimental results on four benchmark datasets show that the proposed method improves performance (e.g., +1.13% on UNC, +3.06% on UNC+, +2.1% on G-Ref, and +1.11% on ReferIt). The effectiveness and feasibility of each component are further verified by extensive ablation studies.
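The pipeline summarized above — align entity words with image regions via cross-modal attention, then perform stepwise reasoning on a fully connected multimodal graph — can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: the dot-product scoring, feature dimensions, and the two propagation steps are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(words, pixels):
    """Attend each word over all image regions.
    words:  (T, d) linguistic features for T entity words
    pixels: (N, d) flattened visual features (N = H*W)
    Returns (T, d): per-word attended visual context."""
    scores = words @ pixels.T / np.sqrt(words.shape[1])  # (T, N) word-region affinity
    return softmax(scores, axis=1) @ pixels              # weighted sum over regions

def graph_reasoning(nodes, steps=2):
    """Stepwise message passing on a fully connected graph whose
    soft edge weights come from node-feature similarity."""
    for _ in range(steps):
        adj = softmax(nodes @ nodes.T, axis=1)  # (T, T) soft adjacency
        nodes = adj @ nodes                     # propagate multimodal features
    return nodes

rng = np.random.default_rng(0)
words, pixels = rng.normal(size=(5, 16)), rng.normal(size=(64, 16))
multimodal = cross_modal_attention(words, pixels)  # entity-region alignment
refined = graph_reasoning(multimodal)              # reasoning over entities
```

In the paper's setting, the refined node features would be used to score which entity the expression refers to; here the sketch only shows the shape of the computation.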
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant 62076246), Fundamental Research Funds for the Central Universities (No. 2019JKF426).
Availability of data and material
The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.
Code availability
Source codes are available.
Author information
Contributions
Wenjing Zhang: Writing - original draft, Conceptualization, Methodology, Software.
Mengnan Hu: Writing - review & editing, Methodology.
Quange Tan: Validation, Visualization.
Qianli Zhou: Supervision, Resources.
Rong Wang: Investigation, Funding acquisition.
Ethics declarations
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhang, W., Hu, M., Tan, Q. et al. Cross-modal attention guided visual reasoning for referring image segmentation. Multimed Tools Appl 82, 28853–28872 (2023). https://doi.org/10.1007/s11042-023-14586-9