
Mutually-Guided Hierarchical Multi-Modal Feature Learning for Referring Image Segmentation

Published: 25 November 2024

Abstract

Referring image segmentation aims to locate and segment the target region specified by a textual expression. The primary challenge is to understand the semantics of both the visual and textual modalities and to align and match them. Prior works address this challenge by using separately pretrained unimodal models to extract global visual and textual features and then performing a straightforward fusion to establish cross-modal semantic associations. However, these methods concentrate solely on global semantics, disregarding the hierarchical semantics of the expression and the image, and struggle with complex, open real-world scenes, thus failing to capture critical cross-modal information. To address these limitations, this article introduces a mutually guided hierarchical multi-modal feature learning scheme. Guided by the global visual feature, the model mines hierarchical text features from different stages of the text encoder; simultaneously, guided by the global textual feature, it aggregates multi-scale visual features. This mutually guided hierarchical feature learning effectively handles the semantic inaccuracy caused by free-form text and the naturally occurring scale variations of objects. Furthermore, a Segment Detail Refinement (SDR) module is designed to enhance the model's spatial detail awareness through attention mapping between low-level visual features and cross-modal features. Extensive experiments on three widely used referring image segmentation datasets demonstrate that the proposed method accurately locates and segments the referred objects.
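The mutual-guidance idea in the abstract can be sketched in a few lines: the global feature of one modality scores the hierarchical features of the other, and the scores weight a pooled representation. This is a minimal illustrative sketch only — the function names, the dot-product scoring, and the mean-pooled "global" features are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_aggregate(hier_feats, global_guide):
    """Pool hierarchical features of one modality under the guidance
    of the other modality's global feature.

    hier_feats:   (num_levels, dim) — features from different encoder
                  stages (text) or spatial scales (vision)
    global_guide: (dim,) — global feature from the other modality
    Returns one fused (dim,) feature.
    """
    # Scaled dot-product relevance of each level to the guide
    scores = hier_feats @ global_guide / np.sqrt(hier_feats.shape[-1])
    weights = softmax(scores)          # (num_levels,), sums to 1
    return weights @ hier_feats        # weighted sum over levels

rng = np.random.default_rng(0)
dim = 8
text_stages = rng.normal(size=(4, dim))  # hierarchical text features
vis_scales = rng.normal(size=(3, dim))   # multi-scale visual features
g_text = text_stages.mean(axis=0)        # stand-in global text feature
g_vis = vis_scales.mean(axis=0)          # stand-in global visual feature

# Mutual guidance: each modality's hierarchy is pooled under the
# other modality's global feature
fused_text = guided_aggregate(text_stages, g_vis)
fused_vis = guided_aggregate(vis_scales, g_text)
print(fused_text.shape, fused_vis.shape)  # (8,) (8,)
```

In the actual model the fusion would operate on per-token and per-pixel feature maps with learned projections; the sketch only shows the guidance direction (vision → text hierarchy, text → visual scales).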


Cited By

  • Noise-Resistance Learning via Multi-Granularity Consistency for Unsupervised Domain Adaptive Person Re-Identification. ACM Transactions on Multimedia Computing, Communications, and Applications (2024). DOI: 10.1145/3702328
  • Correlation-aware Cross-modal Attention Network for Fashion Compatibility Modeling in UGC Systems. ACM Transactions on Multimedia Computing, Communications, and Applications (2024). DOI: 10.1145/3698772
  • Efficiently Gluing Pre-Trained Language and Vision Models for Image Captioning. ACM Transactions on Intelligent Systems and Technology 15, 6 (2024), 1–16. DOI: 10.1145/3682067
  • Dual-path Collaborative Generation Network for Emotional Video Captioning. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 496–505. DOI: 10.1145/3664647.3681603
  • Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2024), 229–239. DOI: 10.1145/3626772.3657727


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 12
December 2024, 721 pages
EISSN: 1551-6865
DOI: 10.1145/3618076

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 November 2024
    Online AM: 05 October 2024
    Accepted: 25 September 2024
    Revised: 29 June 2024
    Received: 28 February 2024
    Published in TOMM Volume 20, Issue 12


    Author Tags

    1. Referring Image Object Segmentation
    2. Hierarchical Feature Representation
    3. Segment Detail Refinement

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Key Research Program of Hubei


