Research article · DOI: 10.1145/3581783.3612117

CARIS: Context-Aware Referring Image Segmentation

Published: 27 October 2023

ABSTRACT

Referring image segmentation aims to segment the target object described by a natural-language utterance. Recent approaches typically distinguish pixels by aligning pixel-wise visual features with linguistic features extracted from the referring description. Nevertheless, such a free-form description only specifies certain discriminative attributes of the target object or its relations to a limited number of objects, which fails to represent the rich visual context adequately. The stand-alone linguistic features are therefore unable to align with all visual concepts, resulting in inaccurate segmentation. In this paper, we propose to address this issue by incorporating rich visual context into linguistic features for sufficient vision-language alignment. Specifically, we present Context-Aware Referring Image Segmentation (CARIS), a novel architecture that enhances the contextual awareness of linguistic features via sequential vision-language attention and learnable prompts. Technically, CARIS develops a context-aware mask decoder with sequential bidirectional cross-modal attention to integrate the linguistic features with visual context, which are then aligned with pixel-wise visual features. Furthermore, two groups of learnable prompts are employed to delve into additional contextual information from the input image and facilitate the alignment with non-target pixels, respectively. Extensive experiments demonstrate that CARIS achieves new state-of-the-art performances on three public benchmarks. Code is available at https://github.com/lsa1997/CARIS.
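
The released code at the GitHub link above is the authoritative reference; as a rough illustration of the mechanism the abstract describes, the snippet below sketches one way the sequential bidirectional cross-modal attention and learnable context prompts could be wired together in PyTorch. This is a minimal sketch under assumptions, not the authors' implementation: the module name ContextAwareDecoderLayer, the way prompts are concatenated with word tokens, and the final dot-product mask prediction are all illustrative choices.

```python
# Minimal sketch (not the CARIS implementation) of the decoding flow the abstract
# describes: language tokens plus learnable prompts first attend to the visual
# features to absorb image context, then the pixel features attend back to the
# context-enriched language tokens before pixel-wise alignment produces mask logits.
import torch
import torch.nn as nn


class ContextAwareDecoderLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, num_prompts: int = 8):
        super().__init__()
        # Learnable prompts intended to mine extra visual context beyond the expression.
        self.context_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Step 1: language attends to vision (inject visual context into word features).
        self.lang2vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Step 2: vision attends to the context-enriched language tokens.
        self.vis2lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_l = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, vis_feats: torch.Tensor, lang_feats: torch.Tensor):
        # vis_feats: (B, HW, C) flattened pixel features; lang_feats: (B, T, C) word features.
        B = vis_feats.size(0)
        prompts = self.context_prompts.unsqueeze(0).expand(B, -1, -1)
        # Sequential step 1: enrich word tokens (and prompts) with visual context.
        queries = torch.cat([lang_feats, prompts], dim=1)
        ctx_lang, _ = self.lang2vis(queries, vis_feats, vis_feats)
        ctx_lang = self.norm_l(queries + ctx_lang)
        # Sequential step 2: update pixel features with the context-aware language tokens.
        upd_vis, _ = self.vis2lang(vis_feats, ctx_lang, ctx_lang)
        upd_vis = self.norm_v(vis_feats + upd_vis)
        # Pixel-wise alignment: dot product between each pixel and a sentence-level query
        # (here simply the mean of the word tokens) gives the mask logits.
        target_query = ctx_lang[:, : lang_feats.size(1)].mean(dim=1, keepdim=True)
        mask_logits = torch.einsum("bqc,bpc->bqp", target_query, upd_vis)  # (B, 1, HW)
        return upd_vis, ctx_lang, mask_logits


# Example usage (shapes only, assuming a 16x16 feature map and 12 word tokens):
layer = ContextAwareDecoderLayer(dim=256, heads=8, num_prompts=8)
vis = torch.randn(2, 16 * 16, 256)
lang = torch.randn(2, 12, 256)
_, _, logits = layer(vis, lang)  # logits: (2, 1, 256); reshape to (2, 1, 16, 16) for the mask
```

In this sketch the prompts simply ride along with the word tokens as extra queries, so they can absorb image context the referring expression never mentions; how CARIS actually defines and supervises its two prompt groups (including the prompts aligned with non-target pixels) is specified in the paper, not here.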

Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

        Copyright © 2023 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States
