DOI: 10.1145/3664647.3680897
Research article

Triple Alignment Strategies for Zero-shot Phrase Grounding under Weak Supervision

Published: 28 October 2024

Abstract

Phrase Grounding (PG) aims to locate the objects referred to by noun phrases. Recently, PG under weak supervision (i.e., grounding without region-level annotations) and zero-shot PG (i.e., grounding that generalizes from seen categories to unseen ones) have been proposed separately. However, both settings are limited in real-world applications, where annotations are scarce and only a small number of categories are seen during training. In this paper, we propose a framework for zero-shot PG under weak supervision. Specifically, our PG framework is built on triple alignment strategies. First, we propose a region-text alignment (RTA) strategy to build region-level attribute associations via CLIP. Second, we propose a domain alignment (DomA) strategy that minimizes the difference between the distributions of seen classes during training and those during pre-training. Third, we propose a category alignment (CatA) strategy that considers both category semantics and region-category relations. Extensive experimental results show that our proposed PG framework outperforms previous zero-shot methods and weakly-supervised methods. Our code is available at https://github.com/LinPengyue/ZS-WSG.
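As a rough illustration of the region-text alignment (RTA) idea, the sketch below scores candidate image regions against a phrase embedding by cosine similarity, as CLIP-style matching does. This is a minimal sketch under stated assumptions: the function names, toy embeddings, and values here are hypothetical, not the authors' implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ground_phrase(region_embs, phrase_emb):
    """Pick the region whose (CLIP-style) embedding is most similar to the
    phrase embedding; returns (best_index, per-region similarity scores)."""
    scores = [cosine(r, phrase_emb) for r in region_embs]
    return scores.index(max(scores)), scores

# Toy 4-dim "embeddings": region 1 points in (almost) the same direction as the phrase.
regions = [[1.0, 0.0, 0.0, 0.0],
           [0.1, 0.9, 0.1, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
phrase = [0.1, 1.0, 0.1, 0.0]
best, scores = ground_phrase(regions, phrase)
print(best)  # → 1
```

In the paper's setting the embeddings would come from a pre-trained vision-language model such as CLIP rather than hand-written vectors; the zero-shot property follows because the similarity is computed in the shared embedding space, not over a fixed category set.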



    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. phrase grounding
    2. vision and language
    3. vision-language pre-training
    4. weakly supervised
    5. zero-shot

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Beijing Natural Science Foundation Project

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
