DOI: 10.1145/3664647.3681058

QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding

Published: 28 October 2024

Abstract

Visual grounding is the task of locating the object referred to by a natural language description. To reduce annotation costs, recent research has been devoted to one-stage weakly supervised methods for visual grounding, which typically adopt the anchor-text matching paradigm. Despite their efficiency, we identify that anchor representations are often noisy and insufficient to describe object information, which inevitably hinders vision-language alignment. In this paper, we propose a novel query-based one-stage framework for weakly supervised visual grounding, namely QueryMatch. Different from previous work, QueryMatch represents candidate objects with a set of query features, which inherently establish accurate one-to-one associations with visual objects. In this case, QueryMatch re-formulates weakly supervised visual grounding as a query-text matching problem, which can be optimized via query-based contrastive learning. Based on QueryMatch, we further propose an innovative strategy for effective weakly supervised learning, namely Active Query Selection (AQS). In particular, AQS aims to enhance the effectiveness of query-based contrastive learning by actively selecting high-quality query features. Through this strategy, AQS can greatly benefit the weakly supervised learning of QueryMatch. To validate our approach, we conduct extensive experiments on three benchmark datasets of two grounding tasks, i.e., referring expression comprehension (REC) and segmentation (RES). Experimental results not only show the state-of-the-art performance of QueryMatch in the two tasks, e.g., over +5% [email protected] on RefCOCO in REC and over +20% mIoU on RefCOCO in RES, but also confirm the effectiveness of AQS in weakly supervised learning. Source code is available at https://github.com/TensorThinker/QueryMatch.
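The abstract only sketches how query-text matching and Active Query Selection fit together. As a rough, non-authoritative illustration of the general idea (not the paper's actual implementation), the PyTorch sketch below pairs a hypothetical top-k query selection step with an InfoNCE-style query-text contrastive loss; every function name, tensor shape, and the use of a single scalar quality score per query are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def select_queries(queries, quality_scores, k):
    """Hypothetical stand-in for Active Query Selection (AQS): keep the
    top-k queries per image ranked by a scalar quality score."""
    # queries: (B, Nq, D) query features from a query-based model (e.g., DETR-style)
    # quality_scores: (B, Nq) assumed per-query quality/confidence score
    idx = quality_scores.topk(k, dim=1).indices                # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, queries.size(-1))   # (B, k, D)
    return queries.gather(1, idx)                              # (B, k, D)

def query_text_contrastive_loss(queries, text, tau=0.07):
    """Query-text matching as contrastive learning: each text is pulled
    toward its own image's best-matching selected query, while queries
    from other images in the batch act as negatives (InfoNCE-style)."""
    # queries: (B, k, D) selected query features; text: (B, D) sentence features
    q = F.normalize(queries, dim=-1)
    t = F.normalize(text, dim=-1)
    # similarity of every text to every query in the batch: (B, B, k)
    sim = torch.einsum('bd,nkd->bnk', t, q) / tau
    # each text scores an image by that image's best-matching query
    logits = sim.max(dim=-1).values                            # (B, B)
    labels = torch.arange(t.size(0), device=t.device)          # diagonal positives
    return F.cross_entropy(logits, labels)

# Toy usage with random features, purely for shape checking.
B, Nq, D, k = 4, 100, 256, 10
queries = torch.randn(B, Nq, D)
scores = torch.rand(B, Nq)
text = torch.randn(B, D)
loss = query_text_contrastive_loss(select_queries(queries, scores, k), text)
```

The paper's AQS criterion for "high-quality" queries and its exact matching objective may well differ from the simple top-k score and max-over-queries reduction used here; this sketch only shows why one-to-one query features make a clean contrastive query-text matching objective possible.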

      Published In

      MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024, 11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. contrastive learning
      2. weakly supervised visual grounding

      Qualifiers

      • Research-article

      Funding Sources

      • National Key R&D Program of China
• National Natural Science Foundation of China
• National Science Fund for Distinguished Young Scholars
• Natural Science Foundation of Fujian Province of China
      • CCF-NetEase ThunderFire Innovation Research Funding

      Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

      Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
