DOI: 10.1145/3581783.3611721
research-article

Suspected Objects Matter: Rethinking Model's Prediction for One-stage Visual Grounding

Published: 27 October 2023

Abstract

Recently, one-stage visual grounders have attracted considerable attention because they achieve accuracy comparable to two-stage grounders while being significantly more efficient. However, inter-object relation modeling has not been well studied for one-stage grounders. Such modeling, though important, need not be performed among all objects: only a subset of them is related to the text query and liable to confuse the model. We call these objects "suspected objects". Exploring their relationships in the one-stage paradigm is non-trivial because: (1) no object proposals are available as the basis on which to select suspected objects and perform relationship modeling; (2) suspected objects are more confusing than others, since they may share similar semantics or be entangled with certain relationships, and can thereby more easily mislead the model's prediction. To this end, we propose a Suspected Object Transformation (SOT) mechanism, which can be seamlessly integrated into existing CNN- and Transformer-based one-stage visual grounders to encourage target object selection among the suspected ones. Suspected objects are dynamically discovered from a learned activation map that adapts to the model's current discrimination ability during training. On top of the suspected objects, a Keyword-Aware Discrimination (KAD) module and an Exploration by Random Connection (ERC) strategy are then proposed to help the model rethink its initial prediction. KAD leverages the keywords that contribute most to discriminating among suspected objects, while ERC lets the model seek the correct object rather than being trapped into always exploiting its current false prediction. Extensive experiments demonstrate the effectiveness of the proposed method.
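To make the abstract's selection-and-rethinking idea concrete, below is a minimal, hypothetical sketch, not the authors' released code: it picks "suspected objects" as the top-k locations of an activation map (mirroring the dynamic discovery described above) and uses an epsilon-greedy re-selection loosely analogous to ERC's exploration of alternatives to a possibly false prediction. The function names, the top-k rule, and the epsilon parameter are all illustrative assumptions.

```python
import torch


def discover_suspected_objects(activation_map: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Pick the k most-activated spatial locations as 'suspected objects'.

    activation_map: (B, H, W) per-location confidence scores from a
                    one-stage grounding head. Because the map is learned,
    the selected set adapts to the model's current discrimination ability.
    Returns indices of shape (B, k) into the flattened H*W grid.
    """
    b, h, w = activation_map.shape
    flat = activation_map.view(b, h * w)
    _, topk_idx = flat.topk(k, dim=-1)  # most confident (and most confusable) spots
    return topk_idx


def exploration_by_random_connection(scores: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
    """Epsilon-greedy re-selection among suspected objects.

    scores: (B, k) discrimination scores over the suspected objects.
    With probability epsilon, a random suspected object is chosen instead of
    the current argmax, so training is not trapped into always exploiting a
    false prediction (a rough analogue of the paper's ERC strategy).
    """
    b, k = scores.shape
    greedy = scores.argmax(dim=-1)
    random_pick = torch.randint(0, k, (b,))
    explore = torch.rand(b) < epsilon
    return torch.where(explore, random_pick, greedy)


if __name__ == "__main__":
    amap = torch.rand(2, 20, 20)                       # toy activation map
    idx = discover_suspected_objects(amap, k=5)        # (2, 5) grid indices
    suspected_scores = amap.view(2, -1).gather(1, idx)  # scores of suspected objects
    chosen = exploration_by_random_connection(suspected_scores)
    print(idx.shape, chosen.shape)                     # torch.Size([2, 5]) torch.Size([2])
```

Note that the real KAD module additionally conditions this discrimination step on query keywords; the sketch omits the language side entirely and only illustrates the select-then-rethink control flow.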




Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 27 October 2023


    Author Tags

    1. one-stage paradigm
    2. suspected objects
    3. visual grounding

    Qualifiers

    • Research-article

    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

• Downloads (last 12 months): 210
• Downloads (last 6 weeks): 6

Reflects downloads up to 05 Mar 2025.

Cited By

• (2024) CSRef: Contrastive Semantic Alignment for Speech Referring Expression Comprehension. Proceedings of the 2nd International Workshop on Methodologies for Multimedia, 28-34. https://doi.org/10.1145/3689089.3689706. Online publication date: 28-Oct-2024.
• (2023) Hierarchical cross-modal contextual attention network for visual grounding. Multimedia Systems, 29(4), 2073-2083. https://doi.org/10.1007/s00530-023-01097-8. Online publication date: 17-Apr-2023.
• (2022) CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval. Computer Vision - ECCV 2022, 700-716. https://doi.org/10.1007/978-3-031-20059-5_40. Online publication date: 23-Oct-2022.
• (2022) MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes. Computer Vision - ECCV 2022, 528-545. https://doi.org/10.1007/978-3-031-19833-5_31. Online publication date: 23-Oct-2022.
