Abstract
Given an image and a text description, visual grounding localizes the target region in the image described by the text. It has two task settings: referring expression comprehension (REC), which estimates a bounding box, and referring expression segmentation (RES), which predicts a segmentation mask. The most promising current approaches learn REC and RES jointly, supervised by rich ground truth consisting of both the bounding box and the segmentation mask of the target object. However, we argue that a simple yet strong constraint has been overlooked by existing approaches: given an image and a text description, REC and RES refer to the same object. We propose the Location Aware Transformer (LoA-Trans), which makes this constraint explicit through a center prompt: the system first predicts the center of the target object with a Location-Aware Network and feeds it as a common prompt to both REC and RES, thereby constraining the two tasks to refer to the same object. To mitigate possible inaccuracies in center estimation, we introduce a query selection mechanism. Instead of randomly initialized queries for bounding-box and segmentation-mask decoding, it generates candidate object locations besides the estimated center and uses them as location-aware queries, serving as a remedy for inaccurate center estimation. We also introduce a TaskSyn Network in the decoder for better coordination between REC and RES. Our method achieves state-of-the-art performance on three commonly used datasets: RefCOCO, RefCOCO+, and RefCOCOg. Extensive ablation studies demonstrate the validity of each proposed component.
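As a purely illustrative aid, the minimal PyTorch sketch below shows one way the center-prompt idea summarized above could be wired up: a predicted center is embedded together with a few additional location-aware queries, and the same decoded queries feed both a box (REC) head and a mask (RES) head. All module names, dimensions, and the query-selection heuristic here are assumptions made for the example; they do not reproduce the paper's actual Location-Aware Network, query selection mechanism, or TaskSyn Network.

```python
# Hypothetical sketch of a shared "center prompt" driving both REC and RES.
# Everything here (layer sizes, selection heuristic) is an illustrative assumption.
import torch
import torch.nn as nn


class CenterPromptGrounder(nn.Module):
    def __init__(self, d_model=256, num_queries=8):
        super().__init__()
        # Predicts a normalized (x, y) target center from pooled vision-language
        # features (playing the role of a location-aware center estimator).
        self.center_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2), nn.Sigmoid()
        )
        # Embeds a 2-D location into a query vector (the "center prompt").
        self.loc_embed = nn.Linear(2, d_model)
        self.num_queries = num_queries
        # Shared decoder layer; box and mask heads branch from its output.
        self.decoder = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.box_head = nn.Linear(d_model, 4)    # REC: (cx, cy, w, h)
        self.mask_head = nn.Linear(d_model, 32)  # RES: per-query mask embedding

    def select_queries(self, center, fused_tokens):
        # Location-aware query selection (stand-in heuristic): besides the
        # estimated center, take the positions of the highest-activation image
        # tokens as extra candidate locations. Assumes a square token grid.
        b, n, _ = fused_tokens.shape
        scores = fused_tokens.norm(dim=-1)                       # (B, N)
        topk = scores.topk(self.num_queries - 1, dim=1).indices  # (B, K-1)
        side = int(n ** 0.5)
        grid = torch.stack(torch.meshgrid(
            torch.linspace(0, 1, side), torch.linspace(0, 1, side),
            indexing="ij"), dim=-1).reshape(n, 2)
        cand = grid.to(fused_tokens.device)[topk]                # (B, K-1, 2)
        return torch.cat([center.unsqueeze(1), cand], dim=1)     # (B, K, 2)

    def forward(self, fused_tokens):
        # fused_tokens: (B, N, d_model) vision-language features from an encoder.
        pooled = fused_tokens.mean(dim=1)
        center = self.center_head(pooled)                        # (B, 2) center prompt
        queries = self.loc_embed(self.select_queries(center, fused_tokens))
        decoded = self.decoder(queries, fused_tokens)            # (B, K, d_model)
        # Both heads read the same location-aware queries, so the box and the
        # mask are tied to the same estimated object location.
        return center, self.box_head(decoded), self.mask_head(decoded)


if __name__ == "__main__":
    model = CenterPromptGrounder()
    feats = torch.randn(2, 196, 256)  # e.g. 14x14 image tokens fused with text
    center, boxes, mask_emb = model(feats)
    print(center.shape, boxes.shape, mask_emb.shape)  # (2, 2) (2, 8, 4) (2, 8, 32)
```

The point of the sketch is the shared conditioning: because the bounding-box and mask predictions branch from the same location-aware queries anchored on the predicted center, the two tasks cannot drift toward different objects, which is the constraint the abstract argues prior joint REC/RES methods leave implicit.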
Acknowledgement
This work is partly supported by JSPS KAKENHI Grant Number JP23K24876 and JST ASPIRE Program Grant Number JPMJAP2303.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, Z., Satoh, S. (2025). LoA-Trans: Enhancing Visual Grounding by Location-Aware Transformers. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_23
DOI: https://doi.org/10.1007/978-3-031-72667-5_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72666-8
Online ISBN: 978-3-031-72667-5
eBook Packages: Computer Science, Computer Science (R0)