Abstract
Given an image and a text description, visual grounding localizes the target region in the image described by the text. It has two task settings: referring expression comprehension (REC), which estimates a bounding box, and referring expression segmentation (RES), which predicts a segmentation mask. The most promising current approaches learn REC and RES jointly, supervised by rich ground truth consisting of both the bounding box and the segmentation mask of the target object. However, we argue that a simple yet strong constraint has been overlooked by existing approaches: given an image and a text description, REC and RES refer to the same object. We propose the Location Aware Transformer (LoA-Trans), which makes this constraint explicit through a center prompt: the system first predicts the center of the target object with a Location-Aware Network and feeds it as a common prompt to both REC and RES, thereby constraining the two tasks to refer to the same object. To mitigate possible inaccuracies in center estimation, we introduce a query selection mechanism. Instead of randomly initialized queries for bounding-box and segmentation-mask decoding, it generates candidate object locations besides the estimated center and uses them as location-aware queries, serving as a remedy for inaccurate center estimation. We also introduce a TaskSyn Network in the decoder for better coordination between REC and RES. Our method achieves state-of-the-art performance on three commonly used datasets: RefCOCO, RefCOCO+, and RefCOCOg. Extensive ablation studies demonstrate the validity of each proposed component.
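As a purely illustrative aid, the minimal PyTorch sketch below shows one way the center-prompt idea summarized above could be wired up: a predicted center is embedded together with a few additional location-aware queries, and the same decoded queries feed both a box (REC) head and a mask (RES) head. All module names, dimensions, and the query-selection heuristic here are assumptions made for the example; they do not reproduce the paper's actual Location-Aware Network, query selection mechanism, or TaskSyn Network.

```python
# Hypothetical sketch of a shared "center prompt" driving both REC and RES.
# Everything here (layer sizes, selection heuristic) is an illustrative assumption.
import torch
import torch.nn as nn


class CenterPromptGrounder(nn.Module):
    def __init__(self, d_model=256, num_queries=8):
        super().__init__()
        # Predicts a normalized (x, y) target center from pooled vision-language
        # features (playing the role of a location-aware center estimator).
        self.center_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2), nn.Sigmoid()
        )
        # Embeds a 2-D location into a query vector (the "center prompt").
        self.loc_embed = nn.Linear(2, d_model)
        self.num_queries = num_queries
        # Shared decoder layer; box and mask heads branch from its output.
        self.decoder = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.box_head = nn.Linear(d_model, 4)    # REC: (cx, cy, w, h)
        self.mask_head = nn.Linear(d_model, 32)  # RES: per-query mask embedding

    def select_queries(self, center, fused_tokens):
        # Location-aware query selection (stand-in heuristic): besides the
        # estimated center, take the positions of the highest-activation image
        # tokens as extra candidate locations. Assumes a square token grid.
        b, n, _ = fused_tokens.shape
        scores = fused_tokens.norm(dim=-1)                       # (B, N)
        topk = scores.topk(self.num_queries - 1, dim=1).indices  # (B, K-1)
        side = int(n ** 0.5)
        grid = torch.stack(torch.meshgrid(
            torch.linspace(0, 1, side), torch.linspace(0, 1, side),
            indexing="ij"), dim=-1).reshape(n, 2)
        cand = grid.to(fused_tokens.device)[topk]                # (B, K-1, 2)
        return torch.cat([center.unsqueeze(1), cand], dim=1)     # (B, K, 2)

    def forward(self, fused_tokens):
        # fused_tokens: (B, N, d_model) vision-language features from an encoder.
        pooled = fused_tokens.mean(dim=1)
        center = self.center_head(pooled)                        # (B, 2) center prompt
        queries = self.loc_embed(self.select_queries(center, fused_tokens))
        decoded = self.decoder(queries, fused_tokens)            # (B, K, d_model)
        # Both heads read the same location-aware queries, so the box and the
        # mask are tied to the same estimated object location.
        return center, self.box_head(decoded), self.mask_head(decoded)


if __name__ == "__main__":
    model = CenterPromptGrounder()
    feats = torch.randn(2, 196, 256)  # e.g. 14x14 image tokens fused with text
    center, boxes, mask_emb = model(feats)
    print(center.shape, boxes.shape, mask_emb.shape)  # (2, 2) (2, 8, 4) (2, 8, 32)
```

The point of the sketch is the shared conditioning: because the bounding-box and mask predictions branch from the same location-aware queries anchored on the predicted center, the two tasks cannot drift toward different objects, which is the constraint the abstract argues prior joint REC/RES methods leave implicit.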
Acknowledgement
This work is partly supported by JSPS KAKENHI Grant Number JP23K24876 and JST ASPIRE Program Grant Number JPMJAP2303.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, Z., Satoh, S. (2025). LoA-Trans: Enhancing Visual Grounding by Location-Aware Transformers. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_23
DOI: https://doi.org/10.1007/978-3-031-72667-5_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72666-8
Online ISBN: 978-3-031-72667-5
eBook Packages: Computer Science, Computer Science (R0)