Abstract
Referring Expression Comprehension (REC) aims to ground the target object described by a given referring expression, which normally requires expensive instance-level annotations for training. To reduce this cost, recent work introduces an efficient one-stage weakly supervised REC model called RefCLIP. Specifically, RefCLIP uses the anchor features of a pre-trained one-stage detection network to represent candidate objects and performs anchor-text ranking to locate the referent. Despite its effectiveness, we find that the visual semantics of RefCLIP are ambiguous and insufficient for weakly supervised REC modeling. To remedy this, we propose a novel method that enriches visual semantics with various prompt information, called Anchor-based Prompt Learning (APL). Specifically, APL contains an innovative anchor-based prompt encoder (APE) that produces discriminative prompts covering three aspects of REC modeling, i.e., position, color and category. These prompts are dynamically fused into the anchor features to improve their descriptive power. In addition, we propose two novel auxiliary objectives for accurate vision-language alignment in APL, namely a text reconstruction loss and a visual alignment loss. To validate APL, we conduct extensive experiments on four REC benchmarks, namely RefCOCO, RefCOCO+, RefCOCOg and ReferIt. Experimental results not only show the state-of-the-art performance of APL over existing methods on all four benchmarks, e.g., +6.44% over RefCLIP on RefCOCO, but also confirm its strong generalization to weakly supervised referring expression segmentation. Source code is released at: https://github.com/Yaxin9Luo/APL.
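As a reading aid, below is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: position, color and category prompts are fused into detector anchor features, and candidate anchors are then ranked against the expression embedding. All names (AnchorPromptEncoder, anchor_text_ranking), dimensions and projection choices are illustrative assumptions, not the released APL implementation.

```python
# Illustrative sketch only: symbols and shapes below are assumptions that mirror
# the abstract's description, not code from the APL repository.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorPromptEncoder(nn.Module):
    """Produce position, color and category prompts and fuse them into anchor features."""
    def __init__(self, dim: int):
        super().__init__()
        self.pos_proj = nn.Linear(4, dim)       # (x, y, w, h) of each anchor box
        self.color_proj = nn.Linear(3, dim)     # mean RGB inside each anchor region
        self.cat_embed = nn.Embedding(80, dim)  # detector class id (assuming 80 COCO classes)
        self.fuse = nn.Linear(4 * dim, dim)

    def forward(self, anchor_feat, boxes, colors, class_ids):
        # anchor_feat: (B, N, dim); boxes: (B, N, 4); colors: (B, N, 3); class_ids: (B, N)
        prompts = torch.cat(
            [anchor_feat,
             self.pos_proj(boxes),
             self.color_proj(colors),
             self.cat_embed(class_ids)], dim=-1)
        return self.fuse(prompts)               # prompt-enriched anchor features, (B, N, dim)

def anchor_text_ranking(anchor_feat, text_feat):
    """Rank candidate anchors against the expression embedding by cosine similarity."""
    # anchor_feat: (B, N, dim); text_feat: (B, dim)
    scores = F.cosine_similarity(anchor_feat, text_feat.unsqueeze(1), dim=-1)  # (B, N)
    return scores.argmax(dim=1)                 # index of the predicted referent anchor
```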
Notes
1. The pseudo-box is produced via anchor-text matching.
2. We adopt average pooling and K-means to accurately capture object colors (a minimal sketch is given after these notes).
3. Validation and testing images of the REC task are removed.
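The following is a minimal sketch of one possible reading of Note 2, assuming the object color is taken as the dominant K-means cluster centre over average-pooled pixels inside an anchor box; the function name and parameters are illustrative, not from the APL code.

```python
# Hedged sketch of Note 2: average-pool pixels inside a (hypothetical) anchor box,
# then cluster with K-means and keep the dominant cluster centre as the object color.
import numpy as np
from sklearn.cluster import KMeans

def dominant_color(image: np.ndarray, box, k: int = 3, pool: int = 4) -> np.ndarray:
    """image: HxWx3 uint8 array; box: (x1, y1, x2, y2) in pixel coordinates."""
    x1, y1, x2, y2 = map(int, box)
    crop = image[y1:y2, x1:x2].astype(np.float32)

    # Average pooling: shrink the crop so clustering sees smoothed pool x pool cells.
    h, w = crop.shape[:2]
    pool = max(1, min(pool, h, w))          # guard against boxes smaller than the pool size
    h, w = (h // pool) * pool, (w // pool) * pool
    crop = crop[:h, :w].reshape(h // pool, pool, w // pool, pool, 3).mean(axis=(1, 3))

    pixels = crop.reshape(-1, 3)
    k = min(k, len(pixels))                 # avoid asking for more clusters than pixels
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_[counts.argmax()]  # RGB of the largest cluster
```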
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 623B2088).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Luo, Y., Ji, J., Chen, X., Zhang, Y., Ren, T., Luo, G. (2025). APL: Anchor-Based Prompt Learning for One-Stage Weakly Supervised Referring Expression Comprehension. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_12
DOI: https://doi.org/10.1007/978-3-031-72624-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72623-1
Online ISBN: 978-3-031-72624-8