Abstract
Referring Expression Comprehension (REC) aims to ground the target object described by a given referring expression, which normally requires expensive instance-level annotations for training. To reduce this cost, recent work introduces an efficient one-stage weakly supervised REC model called RefCLIP. Specifically, RefCLIP uses the anchor features of a pre-trained one-stage detection network to represent candidate objects and performs anchor-text ranking to locate the referent. Despite its effectiveness, we find that the visual semantics of RefCLIP are ambiguous and insufficient for weakly supervised REC modeling. To remedy this, we propose a novel method that enriches visual semantics with various prompt information, called Anchor-based Prompt Learning (APL). Specifically, APL contains an innovative anchor-based prompt encoder (APE) that produces discriminative prompts covering three aspects of REC modeling, i.e., position, color and category. These prompts are dynamically fused into the anchor features to improve their descriptive power. In addition, we propose two novel auxiliary objectives for accurate vision-language alignment in APL, namely a text reconstruction loss and a visual alignment loss. To validate APL, we conduct extensive experiments on four REC benchmarks, namely RefCOCO, RefCOCO+, RefCOCOg and ReferIt. Experimental results not only show the state-of-the-art performance of APL over existing methods on all four benchmarks, e.g., +6.44% over RefCLIP on RefCOCO, but also confirm its strong generalization to weakly supervised referring expression segmentation. Source code is released at: https://github.com/Yaxin9Luo/APL.
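As a reading aid, below is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: position, color and category prompts are fused into detector anchor features, and candidate anchors are then ranked against the expression embedding. All names (AnchorPromptEncoder, anchor_text_ranking), dimensions and projection choices are illustrative assumptions, not the released APL implementation.

```python
# Illustrative sketch only: symbols and shapes below are assumptions that mirror
# the abstract's description, not code from the APL repository.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorPromptEncoder(nn.Module):
    """Produce position, color and category prompts and fuse them into anchor features."""
    def __init__(self, dim: int):
        super().__init__()
        self.pos_proj = nn.Linear(4, dim)       # (x, y, w, h) of each anchor box
        self.color_proj = nn.Linear(3, dim)     # mean RGB inside each anchor region
        self.cat_embed = nn.Embedding(80, dim)  # detector class id (assuming 80 COCO classes)
        self.fuse = nn.Linear(4 * dim, dim)

    def forward(self, anchor_feat, boxes, colors, class_ids):
        # anchor_feat: (B, N, dim); boxes: (B, N, 4); colors: (B, N, 3); class_ids: (B, N)
        prompts = torch.cat(
            [anchor_feat,
             self.pos_proj(boxes),
             self.color_proj(colors),
             self.cat_embed(class_ids)], dim=-1)
        return self.fuse(prompts)               # prompt-enriched anchor features, (B, N, dim)

def anchor_text_ranking(anchor_feat, text_feat):
    """Rank candidate anchors against the expression embedding by cosine similarity."""
    # anchor_feat: (B, N, dim); text_feat: (B, dim)
    scores = F.cosine_similarity(anchor_feat, text_feat.unsqueeze(1), dim=-1)  # (B, N)
    return scores.argmax(dim=1)                 # index of the predicted referent anchor
```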
Notes
1. The pseudo-box is produced via anchor-text matching.
2. We adopt average pooling and K-means to accurately capture object colors (a minimal sketch is given after these notes).
3. Validation and testing images of the REC task are removed.
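The following is a minimal sketch of one possible reading of Note 2, assuming the object color is taken as the dominant K-means cluster centre over average-pooled pixels inside an anchor box; the function name and parameters are illustrative, not from the APL code.

```python
# Hedged sketch of Note 2: average-pool pixels inside a (hypothetical) anchor box,
# then cluster with K-means and keep the dominant cluster centre as the object color.
import numpy as np
from sklearn.cluster import KMeans

def dominant_color(image: np.ndarray, box, k: int = 3, pool: int = 4) -> np.ndarray:
    """image: HxWx3 uint8 array; box: (x1, y1, x2, y2) in pixel coordinates."""
    x1, y1, x2, y2 = map(int, box)
    crop = image[y1:y2, x1:x2].astype(np.float32)

    # Average pooling: shrink the crop so clustering sees smoothed pool x pool cells.
    h, w = crop.shape[:2]
    pool = max(1, min(pool, h, w))          # guard against boxes smaller than the pool size
    h, w = (h // pool) * pool, (w // pool) * pool
    crop = crop[:h, :w].reshape(h // pool, pool, w // pool, pool, 3).mean(axis=(1, 3))

    pixels = crop.reshape(-1, 3)
    k = min(k, len(pixels))                 # avoid asking for more clusters than pixels
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_[counts.argmax()]  # RGB of the largest cluster
```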
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 623B2088).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Luo, Y., Ji, J., Chen, X., Zhang, Y., Ren, T., Luo, G. (2025). APL: Anchor-Based Prompt Learning for One-Stage Weakly Supervised Referring Expression Comprehension. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_12
DOI: https://doi.org/10.1007/978-3-031-72624-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72623-1
Online ISBN: 978-3-031-72624-8