
APL: Anchor-Based Prompt Learning for One-Stage Weakly Supervised Referring Expression Comprehension

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15071)


Abstract

Referring Expression Comprehension (REC) aims to ground the target object described by a given referring expression, which requires expensive instance-level annotations for training. To address this issue, recent advances explore an efficient one-stage weakly supervised REC model called RefCLIP. In particular, RefCLIP uses the anchor features of a pre-trained one-stage detection network to represent candidate objects and conducts anchor-text ranking to locate the referent. Despite its effectiveness, we identify that the visual semantics of RefCLIP are ambiguous and insufficient for weakly supervised REC modeling. To overcome this limitation, we propose a novel method that enriches visual semantics with various prompt information, called anchor-based prompt learning (APL). Specifically, APL contains an innovative anchor-based prompt encoder (APE) that produces discriminative prompts covering three aspects of REC modeling, i.e., position, color and category. These prompts are dynamically fused into the anchor features to improve their visual description power. In addition, we propose two novel auxiliary objectives to achieve accurate vision-language alignment in APL, namely a text reconstruction loss and a visual alignment loss. To validate APL, we conduct extensive experiments on four REC benchmarks, namely RefCOCO, RefCOCO+, RefCOCOg and ReferIt. Experimental results not only show the state-of-the-art performance of APL against existing methods on all four benchmarks, e.g., +6.44% over RefCLIP on RefCOCO, but also confirm its strong generalization ability on weakly supervised referring expression segmentation. Source code is released at: https://github.com/Yaxin9Luo/APL.
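The abstract describes APL at a high level: anchor features from a pre-trained one-stage detector are enriched with position, color and category prompts and then ranked against the expression embedding to pick the referent. The following minimal PyTorch sketch illustrates that prompt-enriched anchor-text ranking idea under stated assumptions; the module name, dimensions, concatenation-based fusion and temperature are hypothetical and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnchorPromptRanking(nn.Module):
    """Illustrative anchor-text ranking with prompt-enriched anchor features."""

    def __init__(self, anchor_dim=512, prompt_dim=512, text_dim=512):
        super().__init__()
        # Hypothetical fusion: concatenate prompts with anchor features and project.
        self.fuse = nn.Linear(anchor_dim + prompt_dim, text_dim)
        # Temperature for the similarity scores, as in CLIP-style ranking.
        self.scale = nn.Parameter(torch.tensor(1.0 / 0.07))

    def forward(self, anchor_feats, prompt_feats, text_feat):
        # anchor_feats: (N, anchor_dim) candidate anchors from a one-stage detector
        # prompt_feats: (N, prompt_dim) prompt embeddings (position/color/category)
        # text_feat:    (text_dim,)     embedding of the referring expression
        fused = self.fuse(torch.cat([anchor_feats, prompt_feats], dim=-1))
        fused = F.normalize(fused, dim=-1)
        text = F.normalize(text_feat, dim=-1)
        scores = self.scale * (fused @ text)   # (N,) anchor-text similarities
        return scores.argmax(), scores         # predicted anchor index and all scores


# Toy usage: rank 100 candidate anchors against one expression embedding.
ranker = AnchorPromptRanking()
anchors = torch.randn(100, 512)
prompts = torch.randn(100, 512)
text = torch.randn(512)
best_idx, scores = ranker(anchors, prompts, text)
```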


Notes

  1. The pseudo-box is produced via anchor-text matching.

  2. We adopt average pooling and K-means to capture the colors of objects accurately (see the sketch after these notes).

  3. Validation and testing images of the REC task are removed.
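Note 2 mentions average pooling and K-means for capturing object colors. As a hedged illustration only (the crop, pooling resolution, number of clusters and helper names are assumptions, not the paper's exact procedure), a dominant color for a pseudo-box region could be extracted as follows:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def dominant_color(image, box, pool_size=16, n_clusters=3):
    # image: (3, H, W) float tensor in [0, 1]; box: (x1, y1, x2, y2) pseudo-box.
    x1, y1, x2, y2 = box
    region = image[:, y1:y2, x1:x2].unsqueeze(0)        # crop the referent region
    pooled = F.adaptive_avg_pool2d(region, pool_size)   # average pooling smooths noise
    pixels = pooled.squeeze(0).reshape(3, -1).T.numpy() # (pool_size**2, 3) RGB samples
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(pixels)
    # Take the centre of the largest cluster as the object's dominant color.
    largest = np.bincount(km.labels_).argmax()
    return km.cluster_centers_[largest]


# Toy usage: a random image and a pseudo-box produced by anchor-text matching.
img = torch.rand(3, 480, 640)
print(dominant_color(img, (100, 120, 260, 300)))
```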


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 623B2088).

Author information

Corresponding author

Correspondence to Gen Luo.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Luo, Y., Ji, J., Chen, X., Zhang, Y., Ren, T., Luo, G. (2025). APL: Anchor-Based Prompt Learning for One-Stage Weakly Supervised Referring Expression Comprehension. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_12

  • DOI: https://doi.org/10.1007/978-3-031-72624-8_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72623-1

  • Online ISBN: 978-3-031-72624-8

  • eBook Packages: Computer Science, Computer Science (R0)
