Abstract
Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a “visual prompt”, where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data with visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer. CRG achieves substantial improvements in a wide variety of VL tasks: When region annotations are provided, CRG increases absolute accuracy by up to \(11.1\%\) on ViP-Bench, a collection of six diverse region-based tasks such as recognition, math, and object relationship reasoning. We also show CRG’s applicability to spatial reasoning, with \(10\%\) improvement on What’sUp, as well as to compositional generalization – improving accuracy by \(11.5\%\) and \(7.5\%\) on two challenging splits from SugarCrepe – and to image-text alignment for generated images, where we improve by 8.4 AUROC and 6.8 F1 points on SeeTRUE. CRG also allows us to re-rank proposed regions in referring expression comprehension and phrase grounding benchmarks like RefCOCO/+/g and Flickr30K Entities, with an average gain of \(3.2\%\) in accuracy. Our analysis explores alternative masking strategies for CRG, empirically validating CRG’s design choices (Project page: https://contrastive-region-guidance.github.io/).
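The contrast described above can be summarized in a short score-level sketch. The snippet below is a minimal illustration, not the authors' implementation: it assumes a hypothetical `vlm_answer_logprob(image, prompt, answer)` helper that returns the log-probability a frozen VLM assigns to a candidate answer, blacks out the visually prompted boxes to form the "without visual prompt" input, and combines the two passes with an illustrative guidance weight `alpha` in the style of classifier-free guidance.

```python
from PIL import Image, ImageDraw


def mask_regions(image: Image.Image, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Return a copy of the image with the visually prompted regions blacked out."""
    masked = image.copy()
    draw = ImageDraw.Draw(masked)
    for box in boxes:  # each box is (x0, y0, x1, y1) in pixel coordinates
        draw.rectangle(box, fill="black")
    return masked


def crg_score(vlm_answer_logprob, image, boxes, prompt, answer, alpha=1.0):
    """Contrastive region-guidance score for one candidate answer.

    `vlm_answer_logprob(image, prompt, answer)` is a hypothetical helper
    returning the log-probability the VLM assigns to `answer`.
    """
    masked = mask_regions(image, boxes)
    with_regions = vlm_answer_logprob(image, prompt, answer)      # key evidence visible
    without_regions = vlm_answer_logprob(masked, prompt, answer)  # key evidence removed
    # Amplify what the highlighted regions contribute and subtract the
    # model's region-independent bias (guidance-style combination).
    return (1 + alpha) * with_regions - alpha * without_regions


def rank_candidates(vlm_answer_logprob, image, boxes, prompt, candidates, alpha=1.0):
    """Re-rank candidate answers (or proposed regions) by their contrastive scores."""
    return max(candidates, key=lambda c: crg_score(vlm_answer_logprob, image, boxes, prompt, c, alpha))
```

The same contrast can in principle be applied token by token during decoding rather than only at the answer level; the score-level form shown here matches the re-ranking use case (e.g., referring expression comprehension) mentioned in the abstract.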
Acknowledgements
We thank Peter Hase for the thoughtful discussion, and the anonymous reviewers for their feedback. This work was supported by DARPA ECOLE Program No. HR00112390060, NSF-AI Engage Institute DRL-2112635, DARPA Machine Commonsense (MCS) Grant N66001-19-2-4031, ARO Award W911NF2110220, ONR Grant N00014-23-1-2356, and a Bloomberg Data Science Ph.D. Fellowship. The views contained in this article are those of the authors and not of the funding agency.