Abstract
This work benchmarks the capabilities of vision large language models (VLLMs) in visual reasoning. Unlike prior studies, we shift the focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, Unicorn, covering out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel visual question-answering (VQA) datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy that misleads VLLMs into producing visually unrelated responses. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or the language input of VLLMs. Our evaluation of 22 diverse models, ranging from open-source VLLMs to GPT-4V and Gemini Pro, yields interesting observations: 1) current VLLMs struggle with OOD texts but not with OOD images, unless the visual information is limited; and 2) these VLLMs can be easily misled by deceiving their vision encoders alone, and their vision-language training often compromises safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vllm-safety-benchmark.
H. Tu, C. Cui and Z. Wang—Equal Technical Contribution.
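The second observation above concerns attacks that deceive only the vision encoder. As a rough illustration of that idea (a minimal sketch, not the paper's exact method), the snippet below runs an untargeted PGD attack in the embedding space of a generic differentiable image encoder; the `encoder` handle, the cosine-similarity loss, and the hyperparameters `eps`, `alpha`, and `steps` are illustrative assumptions rather than the settings used in the benchmark.

```python
# Hedged sketch: untargeted PGD in a vision encoder's embedding space.
# `encoder` is assumed to be any differentiable image encoder (e.g., a
# CLIP-style vision tower) that maps a [B, 3, H, W] batch in [0, 1] to
# feature vectors and handles its own normalization internally.
import torch
import torch.nn.functional as F

def embedding_space_attack(encoder, image, eps=8/255, alpha=1/255, steps=40):
    """Push `image`'s embedding away from its clean embedding while staying
    inside an L-infinity ball of radius `eps` around the original pixels."""
    encoder.eval()
    with torch.no_grad():
        clean_feat = encoder(image)                       # reference embedding
    adv = image.clone().detach()

    for _ in range(steps):
        adv.requires_grad_(True)
        adv_feat = encoder(adv)
        # Minimizing cosine similarity drives the adversarial embedding
        # away from the clean one.
        loss = F.cosine_similarity(adv_feat, clean_feat, dim=-1).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()               # descend similarity
            adv = image + (adv - image).clamp(-eps, eps)  # project to eps-ball
            adv = adv.clamp(0.0, 1.0)                     # keep valid pixels
    return adv.detach()
```

Feeding the resulting image to a VLLM whose vision tower matches `encoder` is one way to probe whether its responses become unrelated to the visible content, in the spirit of the attack described in the abstract.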
Acknowledgements
This work is partially supported by a gift from Open Philanthropy and a Cisco Faculty Research Award. We thank the Center for AI Safety, the Microsoft Accelerate Foundation Models Research Program, and the OpenAI Researcher Access Program for supporting our computing needs.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tu, H. et al. (2025). How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15109. Springer, Cham. https://doi.org/10.1007/978-3-031-72983-6_3
DOI: https://doi.org/10.1007/978-3-031-72983-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72982-9
Online ISBN: 978-3-031-72983-6
eBook Packages: Computer Science, Computer Science (R0)