Abstract
This work benchmarks the capabilities of vision large language models (VLLMs) in visual reasoning. Unlike prior studies, we shift the focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, Unicorn, covering out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel visual question-answering (VQA) datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy that misleads VLLMs into producing visually unrelated responses. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or the language input of VLLMs. Our evaluation of 22 diverse models, ranging from open-source VLLMs to GPT-4V and Gemini Pro, yields interesting observations: 1) current VLLMs struggle with OOD texts but not with OOD images, unless the visual information is limited; and 2) these VLLMs can be easily misled by deceiving their vision encoders alone, and their vision-language training often compromises safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vllm-safety-benchmark.
H. Tu, C. Cui and Z. Wang—Equal Technical Contribution.
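The second observation above concerns attacks that deceive only the vision encoder. As a rough illustration of that idea (a minimal sketch, not the paper's exact method), the snippet below runs an untargeted PGD attack in the embedding space of a generic differentiable image encoder; the `encoder` handle, the cosine-similarity loss, and the hyperparameters `eps`, `alpha`, and `steps` are illustrative assumptions rather than the settings used in the benchmark.

```python
# Hedged sketch: untargeted PGD in a vision encoder's embedding space.
# `encoder` is assumed to be any differentiable image encoder (e.g., a
# CLIP-style vision tower) that maps a [B, 3, H, W] batch in [0, 1] to
# feature vectors and handles its own normalization internally.
import torch
import torch.nn.functional as F

def embedding_space_attack(encoder, image, eps=8/255, alpha=1/255, steps=40):
    """Push `image`'s embedding away from its clean embedding while staying
    inside an L-infinity ball of radius `eps` around the original pixels."""
    encoder.eval()
    with torch.no_grad():
        clean_feat = encoder(image)                       # reference embedding
    adv = image.clone().detach()

    for _ in range(steps):
        adv.requires_grad_(True)
        adv_feat = encoder(adv)
        # Minimizing cosine similarity drives the adversarial embedding
        # away from the clean one.
        loss = F.cosine_similarity(adv_feat, clean_feat, dim=-1).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()               # descend similarity
            adv = image + (adv - image).clamp(-eps, eps)  # project to eps-ball
            adv = adv.clamp(0.0, 1.0)                     # keep valid pixels
    return adv.detach()
```

Feeding the resulting image to a VLLM whose vision tower matches `encoder` is one way to probe whether its responses become unrelated to the visible content, in the spirit of the attack described in the abstract.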
Acknowledgements
This work is partially supported by a gift from Open Philanthropy and a Cisco Faculty Research Award. We thank the Center for AI Safety, the Microsoft Accelerate Foundation Models Research Program, and the OpenAI Researcher Access Program for supporting our computing needs.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tu, H. et al. (2025). How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15109. Springer, Cham. https://doi.org/10.1007/978-3-031-72983-6_3
DOI: https://doi.org/10.1007/978-3-031-72983-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72982-9
Online ISBN: 978-3-031-72983-6
eBook Packages: Computer Science, Computer Science (R0)