
How Many Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15109)


Abstract

This work focuses on benchmarking the capabilities of vision large language models (VLLMs) in visual reasoning. Unlike prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, Unicorn, covering out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel visual question-answering (VQA) datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy for misleading VLLMs into producing visually unrelated responses. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or the language input of VLLMs. Our evaluation of 22 diverse models, ranging from open-source VLLMs to GPT-4V and Gemini Pro, yields interesting observations: 1) current VLLMs struggle with OOD texts but not images, unless the visual information is limited; and 2) these VLLMs can be easily misled by deceiving the vision encoder only, and their vision-language training often compromises safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vllm-safety-benchmark.
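To make the vision-only deception concrete, below is a minimal sketch, not the paper's exact attack, of a PGD-style perturbation against a frozen CLIP vision encoder: it pushes the image embedding away from the clean one so a VLLM conditioned on those features receives evidence unrelated to the actual image. The checkpoint name, file path, and attack budget are illustrative assumptions, and the perturbation is applied in the processor's normalized pixel space for simplicity.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Frozen CLIP vision encoder standing in for a VLLM's vision tower.
# The checkpoint is an illustrative choice, not necessarily the encoder
# used by any model in the benchmark.
model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
for p in model.parameters():
    p.requires_grad_(False)

image = Image.open("input.jpg")  # hypothetical input image
pixels = processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    clean_emb = model(pixel_values=pixels).pooler_output

# PGD in the processor's normalized pixel space; the budget is an assumed
# toy setting, and clamping to the valid image range is omitted for brevity.
eps, alpha, steps = 8 / 255, 1 / 255, 40
delta = torch.zeros_like(pixels, requires_grad=True)

for _ in range(steps):
    adv_emb = model(pixel_values=pixels + delta).pooler_output
    # Untargeted objective: maximizing the negative cosine similarity drives
    # the adversarial embedding away from the clean one, so the downstream
    # language model sees features that no longer describe the image.
    loss = -F.cosine_similarity(adv_emb, clean_emb, dim=-1).mean()
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()
        delta.clamp_(-eps, eps)
        delta.grad.zero_()

adv_pixels = pixels + delta.detach()  # feed to the VLLM in place of `pixels`
```

Note that this untargeted form needs no access to the language model at all, which is the point of the abstract's second observation: deceiving the vision encoder alone can suffice to derail a VLLM's response.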

H. Tu, C. Cui, and Z. Wang: equal technical contribution.




Acknowledgements

This work is partially supported by a gift from Open Philanthropy and a Cisco Faculty Research Award. We thank the Center for AI Safety, the Microsoft Accelerate Foundation Models Research Program, and the OpenAI Researcher Access Program for supporting our computing needs.

Author information

Corresponding author: Haoqin Tu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 5486 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tu, H. et al. (2025). How Many Are in This Image? A Safety Evaluation Benchmark for Vision LLMs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15109. Springer, Cham. https://doi.org/10.1007/978-3-031-72983-6_3


  • DOI: https://doi.org/10.1007/978-3-031-72983-6_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72982-9

  • Online ISBN: 978-3-031-72983-6

  • eBook Packages: Computer Science, Computer Science (R0)
