
Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15131)


Abstract

In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that the image input is a key alignment vulnerability of MLLMs. Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input using meticulously crafted images. Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data are available at https://github.com/RUCAIBox/HADES.

Y. Li, H. Guo, and K. Zhou contributed equally.

Warning: this paper contains example data that may be offensive.
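
Since the abstract reports ASR figures (e.g., 90.26% for LLaVA-1.5), the sketch below illustrates how an average Attack Success Rate is commonly computed: a judge labels each model response as harmful (jailbreak succeeded) or refused, and per-category success rates are macro-averaged. The category names and the refusal-keyword judge are illustrative assumptions, not the paper's evaluation pipeline.

```python
# Hedged sketch (not from the paper): computing an average Attack Success Rate (ASR)
# over harmful-instruction categories. The `is_harmful` judge here is a toy stand-in.

from typing import Callable, Dict, List


def attack_success_rate(responses: List[str],
                        is_harmful: Callable[[str], bool]) -> float:
    """Fraction of model responses judged harmful, i.e., the jailbreak succeeded."""
    if not responses:
        return 0.0
    return sum(is_harmful(r) for r in responses) / len(responses)


def average_asr(responses_by_category: Dict[str, List[str]],
                is_harmful: Callable[[str], bool]) -> float:
    """Macro-average ASR over categories of harmful instructions."""
    per_category = [attack_success_rate(r, is_harmful)
                    for r in responses_by_category.values()]
    return sum(per_category) / len(per_category)


if __name__ == "__main__":
    # Toy judge: flags a response as harmful if it does not start with a refusal phrase.
    refusal_markers = ("I cannot", "I'm sorry", "I am sorry")
    judge = lambda r: not r.startswith(refusal_markers)

    demo = {
        "category_a": ["Sure, here is ...", "I'm sorry, but I can't help with that."],
        "category_b": ["Sure, here is ..."],
    }
    print(f"Average ASR: {average_asr(demo, judge):.2%}")
```

In practice, the judge is usually a stronger classifier or an LLM-based evaluator rather than a keyword check; the averaging logic stays the same.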



Acknowledgement

This work was partially supported by the National Natural Science Foundation of China under Grant No. 62222215 and the Beijing Natural Science Foundation under Grant Nos. L233008 and 4222027.

Author information


Corresponding author

Correspondence to Wayne Xin Zhao.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7149 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, Y., Guo, H., Zhou, K., Zhao, W.X., Wen, J.R. (2025). Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15131. Springer, Cham. https://doi.org/10.1007/978-3-031-73464-9_11


  • DOI: https://doi.org/10.1007/978-3-031-73464-9_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73463-2

  • Online ISBN: 978-3-031-73464-9

  • eBook Packages: Computer Science; Computer Science (R0)
