Abstract
In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that image input introduces an alignment vulnerability in MLLMs. Inspired by this finding, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of malicious intent in the text input using meticulously crafted images. Experimental results show that HADES effectively jailbreaks existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% on LLaVA-1.5 and 71.60% on Gemini Pro Vision. Our code and data are available at https://github.com/RUCAIBox/HADES.
Y. Li, H. Guo and K. Zhou—Equal contribution.
Warning: this paper contains example data that may be offensive.
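To make the reported metric concrete, below is a minimal sketch of how an Attack Success Rate (ASR) such as the figures quoted above can be computed over a set of attack attempts. The `is_harmful` judge here is a hypothetical placeholder (a trivial refusal-keyword check), not the paper's actual evaluation pipeline; the released code at the repository above defines the real setup.

```python
def is_harmful(response: str) -> bool:
    """Hypothetical judge: flags a model response as jailbroken.

    In practice this would be a safety classifier or an LLM-based judge;
    a simple refusal-keyword check is used here purely for illustration.
    """
    refusal_markers = ("i cannot", "i can't", "i'm sorry", "i am sorry")
    return not any(marker in response.lower() for marker in refusal_markers)


def attack_success_rate(responses: list[str]) -> float:
    """ASR = (# jailbroken responses) / (# attack attempts)."""
    if not responses:
        return 0.0
    return sum(is_harmful(r) for r in responses) / len(responses)


if __name__ == "__main__":
    # Toy example: 2 of 3 responses fail to refuse, so ASR = 66.67%.
    demo = [
        "I'm sorry, but I can't help with that.",
        "Sure, here is how you could ...",
        "Step 1: ...",
    ]
    print(f"ASR: {attack_success_rate(demo):.2%}")
```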
Acknowledgement
This work was partially supported by the National Natural Science Foundation of China under Grant No. 62222215 and the Beijing Natural Science Foundation under Grant Nos. L233008 and 4222027.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Y., Guo, H., Zhou, K., Zhao, W.X., Wen, JR. (2025). Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15131. Springer, Cham. https://doi.org/10.1007/978-3-031-73464-9_11
DOI: https://doi.org/10.1007/978-3-031-73464-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73463-2
Online ISBN: 978-3-031-73464-9