Abstract
In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that image input introduces an alignment vulnerability in MLLMs. Inspired by this finding, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of malicious intent in the text input using meticulously crafted images. Experimental results show that HADES effectively jailbreaks existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% on LLaVA-1.5 and 71.60% on Gemini Pro Vision. Our code and data are available at https://github.com/RUCAIBox/HADES.
Y. Li, H. Guo and K. Zhou—Equal contribution.
Warning: this paper contains example data that may be offensive.
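To make the reported metric concrete, below is a minimal sketch of how an Attack Success Rate (ASR) such as the figures quoted above can be computed over a set of attack attempts. The `is_harmful` judge here is a hypothetical placeholder (a trivial refusal-keyword check), not the paper's actual evaluation pipeline; the released code at the repository above defines the real setup.

```python
def is_harmful(response: str) -> bool:
    """Hypothetical judge: flags a model response as jailbroken.

    In practice this would be a safety classifier or an LLM-based judge;
    a simple refusal-keyword check is used here purely for illustration.
    """
    refusal_markers = ("i cannot", "i can't", "i'm sorry", "i am sorry")
    return not any(marker in response.lower() for marker in refusal_markers)


def attack_success_rate(responses: list[str]) -> float:
    """ASR = (# jailbroken responses) / (# attack attempts)."""
    if not responses:
        return 0.0
    return sum(is_harmful(r) for r in responses) / len(responses)


if __name__ == "__main__":
    # Toy example: 2 of 3 responses fail to refuse, so ASR = 66.67%.
    demo = [
        "I'm sorry, but I can't help with that.",
        "Sure, here is how you could ...",
        "Step 1: ...",
    ]
    print(f"ASR: {attack_success_rate(demo):.2%}")
```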
Acknowledgement
This work was partially supported by the National Natural Science Foundation of China under Grant No. 62222215 and the Beijing Natural Science Foundation under Grant Nos. L233008 and 4222027.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Y., Guo, H., Zhou, K., Zhao, W.X., Wen, JR. (2025). Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15131. Springer, Cham. https://doi.org/10.1007/978-3-031-73464-9_11
DOI: https://doi.org/10.1007/978-3-031-73464-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73463-2
Online ISBN: 978-3-031-73464-9