Abstract
With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), the imperative to ensure their safety has become increasingly pronounced. However, with the integration of additional modalities, MLLMs are exposed to new vulnerabilities, rendering them prone to structure-based jailbreak attacks, in which malicious semantic content (e.g., "harmful text") is injected into images to mislead MLLMs. In this work, we aim to defend against such threats. Specifically, we propose Adaptive Shield Prompting (AdaShield), which prepends inputs with defense prompts to protect MLLMs against structure-based jailbreak attacks without fine-tuning the MLLMs or training additional modules (e.g., a post-stage content detector). We first present a manually designed static defense prompt, which thoroughly examines the image and instruction content step by step and specifies how to respond to malicious queries. We then introduce an adaptive auto-refinement framework consisting of a target MLLM and an LLM-based defense prompt generator (Defender), which communicate collaboratively and iteratively to generate a defense prompt. Extensive experiments on popular structure-based jailbreak attacks and benign datasets show that our methods consistently improve MLLMs' robustness against structure-based jailbreak attacks without compromising the models' general capabilities on standard benign tasks. Our code is available at https://rain305f.github.io/AdaShield-Project.
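To make the adaptive variant concrete, the following is a minimal conceptual sketch of the iterative refinement loop described in the abstract: a Defender rewrites the defense prompt until the target MLLM refuses the malicious query. The function names `query_mllm`, `query_defender`, and `is_safe_response` are hypothetical placeholders, not the authors' actual implementation or any specific model API.

```python
# Conceptual sketch of AdaShield's adaptive prompt refinement (assumptions noted above).

def adashield_refine(image, instruction, init_defense_prompt,
                     query_mllm, query_defender, is_safe_response,
                     max_iters=5):
    """Iteratively refine a defense prompt until the target MLLM
    refuses the (potentially malicious) image-text query."""
    defense_prompt = init_defense_prompt
    for _ in range(max_iters):
        # Prepend the current defense prompt to the textual instruction.
        response = query_mllm(image=image,
                              text=defense_prompt + "\n" + instruction)
        if is_safe_response(response):
            # The defense prompt successfully elicits a refusal; keep it.
            return defense_prompt
        # Otherwise, ask the Defender LLM to rewrite the defense prompt,
        # conditioned on the previous prompt and the unsafe response.
        defense_prompt = query_defender(prev_prompt=defense_prompt,
                                        mllm_response=response)
    return defense_prompt  # best-effort prompt after the iteration budget
```

At inference time, the resulting defense prompt is simply prepended to the user's input, so no fine-tuning of the MLLM or additional trained modules is required.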
Y. Wang and X. Liu: Equal contribution.
Notes
- 1.
We also compare our method with the defense method MLLMP [45], whose code was released on 02/29/2024, against structure-based attacks and on benign datasets. The complete results are provided in the appendix.
References
Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Awadalla, A., et al.: OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. arXiv preprint arXiv:2308.01390 (2023)
Bai, J., et al.: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966 (2023)
Cao, H., Liu, Z., Lu, X., Yao, Y., Li, Y.: Instructmol: multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2311.16208 (2023)
Cao, H., et al.: Presto: progressive pretraining enhances synthetic chemistry outcomes. arXiv preprint arXiv:2406.13193 (2024)
Carlini, N., et al.: Are aligned neural networks adversarially aligned? (2023)
Cha, S., Lee, J., Lee, Y., Yang, C.: Visually Dehallucinative Instruction Generation: Know What You Don’t Know. arXiv preprint arXiv:2303.16199 (2024)
Chen, J., et al.: MiniGPT-V2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. arXiv preprint arXiv:2306.15195 (2023)
Chen, Y., Sikka, K., Cogswell, M., Ji, H., Divakaran, A.: DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback. arXiv preprint arXiv:2311.10081 (2023)
Costa, J.C., Roxo, T., Proença, H., Inácio, P.R.M.: How Deep Learning Sees the World: A Survey on Adversarial Attacks and Defenses. arXiv preprint arXiv:2305.10862 (2023)
Dong, X., et al.: InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model. arXiv preprint arXiv:2401.16420 (2024)
Dong, Y., et al.: How Robust is Google’s Bard to Adversarial Image Attacks? arXiv preprint arXiv:2309.11751 (2023)
Dong, Z., Zhou, Z., Yang, C., Shao, J., Qiao, Y.: Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey. arXiv preprint arXiv:2402.09283 (2024)
Fu, C., et al.: MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394 (2023)
Fu, C., et al.: A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise. arXiv preprint arXiv:2312.12436 (2023)
Ge, J., Luo, H., Qian, S., Gan, Y., Fu, J., Zhan, S.: Chain of Thought Prompt Tuning in Vision Language Models. arXiv preprint arXiv:2304.07919 (2023)
Gong, Y., et al.: FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts. arXiv preprint arXiv:2311.05608 (2023)
Gu, X., et al.: Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast. arXiv preprint arXiv:2402.08567 (2024)
Guo, P., Yang, Z., Lin, X., Zhao, Q., Zhang, Q.: PuriDefense: Randomized Local Implicit Adversarial Purification for Defending Black-box Query-based Attacks. arXiv preprint arXiv:2401.10586 (2024)
Han, D., Jia, X., Bai, Y., Gu, J., Liu, Y., Cao, X.: OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization. arXiv preprint arXiv:2312.04403 (2023)
Ji, Y., et al.: Large Language Models as Automated Aligners for benchmarking Vision-Language Models. arXiv preprint arXiv:2311.14580 (2023)
Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: NeurIPS (2022)
Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial machine learning at scale. In: ICLR (2017)
Li, H., et al.: Freestyleret: retrieving images from style-diversified queries. arXiv preprint arXiv:2312.02428 (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)
Li, L., et al.: Silkie: Preference Distillation for Large Visual Language Models. arXiv preprint arXiv:2312.10665 (2023)
Li, M., Li, L., Yin, Y., Ahmed, M., Liu, Z., Liu, Q.: Red Teaming Visual Language Models. arXiv preprint arXiv:2401.12915 (2024)
Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv preprint arXiv:2311.10122 (2023)
Liu, H., et al.: A Survey on Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2402.00253 (2024)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
Liu, M., Roy, S., Li, W., Zhong, Z., Sebe, N., Ricci, E.: Democratizing fine-grained visual recognition with large language models. In: ICLR (2024)
Liu, S., et al.: Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. arXiv preprint arXiv:2212.10789 (2024)
Liu, X., et al.: AgentBench: evaluating LLMs as agents. In: ICLR (2024)
Liu, X., Xu, N., Chen, M., Xiao, C.: Generating stealthy jailbreak prompts on aligned large language models. In: ICLR (2024)
Liu, X., Zhu, Y., Lan, Y., Yang, C., Qiao, Y.: Query-Relevant Images Jailbreak Large Multi-Modal Models (2023)
Liu, X., Zhu, Y., Lan, Y., Yang, C., Qiao, Y.: Safety of Multimodal Large Language Models on Images and Text. arXiv preprint arXiv:2402.00357 (2024)
Lu, X., et al.: Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension. arXiv preprint arXiv:2403.08192 (2024)
Lyu, H., et al.: GPT-4v(ision) as a social media analysis engine. arXiv preprint arXiv:2311.07547 (2023)
Mao, C., Chiquier, M., Wang, H., Yang, J., Vondrick, C.: Adversarial attacks are reversible with natural supervision. In: ICCV (2021)
Meta: Llama usage policy (2023). Accessed Oct 2023
Naveed, H., et al.: A Comprehensive Overview of Large Language Models. arXiv preprint arXiv:2307.06435 (2024)
Niu, Z., Ren, H., Gao, X., Hua, G., Jin, R.: Jailbreaking Attack against Multimodal Large Language Model. arXiv preprint arXiv:2402.02309 (2024)
OpenAI: OpenAI usage policy (2023). Accessed Oct 2023
Pi, R., et al.: MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance. arXiv preprint arXiv:2401.02906 (2024)
Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P.: Visual Adversarial Examples Jailbreak Aligned Large Language Models. arXiv preprint arXiv:2306.13213 (2023)
Rizwan, N., Bhaskar, P., Das, M., Majhi, S.S., Saha, P., Mukherjee, A.: Zero shot VLMs for hate meme detection: Are we there yet? arXiv preprint arXiv:2402.12198 (2024)
Schlarmann, C., Hein, M.: On the adversarial robustness of multi-modal foundation models. In: ICCV (2023)
Shayegani, E., Dong, Y., Abu-Ghazaleh, N.: Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models. arXiv preprint arXiv:2307.14539 (2023)
Shayegani, E., Mamun, M.A.A., Fu, Y., Zaree, P., Dong, Y., Abu-Ghazaleh, N.: Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv preprint arXiv:2310.10844 (2023)
Sun, Z., et al.: Aligning Large Multimodal Models with Factually Augmented RLHF. arXiv preprint arXiv:2309.14525 (2023)
Wang, B., et al.: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. arXiv preprint arXiv:2306.11698 (2024)
Wang, W., et al.: CogVLM: visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)
Wei, T., et al.: Skywork: A More Open Bilingual Foundation Model. arXiv preprint arXiv:2310.19341 (2023)
Yang, J., Zhang, H., Li, F., Zou, X., Li, C., Gao, J.: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023)
Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
Ye, Q., et al.: mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. arXiv preprint arXiv:2311.04257 (2023)
Yin, S., et al.: A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549 (2023)
Yin, S., et al.: Woodpecker: Hallucination Correction for Multimodal Large Language Models. arXiv preprint arXiv:2310.16045 (2023)
Yu, T., et al.: RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback. arXiv preprint arXiv:2312.00849 (2023)
Yu, W., et al.: MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv preprint arXiv:2308.02490 (2023)
Zhang, D., et al.: MM-LLMs: Recent Advances in MultiModal Large Language Models. arXiv preprint arXiv:2401.13601 (2024)
Zhang, H., Li, X., Bing, L.: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv preprint arXiv:2306.02858 (2023)
Zhang, R., et al.: LLaMA-Adapter: Efficient Finetuning of Language Models with Zero-init Attention. arXiv preprint arXiv:2303.16199 (2023)
Zhang, X., et al.: A mutation-based method for multi-modal jailbreaking attack detection. arXiv preprint arXiv:2312.10766 (2023)
Zhang, X., Li, R., Yu, J., Xu, Y., Li, W., Zhang, J.: EditGuard: versatile image watermarking for tamper localization and copyright protection. In: CVPR (2024)
Zhao, Y., et al.: On evaluating adversarial robustness of large vision-language models. In: NeurIPS (2023)
Zheng, C., et al.: Prompt-Driven LLM Safeguarding via Directed Representation Optimization. arXiv preprint arXiv:2401.18018 (2024)
Zheng, G., Yang, B., Tang, J., Zhou, H.Y., Yang, S.: DDCoT: duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In: NeurIPS (2023)
Zhou, H., et al.: A Survey of Large Language Models in Medicine: Progress, Application, and Challenge. arXiv preprint arXiv:2311.05112 (2023)
Zhu, B., et al.: LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv preprint arXiv:2310.01852 (2023)
Zong, Y., Bohdal, O., Yu, T., Yang, Y., Hospedales, T.: Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models. arXiv preprint arXiv:2402.02207 (2024)
Acknowledgements
We sincerely thank the reviewers for their insightful comments and valuable feedback, which significantly improved the quality of our manuscript.
Ethics declarations
Disclaimer
This paper contains offensive content that may be disturbing to some readers.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, Y., Liu, X., Li, Y., Chen, M., Xiao, C. (2025). AdaShield: Safeguarding Multimodal Large Language Models from Structure-Based Attack via Adaptive Shield Prompting. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_5
DOI: https://doi.org/10.1007/978-3-031-72661-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72660-6
Online ISBN: 978-3-031-72661-3