Abstract
With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), the imperative to ensure their safety has become increasingly pronounced. However, with the integration of additional modalities, MLLMs are exposed to new vulnerabilities, rendering them prone to structure-based jailbreak attacks, in which malicious semantic content (e.g., "harmful text") is injected into images to mislead MLLMs. In this work, we aim to defend against such threats. Specifically, we propose Adaptive Shield Prompting (AdaShield), which prepends inputs with defense prompts to protect MLLMs against structure-based jailbreak attacks without fine-tuning the MLLMs or training additional modules (e.g., a post-stage content detector). We first present a manually designed static defense prompt, which thoroughly examines the image and instruction content step by step and specifies how to respond to malicious queries. We then introduce an adaptive auto-refinement framework consisting of a target MLLM and an LLM-based defense prompt generator (Defender), which communicate collaboratively and iteratively to generate a defense prompt. Extensive experiments on popular structure-based jailbreak attacks and benign datasets show that our methods consistently improve MLLMs' robustness against structure-based jailbreak attacks without compromising the models' general capabilities on standard benign tasks. Our code is available at https://rain305f.github.io/AdaShield-Project.
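To make the adaptive variant concrete, the following is a minimal conceptual sketch of the iterative refinement loop described in the abstract: a Defender rewrites the defense prompt until the target MLLM refuses the malicious query. The function names `query_mllm`, `query_defender`, and `is_safe_response` are hypothetical placeholders, not the authors' actual implementation or any specific model API.

```python
# Conceptual sketch of AdaShield's adaptive prompt refinement (assumptions noted above).

def adashield_refine(image, instruction, init_defense_prompt,
                     query_mllm, query_defender, is_safe_response,
                     max_iters=5):
    """Iteratively refine a defense prompt until the target MLLM
    refuses the (potentially malicious) image-text query."""
    defense_prompt = init_defense_prompt
    for _ in range(max_iters):
        # Prepend the current defense prompt to the textual instruction.
        response = query_mllm(image=image,
                              text=defense_prompt + "\n" + instruction)
        if is_safe_response(response):
            # The defense prompt successfully elicits a refusal; keep it.
            return defense_prompt
        # Otherwise, ask the Defender LLM to rewrite the defense prompt,
        # conditioned on the previous prompt and the unsafe response.
        defense_prompt = query_defender(prev_prompt=defense_prompt,
                                        mllm_response=response)
    return defense_prompt  # best-effort prompt after the iteration budget
```

At inference time, the resulting defense prompt is simply prepended to the user's input, so no fine-tuning of the MLLM or additional trained modules is required.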
Y. Wang and X. Liu: Equal contribution.
Notes
- 1.
We also compare our method with the defense method MLLMP [45], whose code was released on 02/29/2024, against structure-based attacks and on benign datasets. The complete results are provided in the appendix.
References
Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Awadalla, A., et al.: OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. arXiv preprint arXiv:2308.01390 (2023)
Bai, J., et al.: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966 (2023)
Cao, H., Liu, Z., Lu, X., Yao, Y., Li, Y.: Instructmol: multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2311.16208 (2023)
Cao, H., et al.: Presto: progressive pretraining enhances synthetic chemistry outcomes. arXiv preprint arXiv:2406.13193 (2024)
Carlini, N., et al.: Are aligned neural networks adversarially aligned? (2023)
Cha, S., Lee, J., Lee, Y., Yang, C.: Visually Dehallucinative Instruction Generation: Know What You Don’t Know. arXiv preprint arXiv:2303.16199 (2024)
Chen, J., et al.: MiniGPT-V2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. arXiv preprint arXiv:2306.15195 (2023)
Chen, Y., Sikka, K., Cogswell, M., Ji, H., Divakaran, A.: DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback. arXiv preprint arXiv:2311.10081 (2023)
Costa, J.C., Roxo, T., Proença, H., Inácio, P.R.M.: How Deep Learning Sees the World: A Survey on Adversarial Attacks and Defenses. arXiv preprint arXiv:2305.10862 (2023)
Dong, X., et al.: InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model. arXiv preprint arXiv:2401.16420 (2024)
Dong, Y., et al.: How Robust is Google’s Bard to Adversarial Image Attacks? arXiv preprint arXiv:2309.11751 (2023)
Dong, Z., Zhou, Z., Yang, C., Shao, J., Qiao, Y.: Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey. arXiv preprint arXiv:2402.09283 (2024)
Fu, C., et al.: MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394 (2023)
Fu, C., et al.: A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise. arXiv preprint arXiv:2312.12436 (2023)
Ge, J., Luo, H., Qian, S., Gan, Y., Fu, J., Zhan, S.: Chain of Thought Prompt Tuning in Vision Language Models. arXiv preprint arXiv:2304.07919 (2023)
Gong, Y., et al.: FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts. arXiv preprint arXiv:2311.05608 (2023)
Gu, X., et al.: Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast. arXiv preprint arXiv:2402.08567 (2024)
Guo, P., Yang, Z., Lin, X., Zhao, Q., Zhang, Q.: PuriDefense: Randomized Local Implicit Adversarial Purification for Defending Black-box Query-based Attacks. arXiv preprint arXiv:2401.10586 (2024)
Han, D., Jia, X., Bai, Y., Gu, J., Liu, Y., Cao, X.: OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization. arXiv preprint arXiv:2312.04403 (2023)
Ji, Y., et al.: Large Language Models as Automated Aligners for benchmarking Vision-Language Models. arXiv preprint arXiv:2311.14580 (2023)
Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: NeurIPS (2022)
Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial machine learning at scale. In: ICLR (2017)
Li, H., et al.: Freestyleret: retrieving images from style-diversified queries. arXiv preprint arXiv:2312.02428 (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)
Li, L., et al.: Silkie: Preference Distillation for Large Visual Language Models. arXiv preprint arXiv:2312.10665 (2023)
Li, M., Li, L., Yin, Y., Ahmed, M., Liu, Z., Liu, Q.: Red Teaming Visual Language Models. arXiv preprint arXiv:2401.12915 (2024)
Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv preprint arXiv:2311.10122 (2023)
Liu, H., et al.: A Survey on Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2402.00253 (2024)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
Liu, M., Roy, S., Li, W., Zhong, Z., Sebe, N., Ricci, E.: Democratizing fine-grained visual recognition with large language models. In: ICLR (2024)
Liu, S., et al.: Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. arXiv preprint arXiv:2212.10789 (2024)
Liu, X., et al.: AgentBench: evaluating LLMs as agents. In: ICLR (2024)
Liu, X., Xu, N., Chen, M., Xiao, C.: Generating stealthy jailbreak prompts on aligned large language models. In: ICLR (2024)
Liu, X., Zhu, Y., Lan, Y., Yang, C., Qiao, Y.: Query-Relevant Images Jailbreak Large Multi-Modal Models (2023)
Liu, X., Zhu, Y., Lan, Y., Yang, C., Qiao, Y.: Safety of Multimodal Large Language Models on Images and Text. arXiv preprint arXiv:2402.00357 (2024)
Lu, X., et al.: Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension. arXiv preprint arXiv:2403.08192 (2024)
Lyu, H., et al.: GPT-4v(ision) as a social media analysis engine. arXiv preprint arXiv:2311.07547 (2023)
Mao, C., Chiquier, M., Wang, H., Yang, J., Vondrick, C.: Adversarial attacks are reversible with natural supervision. In: ICCV (2021)
Meta: Llama usage policy (2023). Accessed Oct 2023
Naveed, H., et al.: A Comprehensive Overview of Large Language Models. arXiv preprint arXiv:2307.06435 (2024)
Niu, Z., Ren, H., Gao, X., Hua, G., Jin, R.: Jailbreaking Attack against Multimodal Large Language Model. arXiv preprint arXiv:2402.02309 (2024)
OpenAI: OpenAI usage policy (2023). Accessed Oct 2023
Pi, R., et al.: MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance. arXiv preprint arXiv:2401.02906 (2024)
Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P.: Visual Adversarial Examples Jailbreak Aligned Large Language Models. arXiv preprint arXiv:2306.13213 (2023)
Rizwan, N., Bhaskar, P., Das, M., Majhi, S.S., Saha, P., Mukherjee, A.: Zero shot VLMs for hate meme detection: Are we there yet? arXiv preprint arXiv:2402.12198 (2024)
Schlarmann, C., Hein, M.: On the adversarial robustness of multi-modal foundation models. In: ICCV (2023)
Shayegani, E., Dong, Y., Abu-Ghazaleh, N.: Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models. arXiv preprint arXiv:2307.14539 (2023)
Shayegani, E., Mamun, M.A.A., Fu, Y., Zaree, P., Dong, Y., Abu-Ghazaleh, N.: Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv preprint arXiv:2310.10844 (2023)
Sun, Z., et al.: Aligning Large Multimodal Models with Factually Augmented RLHF. arXiv preprint arXiv:2309.14525 (2023)
Wang, B., et al.: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. arXiv preprint arXiv:2306.11698 (2024)
Wang, W., et al.: CogVLM: visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)
Wei, T., et al.: Skywork: A More Open Bilingual Foundation Model. arXiv preprint arXiv:2310.19341 (2023)
Yang, J., Zhang, H., Li, F., Zou, X., Li, C., Gao, J.: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023)
Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
Ye, Q., et al.: mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. arXiv preprint arXiv:2311.04257 (2023)
Yin, S., et al.: A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549 (2023)
Yin, S., et al.: Woodpecker: Hallucination Correction for Multimodal Large Language Models. arXiv preprint arXiv:2310.16045 (2023)
Yu, T., et al.: RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback. arXiv preprint arXiv:2312.00849 (2023)
Yu, W., et al.: MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv preprint arXiv:2308.02490 (2023)
Zhang, D., et al.: MM-LLMs: Recent Advances in MultiModal Large Language Models. arXiv preprint arXiv:2401.13601 (2024)
Zhang, H., Li, X., Bing, L.: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv preprint arXiv:2306.02858 (2023)
Zhang, R., et al.: LLaMA-Adapter: Efficient Finetuning of Language Models with Zero-init Attention. arXiv preprint arXiv:2303.16199 (2023)
Zhang, X., et al.: A mutation-based method for multi-modal jailbreaking attack detection. arXiv preprint arXiv:2312.10766 (2023)
Zhang, X., Li, R., Yu, J., Xu, Y., Li, W., Zhang, J.: EditGuard: versatile image watermarking for tamper localization and copyright protection. In: CVPR (2024)
Zhao, Y., et al.: On evaluating adversarial robustness of large vision-language models. In: NeurIPS (2023)
Zheng, C., et al.: Prompt-Driven LLM Safeguarding via Directed Representation Optimization. arXiv preprint arXiv:2401.18018 (2024)
Zheng, G., Yang, B., Tang, J., Zhou, H.Y., Yang, S.: DDCoT: duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In: NeurIPS (2023)
Zhou, H., et al.: A Survey of Large Language Models in Medicine: Progress, Application, and Challenge. arXiv preprint arXiv:2311.05112 (2023)
Zhu, B., et al.: LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv preprint arXiv:2310.01852 (2023)
Zong, Y., Bohdal, O., Yu, T., Yang, Y., Hospedales, T.: Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models. arXiv preprint arXiv:2402.02207 (2024)
Acknowledgements
We sincerely thank the reviewers for their insightful comments and valuable feedback, which significantly improved the quality of our manuscript.
Ethics declarations
Disclaimer
This paper contains offensive content that may be disturbing to some readers.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, Y., Liu, X., Li, Y., Chen, M., Xiao, C. (2025). AdaShield: Safeguarding Multimodal Large Language Models from Structure-Based Attack via Adaptive Shield Prompting. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_5
DOI: https://doi.org/10.1007/978-3-031-72661-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72660-6
Online ISBN: 978-3-031-72661-3