Skip to main content

AdaShield : Safeguarding Multimodal Large Language Models from Structure-Based Attack via Adaptive Shield Prompting

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15078))

Included in the following conference series:

  • 219 Accesses

Abstract

With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), the imperative to ensure their safety has become increasingly pronounced. However, with the integration of additional modalities, MLLMs are exposed to new vulnerabilities, rendering them prone to structured-based jailbreak attacks, where semantic content (e.g. “harmful text”) has been injected into the images to mislead MLLMs. In this work, we aim to defend against such threats. Specifically, we propose Adaptive Shield Prompting (AdaShield), which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks without fine-tuning MLLMs or training additional modules (e.g., post-stage content detector). Initially, we present a manually designed static defense prompt, which thoroughly examines the image and instruction content step by step and specifies response methods to malicious queries. Furthermore, we introduce an adaptive auto-refinement framework, consisting of a target MLLM and a LLM-based defense prompt generator (Defender). These components collaboratively and iteratively communicate to generate a defense prompt. Extensive experiments on the popular structure-based jailbreak attacks and benign datasets show that our methods can consistently improve MLLMs’ robustness against structure-based jailbreak attacks without compromising the model’s general capabilities evaluated on standard benign tasks. Our code is available at https://rain305f.github.io/AdaShield-Project.

Y. Wang and X. Liu—Equal contributor.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    We also compare our method with the defense method MLLMP [45] against structure-based attacks and on benign datasets, which has just released its code on 02/29/2024. The complete results are provided in the appendix.

References

  1. Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. Awadalla, A., et al.: OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. arXiv preprint arXiv:2308.01390 (2023)

  3. Bai, J., et al.: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966 (2023)

  4. Cao, H., Liu, Z., Lu, X., Yao, Y., Li, Y.: Instructmol: multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2311.16208 (2023)

  5. Cao, H., et al.: Presto: progressive pretraining enhances synthetic chemistry outcomes. arXiv preprint arXiv:2406.13193 (2024)

  6. Carlini, N., et al.: Are aligned neural networks adversarially aligned? (2023)

    Google Scholar 

  7. Cha, S., Lee, J., Lee, Y., Yang, C.: Visually Dehallucinative Instruction Generation: Know What You Don’t Know. arXiv preprint arXiv:2303.16199 (2024)

  8. Chen, J., et al.: MiniGPT-V2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)

  9. Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. arXiv preprint arXiv:2306.15195 (2023)

  10. Chen, Y., Sikka, K., Cogswell, M., Ji, H., Divakaran, A.: DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback. arXiv preprint arXiv:2311.10081 (2023)

  11. Costa, J.C., Roxo, T., Proença, H., Inácio, P.R.M.: How Deep Learning Sees the World: A Survey on Adversarial Attacksn and Defenses. arXiv preprint arXiv:2305.10862 (2023)

  12. Dong, X., et al.: InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model. arXiv preprint arXiv:2401.16420 (2024)

  13. Dong, Y., et al.: How Robust is Google’s Bard to Adversarial Image Attacks? arXiv preprint arXiv:2309.11751 (2023)

  14. Dong, Z., Zhou, Z., Yang, C., Shao, J., Qiao, Y.: Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey. arXiv preprint arXiv:2402.09283 (2024)

  15. Fu, C., et al.: MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394 (2023)

  16. Fu, C., et al.: A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise. arXiv preprint arXiv:2312.12436 (2023)

  17. Ge, J., Luo, H., Qian, S., Gan, Y., Fu, J., Zhan, S.: Chain of Thought Prompt Tuning in Vision Language Models. arXiv preprint arXiv:2304.07919 (2023)

  18. Gong, Y., et al.: FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts. arXiv preprint arXiv:2311.05608 (2023)

  19. Gu, X., et al.: Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast. arXiv preprint arXiv:2402.08567 (2024)

  20. Guo, P., Yang, Z., Lin, X., Zhao, Q., Zhang, Q.: PuriDefense: Randomized Local Implicit Adversarial Purification for Defending Black-box Query-based Attacks. arXiv preprint arXiv:2401.10586 (2024)

  21. Han, D., Jia, X., Bai, Y., Gu, J., Liu, Y., Cao, X.: OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization. arXiv preprint arXiv:2312.04403 (2023)

  22. Ji, Y., et al.: Large Language Models as Automated Aligners for benchmarking Vision-Language Models. arXiv preprint arXiv:2311.14580 (2023)

  23. Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: NeurIPS (2022)

    Google Scholar 

  24. Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial machine learning at scale. In: ICLR (2017)

    Google Scholar 

  25. Li, H., et al.: Freestyleret: retrieving images from style-diversified queries. arXiv preprint arXiv:2312.02428 (2023)

  26. Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)

    Google Scholar 

  27. Li, L., et al.: Silkie: Preference Distillation for Large Visual Language Models. arXiv preprint arXiv:2312.10665 (2023)

  28. Li, M., Li, L., Yin, Y., Ahmed, M., Liu, Z., Liu, Q.: Red Teaming Visual Language Models. arXiv preprint arXiv:2401.12915 (2024)

  29. Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv preprint arXiv:2311.10122 (2023)

  30. Liu, H., et al.: A Survey on Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2402.00253 (2024)

  31. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual Instruction Tuning (2023)

    Google Scholar 

  32. Liu, M., Roy, S., Li, W., Zhong, Z., Sebe, N., Ricci, E.: Democratizing fine-grained visual recognition with large language models. In: ICLR (2024)

    Google Scholar 

  33. Liu, S., et al.: Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. arXiv preprint arXiv:2212.10789 (2024)

  34. Liu, X., et al.: AgentBench: evaluating LLMs as agents. In: ICLR (2024)

    Google Scholar 

  35. Liu, X., Xu, N., Chen, M., Xiao, C.: Generating stealthy jailbreak prompts on aligned large language models. In: ICLR (2024)

    Google Scholar 

  36. Liu, X., Zhu, Y., Lan, Y., Yang, C., Qiao, Y.: Query-Relevant Images Jailbreak Large Multi-Modal Models (2023)

    Google Scholar 

  37. Liu, X., Zhu, Y., Lan, Y., Yang, C., Qiao, Y.: Safety of Multimodal Large Language Models on Images and Text. arXiv preprint arXiv:2402.00357 (2024)

  38. Lu, X., et al.: Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension. arXiv preprint arXiv:2403.08192 (2024)

  39. Lyu, H., et al.: GPT-4v(ision) as a social media analysis engine. arXiv preprint arXiv:2311.07547 (2023)

  40. Mao, C., Chiquier, M., Wang, H., Yang, J., Vondrick, C.: Adversarial attacks are reversible with natural supervision. In: ICCV (2021)

    Google Scholar 

  41. Meta: Llama usage policy (2023). Accessed 10 2023

    Google Scholar 

  42. Naveed, H., et al.: A Comprehensive Overview of Large Language Models. arXiv preprint arXiv:2307.06435 (2024)

  43. Niu, Z., Ren, H., Gao, X., Hua, G., Jin, R.: Jailbreaking Attack against Multimodal Large Language Model. arXiv preprint arXiv:2402.02309 (2024)

  44. OpenAI: OpenAI usage policy (2023). Accessed 10 2023

    Google Scholar 

  45. Pi, R., et al.: MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance. arXiv preprint arXiv:2401.02906 (2024)

  46. Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P.: Visual Adversarial Examples Jailbreak Aligned Large Language Models. arXiv preprint arXiv:2306.13213 (2023)

  47. Rizwan, N., Bhaskar, P., Das, M., Majhi, S.S., Saha, P., Mukherjee, A.: Zero shot VLMs for hate meme detection: Are we there yet? arXiv preprint arXiv:2402.12198 (2024)

  48. Schlarmann, C., Hein, M.: On the adversarial robustness of multi-modal foundation models. In: ICCV (2023)

    Google Scholar 

  49. Shayegani, E., Dong, Y., Abu-Ghazaleh, N.: Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models. arXiv preprint arXiv:2307.14539 (2023)

  50. Shayegani, E., Mamun, M.A.A., Fu, Y., Zaree, P., Dong, Y., Abu-Ghazaleh, N.: Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv preprint arXiv:2310.10844 (2023)

  51. Sun, Z., et al.: Aligning Large Multimodal Models with Factually Augmented RLHF. arXiv preprint arXiv:2309.14525 (2023)

  52. Wang, B., et al.: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. arXiv preprint arXiv:2306.11698 (2024)

  53. Wang, W., et al.: CogVLM: visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)

  54. Wei, T., et al.: Skywork: A More Open Bilingual Foundation Model. arXiv preprint arXiv:2310.19341 (2023)

  55. Yang, J., Zhang, H., Li, F., Zou, X., Li, C., Gao, J.: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023)

  56. Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)

  57. Ye, Q., et al.: mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. arXiv preprint arXiv:2311.04257 (2023)

  58. Yin, S., et al.: A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549 (2023)

  59. Yin, S., et al.: Woodpecker: Hallucination Correction for Multimodal Large Language Models. arXiv preprint arXiv:2310.16045 (2023)

  60. Yu, T., et al.: RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback. arXiv preprint arXiv:2312.00849 (2023)

  61. Yu, W., et al.: MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv preprint arXiv:2308.02490 (2023)

  62. Zhang, D., et al.: MM-LLMs: Recent Advances in MultiModal Large Language Models. arXiv preprint arXiv:2401.13601 (2024)

  63. Zhang, H., Li, X., Bing, L.: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv preprint arXiv:2306.02858 (2023)

  64. Zhang, R., et al.: LLaMA-Adapter: Efficient Finetuning of Language Models with Zero-init Attention. arXiv preprint arXiv:2303.16199 (2023)

  65. Zhang, X., et al.: A mutation-based method for multi-modal jailbreaking attack detection. arXiv preprint arXiv:2312.10766 (2023)

  66. Zhang, X., Li, R., Yu, J., Xu, Y., Li, W., Zhang, J.: Editguard: versatile image watermarking for tamper localization and copyright protection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11964–11974 (2024)

    Google Scholar 

  67. Zhao, Y., et al.: On evaluating adversarial robustness of large vision-language models. In: NeurIPS (2023)

    Google Scholar 

  68. Zheng, C., et al.: Prompt-Driven LLM Safeguarding via Directed Representation Optimization. arXiv preprint arXiv:2401.18018 (2024)

  69. Zheng, G., Yang, B., Tang, J., Zhou, H.Y., Yang, S.: DDCoT: duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In: NeurIPS (2023)

    Google Scholar 

  70. Zhou, H., et al.: A Survey of Large Language Models in Medicine: Progress, Application, and Challenge. arXiv preprint arXiv:2311.05112 (2023)

  71. Zhu, B., et al.: LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv preprint arXiv:2310.01852 (2023)

  72. Zong, Y., Bohdal, O., Yu, T., Yang, Y., Timothy, H.: Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models. arXiv preprint arXiv:2402.02207 (2024)

Download references

Acknowledgements

We sincerely thank the reviewers for their insightful comments and valuable feedback, which significantly improved the quality of our manuscript.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yu Wang .

Editor information

Editors and Affiliations

Ethics declarations

Disclaimer

This paper contains offensive content that may be disturbing to some readers.

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2123 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Wang, Y., Liu, X., Li, Y., Chen, M., Xiao, C. (2025). AdaShield : Safeguarding Multimodal Large Language Models from Structure-Based Attack via Adaptive Shield Prompting. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72661-3_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72660-6

  • Online ISBN: 978-3-031-72661-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics