Abstract
Multimodal large language models (MLLMs) have shown impressive reasoning abilities. However, they are also more vulnerable to jailbreak attacks than their LLM predecessors. We observe that although MLLMs can still detect unsafe responses, the safety mechanisms of their pre-aligned LLMs are easily bypassed once image features are introduced. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protection approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into text, thereby activating the intrinsic safety mechanism of the pre-aligned LLMs inside MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO significantly improves model safety (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe with LLaVA-1.5-7B) while consistently maintaining utility on common MLLM benchmarks. Furthermore, we show that ECSO can serve as a data engine to generate supervised fine-tuning (SFT) data for MLLM alignment without extra human intervention.
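The following is a minimal Python sketch of the response-generation loop the abstract describes, assuming a hypothetical `mllm.chat(image=..., prompt=...)` interface; the prompts and the self-check step are illustrative paraphrases of the idea (answer, self-assess safety, and if unsafe, caption the image and re-answer text-only), not the paper's exact implementation.

```python
def ecso_respond(mllm, image, query):
    """Hedged sketch of the ECSO idea; `mllm.chat` is a hypothetical chat API."""
    # Step 1: answer normally with the image ("eyes open").
    answer = mllm.chat(image=image, prompt=query)

    # Step 2: let the MLLM judge its own answer (its inherent safety awareness).
    verdict = mllm.chat(
        image=None,
        prompt=(
            "Is the following response harmful, unsafe, or unethical? "
            f"Answer yes or no.\nResponse: {answer}"
        ),
    )
    if "yes" not in verdict.lower():
        return answer  # Deemed safe: keep the original multimodal answer.

    # Step 3: "close the eyes" -- turn the image into a query-aware caption.
    caption = mllm.chat(
        image=image,
        prompt=f"Describe the image content relevant to this request: {query}",
    )

    # Step 4: re-answer with text only, so the pre-aligned LLM's safety
    # mechanism can act on the now-textual content.
    return mllm.chat(
        image=None,
        prompt=f"Image description: {caption}\nRequest: {query}",
    )
```

In this sketch, the unsafe image never reaches the final answering step directly; only its textual description does, which is what allows the text-aligned LLM safety behaviour to take effect.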
Y. Gou, K. Chen and Z. Liu—Equal contribution.
Notes
- 1.
- 2. A more detailed description of the dataset can be found in Appendix A.3.
Acknowledgement
We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research. This work was partially supported by NSFC key grant 62136005, NSFC general grant 62076118, Shenzhen fundamental research program JCYJ20210324105000003, the Research Grants Council of the Hong Kong Special Administrative Region (Grants C7004-22G-1 and 16202523), and the Research Grants Council of Hong Kong through the Research Impact Fund project R6003-21.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gou, Y. et al. (2025). Eyes Closed, Safety on: Protecting Multimodal LLMs via Image-to-Text Transformation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15075. Springer, Cham. https://doi.org/10.1007/978-3-031-72643-9_23
DOI: https://doi.org/10.1007/978-3-031-72643-9_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72642-2
Online ISBN: 978-3-031-72643-9
eBook Packages: Computer Science, Computer Science (R0)