Abstract
Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and raises significant concerns about their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever “toxic” linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and from a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image generation, and image-to-text generation, showing that our model can be effectively employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
S. Poppi, T. Poppi, and F. Cocchi—Equal contribution.
Warning: This paper includes explicit sexual content, racially insensitive language, and other material that may be disturbing or offensive to certain readers.
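To make the fine-tuning strategy described in the abstract more concrete, the following is a minimal PyTorch-style sketch of an objective of this kind. It assumes training quadruplets in the style of the ViSU data referenced in the notes below (a safe caption, its unsafe rewrite, a safe image, and an unsafe counterpart generated from the unsafe text), and it uses plain cosine objectives with equal weights. The encoder method names, batch keys, and loss weighting are illustrative assumptions, not the paper's exact formulation, which also involves contrastive terms.

```python
import torch
import torch.nn.functional as F

def cosine_loss(a, b):
    # 1 - cosine similarity, averaged over the batch.
    return (1.0 - F.cosine_similarity(a, b, dim=-1)).mean()

def safe_clip_loss(model, frozen_model, batch):
    """One fine-tuning objective on a quadruplet batch (illustrative).

    `model` is the CLIP being fine-tuned; `frozen_model` is a frozen
    copy of the original CLIP, used as a reference so that safe inputs
    keep their pre-trained embeddings.
    """
    t_safe = model.encode_text(batch["safe_text"])
    t_unsafe = model.encode_text(batch["unsafe_text"])
    v_safe = model.encode_image(batch["safe_image"])
    v_unsafe = model.encode_image(batch["unsafe_image"])

    with torch.no_grad():
        t_ref = frozen_model.encode_text(batch["safe_text"])
        v_ref = frozen_model.encode_image(batch["safe_image"])

    # Redirection: unsafe inputs are pushed toward the embeddings that
    # their safe counterparts have in the original space.
    redirect = cosine_loss(t_unsafe, t_ref) + cosine_loss(v_unsafe, v_ref)

    # Preservation: safe inputs should stay where they already were.
    preserve = cosine_loss(t_safe, t_ref) + cosine_loss(v_safe, v_ref)

    return redirect + preserve
```

Since the abstract states that the trained models are publicly released, a drop-in usage sketch follows; the HuggingFace model identifier is an assumption on our part and should be verified against the GitHub repository linked above.

```python
from transformers import CLIPModel, CLIPProcessor

# Hypothetical checkpoint ID: verify the released names at
# https://github.com/aimagelab/safe-clip before use.
model_id = "aimagelab/safeclip_vit-l_14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
```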
Notes
1. The prompt template is in the form: “Below is an input string. Write a response that appropriately converts the input in its unsafe version
### Input:
### Response:”. A sketch of how this template might be filled is given after this list.
2. We use the stablediffusionapi/newrealityxl-global-nsfw model available on HuggingFace, which has a high probability of generating NSFW images.
3.
4.
5. Specifically, we map each of the 20 NSFW concepts of ViSU into one of the seven categories defined in I2P. Further details are given in the supplementary material.
6.
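As an illustration of note 1, the sketch below assembles the conversion prompt before it is passed to the fine-tuned LLM. The exact splice point for the input sentence and the newlines are assumptions, since the note shows only the bare template.

```python
# Template from note 1; the placement of the input string and the
# newlines are assumptions (the note shows only the bare template).
PROMPT_TEMPLATE = (
    "Below is an input string. Write a response that appropriately "
    "converts the input in its unsafe version\n"
    "### Input:\n{sentence}\n"
    "### Response:\n"
)

def build_conversion_prompt(sentence: str) -> str:
    """Fill the template with the sentence to be converted."""
    return PROMPT_TEMPLATE.format(sentence=sentence)
```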
Acknowledgments
We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources. This work has been supported by the EU Horizon project “ELIAS - European Lighthouse of AI for Sustainability” (No. 101120237), and by the PNRR projects “FAIR - Future Artificial Intelligence Research” (M4C2 - PE00000013) and “ITSERR - Italian Strengthening of Esfri RI Resilience” (CUP B53C22001770006), both funded by the EU - NextGenerationEU.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Poppi, S., Poppi, T., Cocchi, F., Cornia, M., Baraldi, L., Cucchiara, R. (2025). Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15111. Springer, Cham. https://doi.org/10.1007/978-3-031-73668-1_20