Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever “toxic” linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
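
As a concrete illustration, the snippet below sketches how the released encoders could be probed in the cross-modal retrieval setting the abstract mentions. It is a minimal sketch, not the authors' code: the checkpoint identifier aimagelab/safeclip_vit-l_14 is an assumption based on the repository linked above, and the standard transformers CLIP interface is used rather than anything prescribed by the paper.

```python
# Hedged sketch: scoring image-text similarity with Safe-CLIP encoders.
# The checkpoint id is an assumption; verify it against
# https://github.com/aimagelab/safe-clip before use.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "aimagelab/safeclip_vit-l_14"  # assumption: id from the repo
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
texts = ["a safe candidate caption", "another candidate caption"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine-similarity logits between the image and each caption; with the
# fine-tuned weights, NSFW phrasings should no longer align with unsafe
# regions of the embedding space.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```

Since only the CLIP weights change, the same encoders can in principle be swapped into pre-trained text-to-image or image-to-text generators, which is how the paper pairs Safe-CLIP with existing generative models.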

S. Poppi, T. Poppi, and F. Cocchi—Equal contribution.

Warning: This paper includes explicit sexual content, racially insensitive language, and other material that may be disturbing or offensive to certain readers.


Notes

  1. The prompt template is in the form: “Below is an input string. Write a response that appropriately converts the input in its unsafe version ### Input: ### Response:”. (A minimal sketch of how this template can be assembled appears after these notes.)

  2. We use the stablediffusionapi/newrealityxl-global-nsfw model available on HuggingFace, which has a high probability of generating NSFW images (see the second sketch after these notes).

  3. https://github.com/conversationai/perspectiveapi.

  4. https://github.com/EBazarov/nsfw_data_source_urls.

  5. Specifically, we map each of the 20 NSFW concepts of ViSU into one of the seven categories defined in I2P. Further details are given in the supplementary material.

  6. https://huggingface.co/datasets/aimagelab/ViSU-Text.
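
As referenced in Note 1, the sketch below shows one way the safe-to-unsafe conversion prompt could be assembled. Only the template string comes from the paper; the helper name and where the input sentence lands in the template are our own reading.

```python
# Minimal sketch of assembling the conversion prompt from Note 1. The
# placement of the input sentence after "### Input:" is an assumption.

PROMPT_TEMPLATE = (
    "Below is an input string. Write a response that appropriately "
    "converts the input in its unsafe version "
    "### Input: {sentence} "
    "### Response:"
)

def build_conversion_prompt(sentence: str) -> str:
    """Fill the safe-to-unsafe conversion template with a source caption."""
    return PROMPT_TEMPLATE.format(sentence=sentence)

# The fine-tuned LLM would complete this prompt with the converted caption.
print(build_conversion_prompt("A couple holding hands on the beach."))
```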
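For Note 2, a hedged sketch of the synthetic image-generation step follows. The paper names only the HuggingFace checkpoint, so the diffusers AutoPipelineForText2Image class, the dtype, and the device placement here are our assumptions.

```python
# Hedged sketch: generating synthetic training images from converted
# captions, as in Note 2. Only the checkpoint name comes from the paper.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stablediffusionapi/newrealityxl-global-nsfw",
    torch_dtype=torch.float16,
).to("cuda")

# In the paper's pipeline the prompt would be an unsafe caption produced by
# the safe-to-unsafe conversion LLM; a placeholder stands in here.
unsafe_caption = "<caption produced by the conversion LLM>"
image = pipe(unsafe_caption).images[0]
image.save("synthetic_sample.png")
```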


Acknowledgments

We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources. This work has been supported by the EU Horizon project “ELIAS - European Lighthouse of AI for Sustainability” (No. 101120237), and by the PNRR projects “FAIR - Future Artificial Intelligence Research” (M4C2 - PE00000013) and “ITSERR - Italian Strengthening of Esfri RI Resilience” (CUP B53C22001770006), both funded by the EU - NextGenerationEU.

Author information


Corresponding author

Correspondence to Samuele Poppi.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 17,644 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Poppi, S., Poppi, T., Cocchi, F., Cornia, M., Baraldi, L., Cucchiara, R. (2025). Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15111. Springer, Cham. https://doi.org/10.1007/978-3-031-73668-1_20

  • DOI: https://doi.org/10.1007/978-3-031-73668-1_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73667-4

  • Online ISBN: 978-3-031-73668-1

  • eBook Packages: Computer Science, Computer Science (R0)
