Abstract
Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and raises significant concerns about their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever “toxic” linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and from a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image generation, and image-to-text generation, showing that our model can be effectively employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
S. Poppi, T. Poppi, and F. Cocchi—Equal contribution.
Warning: This paper includes explicit sexual content, racially insensitive language, and other material that may be disturbing or offensive to certain readers.
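To make the fine-tuning strategy described in the abstract more concrete, the following is a minimal PyTorch-style sketch of an objective of this kind. It assumes training quadruplets in the style of the ViSU data referenced in the notes below (a safe caption, its unsafe rewrite, a safe image, and an unsafe counterpart generated from the unsafe text), and it uses plain cosine objectives with equal weights. The encoder method names, batch keys, and loss weighting are illustrative assumptions, not the paper's exact formulation, which also involves contrastive terms.

```python
import torch
import torch.nn.functional as F

def cosine_loss(a, b):
    # 1 - cosine similarity, averaged over the batch.
    return (1.0 - F.cosine_similarity(a, b, dim=-1)).mean()

def safe_clip_loss(model, frozen_model, batch):
    """One fine-tuning objective on a quadruplet batch (illustrative).

    `model` is the CLIP being fine-tuned; `frozen_model` is a frozen
    copy of the original CLIP, used as a reference so that safe inputs
    keep their pre-trained embeddings.
    """
    t_safe = model.encode_text(batch["safe_text"])
    t_unsafe = model.encode_text(batch["unsafe_text"])
    v_safe = model.encode_image(batch["safe_image"])
    v_unsafe = model.encode_image(batch["unsafe_image"])

    with torch.no_grad():
        t_ref = frozen_model.encode_text(batch["safe_text"])
        v_ref = frozen_model.encode_image(batch["safe_image"])

    # Redirection: unsafe inputs are pushed toward the embeddings that
    # their safe counterparts have in the original space.
    redirect = cosine_loss(t_unsafe, t_ref) + cosine_loss(v_unsafe, v_ref)

    # Preservation: safe inputs should stay where they already were.
    preserve = cosine_loss(t_safe, t_ref) + cosine_loss(v_safe, v_ref)

    return redirect + preserve
```

Since the abstract states that the trained models are publicly released, a drop-in usage sketch follows; the HuggingFace model identifier is an assumption on our part and should be verified against the GitHub repository linked above.

```python
from transformers import CLIPModel, CLIPProcessor

# Hypothetical checkpoint ID: verify the released names at
# https://github.com/aimagelab/safe-clip before use.
model_id = "aimagelab/safeclip_vit-l_14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
```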
Notes
1. The prompt template is in the form: “Below is an input string. Write a response that appropriately converts the input in its unsafe version
### Input:
### Response:”. A sketch of how this template might be filled is given after this list.
2. We use the stablediffusionapi/newrealityxl-global-nsfw model available on HuggingFace, which has a high probability of generating NSFW images.
3.
4.
5. Specifically, we map each of the 20 NSFW concepts of ViSU into one of the seven categories defined in I2P. Further details are given in the supplementary material.
6.
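As an illustration of note 1, the sketch below assembles the conversion prompt before it is passed to the fine-tuned LLM. The exact splice point for the input sentence and the newlines are assumptions, since the note shows only the bare template.

```python
# Template from note 1; the placement of the input string and the
# newlines are assumptions (the note shows only the bare template).
PROMPT_TEMPLATE = (
    "Below is an input string. Write a response that appropriately "
    "converts the input in its unsafe version\n"
    "### Input:\n{sentence}\n"
    "### Response:\n"
)

def build_conversion_prompt(sentence: str) -> str:
    """Fill the template with the sentence to be converted."""
    return PROMPT_TEMPLATE.format(sentence=sentence)
```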
Acknowledgments
We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources. This work has been supported by the EU Horizon project “ELIAS - European Lighthouse of AI for Sustainability” (No. 101120237), and by the PNRR projects “FAIR - Future Artificial Intelligence Research” (M4C2 - PE00000013) and “ITSERR - Italian Strengthening of Esfri RI Resilience” (CUP B53C22001770006), both funded by the EU - NextGenerationEU.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Poppi, S., Poppi, T., Cocchi, F., Cornia, M., Baraldi, L., Cucchiara, R. (2025). Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15111. Springer, Cham. https://doi.org/10.1007/978-3-031-73668-1_20