DOI: 10.1145/3461702.3462578
Research article, AIES Conference Proceedings

The Earth Is Flat and the Sun Is Not a Star: The Susceptibility of GPT-2 to Universal Adversarial Triggers

Published: 30 July 2021

ABSTRACT

This work considers universal adversarial triggers, a method of adversarially disrupting natural language models, and asks whether such triggers can be used to affect both the topic and the stance of conditional text generation models. Across four "controversial" topics, we demonstrate success at identifying triggers that cause the GPT-2 model to produce text about targeted topics, as well as to influence the stance the text takes towards those topics. We show that, while triggers are harder to identify for the more fringe topics, those triggers appear to discriminate aspects like stance more effectively. We view this both as an indication of a dangerous potential for controllability and, perhaps, as a reflection of the disconnect between conflicting views on these topics, something future work could use to examine whether filter bubbles are reflected within models trained on internet content. In demonstrating the feasibility and ease of such an attack, this work seeks to raise awareness that neural language models are susceptible to this influence, even if the model is already deployed and adversaries lack internal model access, and advocates immediate safeguarding against this type of adversarial attack in order to prevent potential harm to human users.
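The abstract does not detail the trigger-search procedure itself. The underlying technique, universal adversarial triggers (Wallace et al., EMNLP-IJCNLP 2019), performs a gradient-guided, HotFlip-style search for a short token sequence that raises the likelihood of target text when prepended to an input. The sketch below illustrates that style of search against GPT-2 using the HuggingFace transformers library; the target sentence, trigger length, and iteration count are illustrative assumptions, not settings taken from this paper.

```python
# Illustrative HotFlip-style universal trigger search against GPT-2,
# in the spirit of Wallace et al. (2019). Target text and hyperparameters
# are hypothetical, not the exact procedure or settings from this paper.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
for p in model.parameters():          # we only need gradients w.r.t. the trigger
    p.requires_grad_(False)

embedding_matrix = model.get_input_embeddings().weight  # (vocab_size, dim)

# Text whose likelihood the trigger should raise (hypothetical example).
target_ids = tokenizer.encode(" The earth is flat.", return_tensors="pt")
# Start from an arbitrary 3-token trigger.
trigger_ids = tokenizer.encode(" the the the", return_tensors="pt")

for step in range(10):                # a few rounds of single-token replacement
    trigger_embeds = embedding_matrix[trigger_ids].clone().requires_grad_(True)
    inputs_embeds = torch.cat([trigger_embeds, embedding_matrix[target_ids]], dim=1)

    # Language-modeling loss on the target tokens only (-100 masks the trigger).
    labels = torch.cat([torch.full_like(trigger_ids, -100), target_ids], dim=1)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()

    with torch.no_grad():
        # HotFlip: first-order estimate of the loss change from swapping in any
        # vocabulary token at any trigger position, grad . (e_new - e_old).
        grad = trigger_embeds.grad[0]                         # (k, dim)
        scores = grad @ embedding_matrix.T                    # (k, vocab)
        current = scores.gather(1, trigger_ids.T).squeeze(1)  # scores of current tokens
        best_scores, best_tokens = scores.min(dim=1)
        gains = best_scores - current                         # negative = predicted loss drop
        pos = int(gains.argmin())                             # best single swap
        trigger_ids[0, pos] = best_tokens[pos]

    print(f"step {step}: loss={loss.item():.3f}, "
          f"trigger={tokenizer.decode(trigger_ids[0])!r}")
```

Wallace et al. additionally use beam search over the top candidate swaps and re-evaluate the true loss before committing to one; the sketch above greedily takes the single best first-order swap per step for brevity.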


Published in

AIES '21: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society
July 2021, 1077 pages
ISBN: 9781450384735
DOI: 10.1145/3461702
Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall Acceptance Rate: 61 of 162 submissions, 38%
