The Earth Is Flat and the Sun Is Not a Star: The Susceptibility of GPT-2 to Universal Adversarial Triggers

ABSTRACT
This work considers universal adversarial triggers, a method of adversarially disrupting natural language models, and asks whether such triggers can steer both the topic and the stance of conditional text generation models. Across four "controversial" topics, we identify triggers that cause the GPT-2 model to generate text about a targeted topic and that influence the stance the text takes toward that topic. We show that, while triggers are harder to find for the more fringe topics, those triggers appear to discriminate aspects like stance more effectively. We view this both as an indication of the dangerous potential for controllability and, perhaps, as a reflection of the disconnect between conflicting views on these topics; future work could use this observation to ask whether filter bubbles exist and whether they are reflected within models trained on internet content. By demonstrating the feasibility and ease of such an attack, this work seeks to raise awareness that neural language models are susceptible to this influence, even when the model is already deployed and adversaries lack internal model access, and advocates immediate safeguarding against this type of adversarial attack to prevent potential harm to human users.
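Universal adversarial triggers are typically found with a gradient-guided token search (Wallace et al., 2019): a short sequence of trigger tokens is iteratively improved by swapping each position for the vocabulary token whose embedding most decreases a target loss under a first-order (HotFlip-style) approximation. The sketch below illustrates only that swap step on a toy linear "topic scorer"; the vocabulary, scorer, and every name in it are illustrative assumptions, not the paper's model or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions): a small vocabulary of embedding
# vectors and a linear "topic scorer" whose output the trigger should push up.
vocab_size, dim, trigger_len = 50, 8, 3
embedding = rng.normal(size=(vocab_size, dim))
topic_direction = rng.normal(size=dim)  # stand-in for the target-topic signal

def loss(trigger_ids):
    # Negative topic score of the averaged trigger embeddings: lower loss
    # means the trigger pushes generation harder toward the target topic.
    return -embedding[trigger_ids].mean(axis=0) @ topic_direction

def hotflip_step(trigger_ids):
    # Gradient of the loss w.r.t. each trigger embedding (exact here,
    # because the toy scorer is linear in the embeddings).
    grad = -topic_direction / len(trigger_ids)
    new_ids = list(trigger_ids)
    for i in range(len(new_ids)):
        # First-order HotFlip: approximate the loss change of swapping
        # position i to candidate e' as (e' - e_i) @ grad, take the best swap.
        delta = (embedding - embedding[new_ids[i]]) @ grad
        new_ids[i] = int(np.argmin(delta))
    return new_ids

trigger = [0, 1, 2]
for _ in range(5):
    trigger = hotflip_step(trigger)
```

Against a real language model the loss would instead be the cross-entropy of target-topic continuations under GPT-2, the gradient would come from backpropagation, and the swap would be re-scored exactly rather than trusted from the linear approximation; the greedy per-position structure, however, is the same.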