Abstract
Prompt-based learning has proven to be an effective way of adapting pre-trained language models (PLMs), especially in low-resource scenarios such as few-shot settings. However, the trustworthiness of PLMs is of paramount importance, and vulnerabilities have been shown in prompt-based templates that can mislead the predictions of language models, raising serious security concerns. In this paper, we shed light on some vulnerabilities of PLMs by proposing a prompt-based adversarial attack on manual templates in black-box scenarios. First, we design character-level and word-level heuristic approaches that break manual templates separately. We then present a greedy attack algorithm that builds on these destructive heuristics and further combines them with negative words. Finally, we evaluate our approach on classification tasks across three variants of BERT-series models and eight datasets. Comprehensive experimental results confirm the effectiveness of our approach in terms of attack success rate and attack speed: on average, it achieves an attack success rate of close to 90% with around 3000 queries, significantly better than the compared baseline methods. Further experiments indicate that the proposed method also performs well under varying shot counts, template lengths and query budgets, exhibiting good generalizability.
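The pipeline sketched in the abstract (character- and word-level perturbations of a manual template, searched greedily under a black-box query budget, with negative words mixed in) can be illustrated roughly as follows. Everything here is an assumption for illustration — the function names, the `NEGATIVE_WORDS` list, and the `score_fn` query interface are hypothetical stand-ins, not the paper's actual implementation:

```python
import random
import string

# Illustrative list of negative words; the paper's actual list is not shown here.
NEGATIVE_WORDS = ["not", "never", "hardly"]

def char_perturb(word: str, rng: random.Random) -> str:
    """Character-level heuristic (assumed): swap, delete, or insert one character."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "delete", "insert"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]

def word_perturb(word: str, rng: random.Random) -> str:
    """Word-level heuristic (assumed): replace the template slot with a negative word."""
    return rng.choice(NEGATIVE_WORDS)

def greedy_attack(template_words, score_fn, rng=None, trials_per_slot=5):
    """Greedily keep any perturbation of a template slot that lowers the victim
    model's score. Black-box: only score_fn queries are used, and the query
    count is tracked, mirroring the ASR/query-count metrics in the abstract."""
    rng = rng or random.Random(0)
    words = list(template_words)
    best_score = score_fn(words)
    queries = 1
    for i in range(len(words)):
        for perturb in (char_perturb, word_perturb):
            for _ in range(trials_per_slot):
                candidate = list(words)
                candidate[i] = perturb(words[i], rng)
                s = score_fn(candidate)
                queries += 1
                if s < best_score:
                    best_score, words = s, candidate
    return words, best_score, queries
```

In a real attack, `score_fn` would query the victim PLM and return its confidence in the correct label for the prompted input; here it is left abstract so the greedy search structure is visible.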








Availability of data and materials
The datasets used in the experiments are all publicly available at the following links: SST2: https://paperswithcode.com/dataset/sst; IMDB: https://ai.stanford.edu/~amaas/data/sentiment; Jigsaw2018, HSOL, Amazon-LB, CGFake, Enron and SpamAssassin: https://github.com/thunlp/Advbench.
Acknowledgements
This research was supported by Guangdong Provincial Key Areas R&D Plan Project (No. 2022B0101010005), the National Natural Science Foundation of China under Grant 62071201, Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515010119 and Qinghai Provincial Science and Technology Plan Project (No. 2021-QY-206).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Templates in experiments
The templates used in our experiments are shown in Table 11. We designed four templates for each dataset and averaged the results across them to obtain the final results. We applied the same strategy when studying the effect of template length on attack effectiveness; the templates used for that study are presented in Table 12.
Appendix B Average values of main results
To supplement the results in Table 6, we averaged the ASR and query count of the different methods for each PLM over all datasets; these averages are recorded in Table 13.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tan, Z., Chen, Q., Zhu, W. et al. Exploring the vulnerability of black-box adversarial attack on prompt-based learning in language models. Neural Comput & Applic 37, 1457–1473 (2025). https://doi.org/10.1007/s00521-024-10669-2