Research article · DOI: 10.1145/3627673.3679639

SGFL-Attack: A Similarity-Guidance Strategy for Hard-Label Textual Adversarial Attack Based on Feedback Learning

Published: 21 October 2024

Abstract

Hard-label black-box textual adversarial attacks present a challenging task in which only the predictions of the victim model are available. Several constraints further complicate such attacks: the inherently discrete and non-differentiable nature of text data, and the need to introduce subtle perturbations that remain imperceptible to humans while preserving semantic similarity. Despite considerable research effort, existing methods still suffer from several limitations. Algorithms based on complex heuristic searches require extensive querying, rendering them computationally expensive; introducing continuous gradient strategies into discrete text spaces often leads to estimation errors; and geometry-based strategies are prone to falling into local optima. To address these limitations, this paper introduces SGFL-Attack, a novel approach that leverages a Similarity-Guidance strategy based on Feedback Learning for hard-label textual adversarial attacks under a limited query budget. Specifically, SGFL-Attack uses word embedding vectors to assess the importance of words and positions in text sequences, and employs a feedback learning mechanism that assigns reward or punishment based on changes in the predicted label caused by replacing words. In each iteration, SGFL-Attack guides the search using the knowledge acquired from this feedback mechanism, generating more similar samples while keeping perturbations low. Moreover, to reduce the query budget, we incorporate local hash mapping to avoid redundant queries during the search. Extensive experiments on seven widely used datasets show that SGFL-Attack significantly outperforms state-of-the-art baselines and defenses across multiple language models.
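To make the search loop described above concrete, the following is a minimal, hypothetical Python sketch of a feedback-guided word-substitution attack with a local hash cache. It is a reading of the abstract only, not the authors' implementation: the helpers `query_victim`, `embed`, and `synonym_candidates`, the reward/penalty constants, and the centroid-distance importance score are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict

REWARD, PENALTY = 1.0, -1.0   # illustrative feedback magnitudes, not from the paper

def text_hash(tokens):
    """Hash a candidate token sequence so it is never sent to the victim twice."""
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()

def sgfl_sketch(tokens, true_label, query_victim, embed, synonym_candidates,
                max_queries=1000):
    """One pass of feedback-guided substitution under a hard query budget."""
    feedback = defaultdict(float)   # (position, word) -> accumulated feedback
    cache = {}                      # local hash map: candidate hash -> predicted label
    queries = 0

    def predict(candidate):
        nonlocal queries
        key = text_hash(candidate)
        if key not in cache:        # skip redundant queries to the victim model
            cache[key] = query_victim(candidate)
            queries += 1
        return cache[key]

    # Rank positions by an embedding-based importance score; here, the distance
    # of each word vector from the sentence centroid (an assumed stand-in for
    # the paper's word/position importance measure).
    vecs = [embed(w) for w in tokens]
    centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
    by_importance = sorted(
        range(len(tokens)),
        key=lambda i: -sum((a - b) ** 2 for a, b in zip(vecs[i], centroid)),
    )

    for pos in by_importance:
        if queries >= max_queries:
            break
        # Try substitutions, preferring those with the best accumulated feedback.
        for sub in sorted(synonym_candidates(tokens[pos]),
                          key=lambda w: -feedback[(pos, w)]):
            trial = tokens[:pos] + [sub] + tokens[pos + 1:]
            if predict(trial) != true_label:     # label flipped: reward and stop
                feedback[(pos, sub)] += REWARD
                return trial, queries
            feedback[(pos, sub)] += PENALTY      # no flip: punish this choice
    return None, queries                         # no adversarial example in budget
```

In this reading, the hash cache is the query-saving device: any candidate sequence already sent to the victim model is answered from the local table, so the feedback loop can revisit promising substitutions without spending budget.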


Published In

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, October 2024, 5705 pages. ISBN 9798400704369. DOI: 10.1145/3627673.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 1,861 of 8,427 submissions, 22%
