Abstract
The increasing adoption of deep learning for downstream natural language processing (NLP) tasks has created a need for models that can reliably assess linguistic acceptability. The CoLA corpus was created to support the development of models that accurately judge grammatical acceptability and thereby evaluate linguistic proficiency. Transformer models, widely used across NLP tasks including linguistic acceptability judgment, have limitations that undermine their perceived robustness. In particular, they are susceptible to adversarial text attacks: inconspicuous modifications to the original input, chosen so that the resulting adversarial examples are still correctly classified by human observers yet mislead the targeted model, thereby compromising its reliability. This paper presents ‘Homograph’, a novel framework for generating adversarial text in a black-box setting. The attack's effectiveness against linguistic acceptability models rests on its ability to generate visually similar adversarial examples that preserve the grammatical acceptability of the original inputs while causing the model to change its predicted label. For the linguistic acceptability task, we applied the attack to five transformer models fine-tuned on the CoLA dataset: ALBERT, BERT, DistilBERT, RoBERTa, and XLNet. Our work distinguishes itself from existing text-based attacks through three contributions. First, we surpass previous baselines in attack success rate (\(ASR\)) and average perturbation rate (\(APR\)) on models trained on the CoLA dataset. Second, we generate more potent adversarial examples whose modifications are imperceptible, thereby preserving the original label. Third, we employ a straightforward character-level transformation that produces adversarial examples closely resembling the original text.
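To make the mechanism concrete, the following is a minimal Python sketch of a homograph-style character substitution attack in a black-box setting. It is an illustrative reconstruction, not the authors' exact method: the homoglyph map is a small hypothetical subset, the greedy left-to-right search is an assumed heuristic, and the victim model is a stand-in callable. ASR is commonly computed as the fraction of inputs whose predicted label is flipped, and APR as the average fraction of perturbed tokens among successful attacks.

```python
# Illustrative sketch of a homograph-style character substitution attack
# in a black-box setting. NOT the paper's exact algorithm: the homoglyph
# map, the greedy left-to-right search, and the toy victim model below
# are all assumptions made for demonstration.

HOMOGLYPHS = {              # Latin letter -> visually similar look-alike
    "a": "\u0430",          # Cyrillic small a
    "c": "\u0441",          # Cyrillic small es
    "e": "\u0435",          # Cyrillic small ie
    "o": "\u043e",          # Cyrillic small o
    "p": "\u0440",          # Cyrillic small er
}


def perturb_word(word: str) -> str:
    """Swap the first mappable character for its visual homoglyph."""
    for i, ch in enumerate(word):
        if ch.lower() in HOMOGLYPHS:
            return word[:i] + HOMOGLYPHS[ch.lower()] + word[i + 1:]
    return word


def homograph_attack(text: str, predict):
    """Greedily perturb words until the black-box prediction flips.

    `predict` is any callable mapping a string to a label; only its
    outputs are observed (black-box access). Returns the adversarial
    text and the perturbation rate, or None if the attack fails.
    """
    original_label = predict(text)
    words = text.split()
    changed = 0
    for i, word in enumerate(words):
        new_word = perturb_word(word)
        if new_word == word:
            continue
        words[i] = new_word
        changed += 1
        candidate = " ".join(words)
        if predict(candidate) != original_label:
            return candidate, changed / len(words)  # label flipped
    return None  # every mappable character swapped, label unchanged


if __name__ == "__main__":
    # Toy victim classifier: "acceptable" (1) iff the text is pure ASCII.
    toy_model = lambda s: int(s.isascii())
    result = homograph_attack("The cat sat on the mat", toy_model)
    if result is not None:
        adversarial, apr = result
        print(f"adversarial: {adversarial!r}  (perturbation rate {apr:.2f})")
```

In the paper's actual setting, `predict` would be a transformer acceptability classifier (e.g., BERT fine-tuned on CoLA) queried only through its output labels; the sketch above merely illustrates how visually indistinguishable character substitutions can flip a model's decision without altering the sentence's grammatical acceptability for a human reader.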
Data availability
The data used to support the conclusions of this study are available upon request from the corresponding author.
Author information
Contributions
Sajal Aggarwal, Ashish Bajaj: Software, Validation, Investigation, Data Curation, Writing – Original Draft, Visualization, Conceptualization, Methodology. Dinesh Kumar Vishwakarma: Formal Analysis, Resources, Writing – Review & Editing, Supervision.
Ethics declarations
Conflict of interest
The authors declare that they have no competing financial interests or personal affiliations that could have influenced the findings presented in this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Aggarwal, S., Bajaj, A. & Vishwakarma, D.K. HOMOGRAPH: a novel textual adversarial attack architecture to unmask the susceptibility of linguistic acceptability classifiers. Int. J. Inf. Secur. 24, 6 (2025). https://doi.org/10.1007/s10207-024-00925-w